
Sunday, July 29, 2007

The robots.txt file and search engine optimization



On how to tell the search engine spiders and crawlers which directories and files to include, and which to avoid.

Search engines find your web pages and files by sending out robots (also called bots, spiders or crawlers) that follow the links found on your site, read the pages they find and store the content in the search engine databases.

Dan Crow of Google puts it this way: “Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot “crawls” the web.”

But you may have directories and files you would prefer the search engine robots not to index. You may, for instance, have different versions of the same text, and you would like to tell the search engines which is the authoritative one (see: How to avoid duplicate content in search engine promotion).

How do you stop the robots?

The robots.txt file

If you are serious about search engine optimization, you should make use of the Robots Exclusion Standard by adding a robots.txt file to the root of your domain.

By using the robots.txt file you can tell the search engines what directories and files they should spider and include in their search results, and what directories and files to avoid.

This file must be uploaded to the root directory of your site, not to a subdirectory. Hence Pandia’s robots.txt file is found at http://www.pandia.com/robots.txt.
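You can look at any site’s robots.txt simply by fetching it from the root of the domain, like any other file. Here is a minimal Python sketch, using the Pandia address above as the example:

from urllib.request import urlopen

# The robots.txt file always lives at the root of the domain
url = "http://www.pandia.com/robots.txt"
print(urlopen(url).read().decode("ascii", errors="replace"))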

Plain ASCII please!

robots.txt should be a plain ASCII text file.

Use a plain text editor or the code view of an HTML editor to write it, not a word processor like Word.

Pandia’s robots.txt file gives a good example of an uncomplicated file of this type:

User-agent: *
Disallow: /ads/
Disallow: /banners/
Disallow: /cgi-local/
Disallow: /cgi-script/
Disallow: /graphics/

The first line tells which robots should follow the “commands” given below it. In this case the commands are for all search engines.
The next lines tell the robots which Pandia directories to avoid (disallow).
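To see how a well-behaved robot reads these rules, here is a small sketch using Python’s standard urllib.robotparser module, with the Pandia rules above pasted in (the page URLs are made up for the example):

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /ads/",
    "Disallow: /banners/",
    "Disallow: /cgi-local/",
    "Disallow: /cgi-script/",
    "Disallow: /graphics/",
]

rp = RobotFileParser()
rp.parse(rules)

# A file inside a disallowed directory is off limits...
print(rp.can_fetch("*", "http://www.pandia.com/ads/top.html"))   # False
# ...while the rest of the site may be crawled.
print(rp.can_fetch("*", "http://www.pandia.com/index.html"))     # True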

Let’s take a closer look at the syntax for disallowing directories and files.

Blocking an entire site

To block the entire site, you disallow a single forward slash, like this:

Disallow: /

This is not a procedure we recommend! If you really want to keep search engine spiders out of your site, you should password protect it instead, as search engines have been known to ignore robots.txt files from time to time.
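For the record, a complete robots.txt file that asks all robots to stay away from the entire site is just these two lines:

User-agent: *
Disallow: /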

Blocking directories

To block a directory and all its files, put a slash in front of and after the directory name.

Disallow: /images/
Disallow: /private/photos/
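Note that everything below a blocked directory is blocked as well, subdirectories included. A quick check with Python’s urllib.robotparser (the file names are made up):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/photos/"])

print(rp.can_fetch("*", "/private/photos/2007/party.jpg"))  # False - nested files are blocked too
print(rp.can_fetch("*", "/private/letters.html"))           # True - only /private/photos/ is blocked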

Blocking single files

To stop the search engine(s) from including one file, write the file name after a slash, like this:

Disallow: /private_file.html

If the file is found in a subdirectory, use the following syntax:

Disallow: /private/conflict.html

Note that there are no trailing slashes in these instances.

Note also that the URLs are case sensitive. /letters/ToMum.html is not the same as /letters/tomum.html!
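A quick way to convince yourself of this is to test the rules with urllib.robotparser again, using the file names from the example above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /letters/ToMum.html"])

print(rp.can_fetch("*", "/letters/ToMum.html"))  # False - the exact file is blocked
print(rp.can_fetch("*", "/letters/tomum.html"))  # True - a different capitalization is not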

Identifying robots

The first line User-agent: * says that the following lines are for all robots.

You may also make different rules for different robots, like this:

User-agent: Googlebot
Disallow: /graphics/

Most web sites do not need to identify the different robots or crawlers in this way.

These are the names of the most common “bots”:
Googlebot (for Google web search)
Slurp (for Yahoo! web search)
msnbot (for Live Search web search)
Teoma (for Ask web search)
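As an illustration of how a robot picks the group of rules meant for it, here is a hypothetical rule set checked with Python’s urllib.robotparser:

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /graphics/",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Googlebot follows its own group of rules and ignores the general one
print(rp.can_fetch("Googlebot", "/graphics/logo.gif"))   # False
print(rp.can_fetch("Googlebot", "/private/notes.html"))  # True
# Other robots fall back to the rules for all user agents
print(rp.can_fetch("Slurp", "/graphics/logo.gif"))       # True
print(rp.can_fetch("Slurp", "/private/notes.html"))      # False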

Source : http://www.pandia.com/sew/489-robots-txt.html