Sunday, July 29, 2007

The robots.txt file and search engine optimization



On how to tell the search engine spiders and crawlers which directories and files to include, and which to avoid.

Search engines find your web pages and files by sending out robots (also called bots, spiders or crawlers) that follow the links found on your site, read the pages they find and store the content in the search engine databases.

Dan Crow of Google puts it this way: “Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot “crawls” the web.”

But you may have directories and files you would prefer the search engine robots not to index. You may, for instance, have different versions of the same text, and you would like to tell the search engines which is the authoritative one (see: How to avoid duplicate content in search engine promotion).

How do you stop the robots?

The robots.txt file

If you are serious about search engine optimization, you should make use of the Robots Exclusion Standard by adding a robots.txt file to the root of your domain.

By using the robots.txt file you can tell the search engines what directories and files they should spider and include in their search results, and what directories and files to avoid.

This file must be uploaded to the web-accessible root directory of your site, not to a subdirectory. Hence Pandia’s robots.txt file is found at http://www.pandia.com/robots.txt.

Plain ASCII please!

robots.txt should be a plain ASCII text file.

Use a plain text editor or a text-based HTML editor to write it, not a word processor like Word.

Pandia’s robots.txt file gives a good example of an uncomplicated file of this type:

User-agent: *
Disallow: /ads/
Disallow: /banners/
Disallow: /cgi-local/
Disallow: /cgi-script/
Disallow: /graphics/

The first line tells the robots whether the rules given below it apply to them. The asterisk means that in this case the rules are for all search engine robots.
The next lines tell the robots which Pandia directories to avoid (disallow).

Let’s take a closer look at the syntax for disallowing directories and files.

Blocking an entire site

To block the entire site, disallow a single forward slash, like this:

Disallow: /

This is not a procedure we recommend! If you really want to keep search engine spiders out of your site, you should password protect it instead. The search engines have been known to ignore robots.txt files from time to time.
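How you password protect a directory depends on your server. On an Apache server, one common approach is HTTP basic authentication through a .htaccess file. The snippet below is only a sketch that assumes Apache; the password file path is a placeholder for one you create with the htpasswd utility:

# Hypothetical .htaccess sketch (Apache): require a login for everything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /home/example/.htpasswd
Require valid-user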

Blocking directories

To block a directory and all its files, put a slash in front of and after the directory name.

Disallow: /images/
Disallow: /private/photos/

Blocking single files

To stop the search engine(s) from including one file, write the file name after a slash, like this:

Disallow: /private_file.html

If the file is found in a subdirectory, use the following syntax:

Disallow: /private/conflict.html

Note that there are no trailing slashes in these instances.

Note also that the URLs are case sensitive. /letters/ToMum.html is not the same as /letters/tomum.html!

Identifying robots

The first line, User-agent: *, says that the following lines are for all robots.

You may also make different rules for different robots, like this:

User-agent: Googlebot
Disallow: /graphics/

Most web sites do not need to identify the different robots or crawlers in this way.

These are the names of the most common “bots”:
Googlebot (for Google web search)
Slurp (for Yahoo! web search)
msnbot (for Live Search web search)
Teoma (for Ask web search)
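If you do need per-robot rules, each User-agent line starts a new record, separated from the next by a blank line, and a robot follows the record that names it (falling back to the * record if none does). The file below is only an illustration, combining the bot names above with made-up directory names:

# Hypothetical example: separate rules for Googlebot, Slurp and all other robots
User-agent: Googlebot
Disallow: /banners/

User-agent: Slurp
Disallow: /banners/
Disallow: /graphics/

User-agent: *
Disallow: /cgi-local/

Here Googlebot skips /banners/, Yahoo!’s Slurp skips both /banners/ and /graphics/, and every other robot skips /cgi-local/.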

Source : http://www.pandia.com/sew/489-robots-txt.html

unavailable_after tag - Google Robots Exclusion Protocol

The ‘unavailable_after’ meta tag will soon be recognized by Google, according to Dan Crow, Director of Crawl Systems at Google. (From Loren Baker)

Google is coming out with a new tag called “unavailable_after” which will allow people to tell Google when a particular page will no longer be available for crawling. For instance, if you have a special offer on your site that expires on a particular date, you might want to use the unavailable_after tag to let Google know when to stop indexing it. Or perhaps you write articles that are free for a particular amount of time, but then get moved to a paid-subscription area of your site.
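Google’s announcement describes the tag as an ordinary META element placed in the head of the page and addressed to the Googlebot, with the expiry date written out in RFC 850 format. The snippet below is only a sketch; the date is made up:

<!-- Hypothetical example: asks Googlebot to drop this page from the index after the given date -->
<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 25-Aug-2007 15:00:00 EST">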

Two new features added to the protocol will help webmasters govern when an item should stop showing up in Google’s web search, as well as providing some control over the indexing of other data types.

One of the features, support for the unavailable_after tag, has been mentioned previously. Google’s Dan Crow made that initial disclosure.

He has followed that up with a full-fledged post on the official Google blog about the new tag. The unavailable_after META tag informs the Googlebot when a page should be removed from Google’s search results:

“This information is treated as a removal request: it will take about a day after the removal date passes for the page to disappear from the search results. We currently only support unavailable_after for Google web search results.”

“After the removal, the page stops showing in Google search results but it is not removed from our system.”
(Email from: David A. Utter)

One of the major issues plaguing search engines right now is the ever-growing number of web documents available online. While no exact numbers are available, there are billions of pages to sort through. But they can’t all be relevant, in both content and timeliness, can they?

Of course they’re not, and Google is hoping to solve this problem through the adoption of the unavailable_after META tag.
(From Sujan Patel: SEO Impact of Google’s unavailable_after META Tag)

Source : http://www.searchengineoptimizationcompany.ca