Robots.txt File for Controlling Search Engine Robots

Search Engine Robots

Search engines such as Google, AllTheWeb, AltaVista, Inktomi, Yahoo, Bing and others use robots, also called spiders or crawlers, to read web pages and index them in their search engine databases. Search engine robots read the HTML (the visible text) of a page; they are not good at reading and understanding JavaScript, frames, images or Flash.


Some Top Search Engines and their Robots

Google – Googlebot
Altavista – Scooter
AllTheWeb – FAST WebCrawler
MSN – MSNbot
Yahoo – Yahoo Slurp


Use of robots.txt File

Sometimes your website has private pages that contain confidential data, and you don’t want web-indexing robots to crawl or index those pages. By adding a robots.txt file you can ask spiders not to crawl them. “Robots.txt” is a plain text file that you put in your site’s root directory, naming it “robots” and giving it the extension “.txt”. Before crawling a site, a well-behaved robot first fetches this file (for example, https://www.example.com/robots.txt) and follows the rules it finds there.

What to Write in Robots.txt File

  User-agent: *   # applies to all robots
  Disallow: /     # disallow crawling of all pages

The “User-agent: *” line means this record applies to all robots. The “Disallow: /” line tells those robots that they should not crawl any page of the site. Note that comments in a robots.txt file start with “#”, not with “//”.

  User-agent: *
  Disallow:

This record gives robots complete access to visit all the pages of the website. If you don’t want to deny robots anything, the robots.txt file can even be left empty.

  User-agent: *
  Disallow: /mydata/

By doing this you are asking all robots not to crawl the directory named “mydata”. You can list several directories in one record, each on its own Disallow line, as shown below.
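For example, here is a sketch that also blocks a directory called “/private/” (the “/mydata/” directory comes from the example above; “/private/” is made up for illustration):

  User-agent: *
  Disallow: /mydata/
  Disallow: /private/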

  User-agent: SearchBot
  Disallow: /mydata/

This disallows only the robot named “SearchBot” from accessing “mydata”; robots with any other name are not affected by this record.
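Records for different robots can be combined in one file, separated by blank lines; each robot obeys the record that matches its name most specifically. A sketch, reusing the placeholder name “SearchBot” from above, that shuts “SearchBot” out of the whole site while only keeping other robots out of “mydata”:

  User-agent: SearchBot
  Disallow: /

  User-agent: *
  Disallow: /mydata/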

Thus, using a robots.txt file you can keep well-behaved robots away from your important information while still allowing them to crawl and index your public web pages. Keep in mind, though, that robots.txt is only a convention: badly behaved robots are free to ignore it, so it is not a substitute for real access control on truly confidential pages.
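To see how a polite crawler interprets these rules, here is a minimal sketch using Python’s standard urllib.robotparser module; the site www.example.com and the paths are hypothetical:

  from urllib.robotparser import RobotFileParser

  # Download and parse the site's robots.txt file.
  parser = RobotFileParser()
  parser.set_url("https://www.example.com/robots.txt")
  parser.read()

  # Ask whether a given robot may fetch a given URL.
  print(parser.can_fetch("Googlebot", "https://www.example.com/mydata/page.html"))
  print(parser.can_fetch("*", "https://www.example.com/index.html"))

A well-behaved crawler performs an equivalent check before requesting any page from a site.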

Some Unwanted Robots

URL_Spider_Pro, EmailSiphon, EmailCollector, TeleportPro, ExtractorPro, DittoSpyder, NetAnts, Python-urllib, Website Quester, TheNomad, InfoNaviRobot, Hidden-Referrer, Openbot, FairAd Client, MSIECrawler, Flaming AttackBot, etc. These are some known bad robots, mostly email harvesters and site rippers, from which you may want to protect your web pages.
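A sketch of robots.txt records that deny two of these robots, using the user-agent names from the list above (remember that many such robots simply ignore robots.txt):

  User-agent: EmailSiphon
  Disallow: /

  User-agent: EmailCollector
  Disallow: /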
