Search Engine Robots
Some Top Search Engines and their Robots
Google – Googlebot
Altavista – Scooter
AllTheWeb – FAST WebCrawler
MSN – MSNbot
Yahoo – Yahoo Slurp
Use of robots.txt File
Sometimes your website has private pages containing confidential data that you don't want web-indexing robots to crawl or index. By adding a robots.txt file you can ask spiders not to crawl these pages. "robots.txt" is a plain text file that you put in your site's root directory, naming it "robots" and giving it the extension ".txt".
What to Write in Robots.txt File
User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages
The "User-agent: *" line means the rules apply to all robots. The "Disallow: /" line tells robots not to index any page on the site. (Note that comments in robots.txt start with "#".)
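You can check how a crawler would interpret these rules with Python's standard `urllib.robotparser` module; this is a minimal sketch using a hypothetical URL:

```python
from urllib.robotparser import RobotFileParser

# The "disallow everything" rules from above.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# No robot may fetch any page on the site.
print(parser.can_fetch("Googlebot", "https://example.com/private.html"))  # False
```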
By contrast, the following gives robots complete access to all pages of the website, because an empty "Disallow:" line blocks nothing:

User-agent: *
Disallow:

If you don't want to deny any robot, the robots.txt file can even be left empty.
The following prevents all robots from crawling a directory named "mydata":

User-agent: *
Disallow: /mydata/
To disallow only the robot named "SearchBot" from accessing "mydata", name it as the user agent:

User-agent: SearchBot
Disallow: /mydata/
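The per-robot behavior can be verified the same way with `urllib.robotparser` (again using a hypothetical URL): the named robot is blocked from the directory, while robots not matched by any "User-agent" group are allowed by default.

```python
from urllib.robotparser import RobotFileParser

# Rules that apply only to SearchBot.
rules = [
    "User-agent: SearchBot",
    "Disallow: /mydata/",
]

parser = RobotFileParser()
parser.parse(rules)

# SearchBot may not enter /mydata/ ...
print(parser.can_fetch("SearchBot", "https://example.com/mydata/file.html"))  # False
# ... but other robots, such as Googlebot, are unaffected.
print(parser.can_fetch("Googlebot", "https://example.com/mydata/file.html"))  # True
```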
Thus, using a robots.txt file you can ask robots to stay away from your important information while still allowing them to crawl and index your public pages. Keep in mind, however, that robots.txt is advisory: well-behaved robots honor it, but badly behaved ones may simply ignore it.
Some Unwanted Robots
URL_Spider_Pro, EmailSiphon, EmailCollector, TeleportPro, ExtractorPro, DittoSpyder, NetAnts, Python-urllib, Website Quester, TheNomad, InfoNaviRobot, Hidden-Referrer, Openbot, FairAd Client, MSIECrawler, Flaming AttackBot, etc. These are some of the bad robots from which you should protect your web pages.
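A robots.txt file can at least request that such bots stay away, with one group per user agent (the names below are taken from the list above; a misbehaving crawler may ignore the file entirely, so server-side blocking is the only real enforcement):

```
User-agent: EmailSiphon
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: TeleportPro
Disallow: /
```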