
What Robots.txt is And Search Engine Robots Explained

You have a website, and it has to be indexed by search engines so that users can find you through search. This task is accomplished by search engine robots.

Search engine robots, also called search engine spiders or search engine crawlers, are software programs that automatically and continuously visit millions of websites every day and add what they find to search engine databases. This process is called crawling or spidering.

When a robot visits your site, the very first file it looks for is the robots.txt file, which should be located in your web root directory. That is the directory where your home page is located. For example, if your domain is www.example.com, the file should be reachable at http://www.example.com/robots.txt.

Simply put, robots.txt gives you control over:

  1. which crawlers may visit your site.
  2. which parts of your site may be visited and which parts crawlers should stay away from.

Using robots.txt is not compulsory. If the file is missing, search engine robots assume that your entire site may be visited and indexed by any crawler.

Search engine robots are automated crawling programs that visit websites and travel the web via links. Well-behaved commercial robots follow the Robots Exclusion Protocol. Not all robots comply with the protocol, but the majority do.
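To see the protocol in action, here is a brief sketch using Python's standard urllib.robotparser module to check whether a given robot may fetch a given URL under a set of robots.txt rules. The rules and the bot name "SomeOtherBot" below are illustrative examples, not real crawlers (apart from Googlebot):

```python
from urllib import robotparser

# Parse some robots.txt rules in memory. These records follow the same
# style as the examples later in this article.
rules = """
User-agent: Googlebot
Disallow: /article/password-safe.html

User-agent: *
Disallow: /images/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from the single file named in its record
print(parser.can_fetch("Googlebot", "/article/password-safe.html"))  # False

# Any other robot falls under the * record, which blocks /images/
print(parser.can_fetch("SomeOtherBot", "/images/photo.jpg"))         # False
print(parser.can_fetch("SomeOtherBot", "/index.html"))               # True
```

Note that only one record applies to a given robot: Googlebot matches its own record, so the * record's rules do not apply to it.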

You can create robots.txt with a simple text editor such as Notepad. In this file, you control the following two things:

1. Which search engine robots may visit your site.
2. Which parts of your site you allow search engine robots to visit and index.

robots.txt code examples:

If the robots.txt file does not exist, or the file exists but is empty, all robots are allowed to access any part of the site.

Note: in the following syntax, anything after # is a comment. Robots ignore that part.

  1. Allow all robots to visit any part of your site. * means all robots, and an empty Disallow value means nothing is off limits.

    # This is for all spiders
    User-agent: *
    Disallow:

  2. Disallow all robots to visit any part of your site.

    User-agent: *
    Disallow: /

  3. Disallow Google's robot (named Googlebot) from visiting and indexing a single file, password-safe.html.

    User-agent: Googlebot
    Disallow: /article/password-safe.html

  4. Disallow all robots from visiting any page in a particular directory. For example, to block a directory named images, you would use the following robots.txt entry:

    User-agent: *
    Disallow: /images/

    Note: the above syntax is slightly different from the syntax below:

    User-agent: *
    Disallow: /images

    In addition to disallowing visits to any pages in the images directory, this syntax also disallows visits to any file whose path begins with /images, such as images.html.

  5. Disallow only AltaVista's robot (named Scooter) from indexing your whole site.

    User-agent: Scooter
    Disallow: /

  6. Disallow Google's image robot, named Googlebot-Image, from indexing your images directory.

    User-agent: Googlebot-Image
    Disallow: /images/
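The individual rules above can be combined in a single robots.txt file. A file contains one or more records separated by blank lines; each record names a user agent and then lists the paths that agent may not visit. For example, a hypothetical site might combine several of the rules above like this:

```
# Keep AltaVista's Scooter out of the whole site
User-agent: Scooter
Disallow: /

# Keep Googlebot away from one sensitive file
User-agent: Googlebot
Disallow: /article/password-safe.html

# All other robots: stay out of the images directory
User-agent: *
Disallow: /images/
```

Remember that a robot obeys only the record that matches it most specifically, so in this sketch Googlebot is restricted only by its own record, not by the * record.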

More robots.txt tools and resources

  1. Robots.txt generator

    Search Google for the phrase "robots.txt generator" (without quotes) and you will find plenty of online robots.txt generators. These tools let you create robots.txt rules without writing them yourself, but you still have to paste the generated rules into a text editor, save the file, and upload it to your web root directory.

  2. Robots.txt Validator

    There are many resources on the web that validate your robots.txt for syntax errors. Search Google for "robots.txt validator" and you will find plenty.

    There are also many useful robots.txt resources elsewhere on the web, including a robots.txt forum on WebmasterWorld.

  3. Robot names

    Here is a list of the most commonly known web crawler names:

    Googlebot      Google's web crawler
    Yahoo! Slurp   Yahoo!'s web crawler
    MSNBot         MSN Search's web crawler
    Teoma          Ask Jeeves' web crawler
    ia_archiver    Alexa/Internet Archive's web crawler

    To see a comprehensive and frequently updated list of robot names, check out the award-winning web page "Search engine robots that visit your web site".

  4. Real world examples

    You can view any site's live robots.txt by appending /robots.txt to its domain name, for example http://www.google.com/robots.txt.

  5. For more information about robots.txt, check out The Web Robots Pages at www.robotstxt.org, the home of the Robots Exclusion Protocol.

Ban bad web robots

Not all search engine robots are good. Anyone with decent programming skills in a language such as C++, or a scripting language such as Python or PHP, can write a robot program.

Because of these low barriers to entry, there are hundreds of web crawlers out on the internet. Some of these robots are written to fetch and download the pages of an entire website onto a hard drive for the purpose of plagiarism. These robots sometimes do not even obey the rules you set up in robots.txt, and such spambots can seriously eat into your website's monthly bandwidth allowance.

If you are concerned about spambots, you can set up a bad-robot trap and then ban them.
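One common trap works like this: disallow a decoy path in robots.txt that no legitimate visitor would request; any client that fetches it anyway has ignored your rules and is a candidate for banning. Here is a minimal Python sketch of the log-scanning half of that idea. The trap path, log format, and function name are hypothetical illustrations:

```python
# Decoy path that robots.txt disallows; only rule-ignoring bots request it.
TRAP_PATH = "/bot-trap/"

def find_bad_bots(log_lines):
    """Return the set of client IPs that requested the disallowed trap path.

    Assumes Apache/Nginx common log format, e.g.:
    1.2.3.4 - - [10/Oct/2017:13:55:36 -0700] "GET /bot-trap/ HTTP/1.1" 200 100
    """
    bad = set()
    for line in log_lines:
        parts = line.split()
        # parts[0] is the client IP, parts[6] is the requested path
        if len(parts) >= 7 and parts[6].startswith(TRAP_PATH):
            bad.add(parts[0])
    return bad

logs = [
    '1.2.3.4 - - [10/Oct/2017:13:55:36 -0700] "GET /bot-trap/ HTTP/1.1" 200 100',
    '5.6.7.8 - - [10/Oct/2017:13:55:37 -0700] "GET /index.html HTTP/1.1" 200 100',
]
print(find_bad_bots(logs))  # {'1.2.3.4'}
```

The actual banning (by IP or user-agent) is then done in your server configuration, for example with deny rules in Apache.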

Read the following two articles for some great information:

  1. Stopping Spambots: A Spambot Trap

  2. Protect Your Web Server From Spambots

What If...

Robots.txt is great, but it must be placed in your web root directory. What if you don't have permission to access the root directory? What if you want to ban robots from indexing a specific web page rather than a directory? The answer is to use page-level robots control techniques, applied on a page-by-page basis.
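The most common page-level technique is the robots meta tag, placed in the head section of an individual HTML page. For example, to ask robots not to index a page and not to follow its links:

```
<head>
  <!-- Ask robots not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```

Because the tag lives inside the page itself, you need no access to the web root directory, and you can control each page independently.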


Related Articles:

1. Absolute Path and Relative Path Explained
2. The Difference Between Dynamic URLs and Static URLs
3. Robots Meta HTML Tag Syntax Explained
4. What If You Don't Want Your Pages To Be Crawled and Cached by Search Engines
5. The Right Domain Name Drives More Website Traffic


Copyright © 2017 All Rights Reserved.


No portion may be reproduced without my written permission. Software and hardware names mentioned on this site are registered trademarks of their respective companies. Should any right be infringed, it is totally unintentional. Drop me an email and I will promptly and gladly rectify it.
