Web site Design and Search Engine Optimization

What is a robots.txt file?

 

  • Use a robots.txt file to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines. The file tells crawlers which directories they may and may not crawl. Keep it current for your site so that you don't accidentally block crawlers from pages you want to appear in search results.
  • The file is placed in the root directory of your site. The following is taken from the top of Google's robots.txt file (http://www.google.co.uk/robots.txt):
    User-agent: *
    Allow: /searchhistory/
    Disallow: /news?output=xhtml&
    Allow: /news?output=xhtml
    Disallow: /search
    Disallow: /groups
    Disallow: /images
    Disallow: /catalogs
    Disallow: /catalogues
    Disallow: /news
      User-agent: *  indicates that the rules that follow apply to all robots.

      Disallow: /news   tells the robots not to crawl any URL whose path begins with /news.
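
The prefix matching behind these rules can be tried out with Python's standard urllib.robotparser module. The following is a minimal sketch, not a description of how any particular search engine evaluates the file: the base URL and the crawler name "ExampleBot" are made up for illustration, and the two /news?output=xhtml lines from the excerpt are left out for brevity. Python's parser applies the first rule that matches in file order, whereas some crawlers prefer the longest matching path; the paths tested here give the same answer under either approach.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the excerpt above, minus the two /news?output=xhtml lines.
RULES = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)
BASE = "http://www.example.com"  # hypothetical site, for illustration only

for path in ("/searchhistory/", "/search?q=robots", "/news/today.html", "/about.html"):
    verdict = "allowed" if parser.can_fetch("ExampleBot", BASE + path) else "blocked"
    print(path, verdict)
```

Because matching is by prefix, Disallow: /news blocks /news/today.html and would equally block a path such as /newsletter.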

    If you have 90,000 pages on your Web site, it is unlikely that a search engine will index them all. You may find that critical pages are left out while insignificant ones are indexed. Disallowing the insignificant pages in robots.txt helps steer crawlers toward the pages that matter.

    How to check your robots.txt file

    Open your web browser and enter www.yourdomain.com/robots.txt to view the contents of your robots.txt file. Here are the most important tips for a correct robots.txt file:

     

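    1.  The original robots.txt standard defines only two directives: User-agent and Disallow. Extensions such as Allow, which appears in the Google example above, are understood by the major search engines but not necessarily by every crawler, so use them sparingly.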

    2.  Don't change the order of the commands. Start with the User-agent line and then add the Disallow lines:

    User-agent: *
    Disallow: /cgi-bin/

    3.  Don't use more than one directory in a Disallow line. "Disallow: /support /cgi-bin/ /images/" does not work. Use an extra Disallow line for every directory:

    User-agent: *
    Disallow: /support
    Disallow: /cgi-bin/
    Disallow: /images/

    4.  Be sure to use the right case. The file names on your server are case sensitive, and so is path matching in robots.txt. If the name of your directory is "Support", don't write "support" in the robots.txt file.
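
Tips 1 to 3 lend themselves to an automated check. The sketch below is a hypothetical helper (the name lint_robots and its warning messages are invented here, not part of any standard tool). It flags directives other than User-agent and Disallow, Disallow lines that appear before any User-agent line, and Disallow lines that list more than one path. It deliberately ignores everything else a full validator would look at.

```python
def lint_robots(text):
    """Return a list of (line_number, message) warnings for a robots.txt text."""
    warnings = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()
        if name == "user-agent":
            seen_user_agent = True
        elif name == "disallow":
            if not seen_user_agent:
                warnings.append((number, "Disallow before any User-agent line"))
            if len(value.split()) > 1:
                warnings.append((number, "more than one path in a Disallow line"))
        else:
            # Not necessarily an error: some crawlers may simply ignore it.
            warnings.append((number, f"'{field}' is not User-agent or Disallow"))
    return warnings


bad = """\
Disallow: /tmp/
User-agent: *
Disallow: /support /cgi-bin/ /images/
Crawl-delay: 10
"""
for number, message in lint_robots(bad):
    print(f"line {number}: {message}")
```

A warning for an extension directive is only a prompt to check whether the crawlers you care about support it.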

    You can find user agent names in your log files by checking for requests to robots.txt. Usually, all search engine spiders should be given the same rights. To do that, use User-agent: * in your robots.txt file.
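
As a rough illustration of the log check described above, the following sketch counts the user agents that requested /robots.txt. It assumes the common "combined" access log format used by Apache and nginx; other formats would need a different pattern. The sample log lines, including the IP addresses and timestamps, are invented for the example, and robots_txt_agents is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields
# of a "combined" format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count user-agent strings seen on requests for /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path").split("?", 1)[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample data for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2023:14:02:11 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.5 - - [10/Oct/2023:14:05:42 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.10 - - [11/Oct/2023:09:12:03 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for agent, hits in robots_txt_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```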

    What happens if you don't have a robots.txt file?

    If your Web site doesn't have a robots.txt file (you can check this by entering www.yourdomain.com/robots.txt in your web browser), search engines will treat every page they can find on your site as open to crawling and indexing.
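
The browser check can also be scripted. This is a minimal sketch using only the Python standard library; robots_url and has_robots_txt are hypothetical helper names, www.yourdomain.com is the placeholder used throughout this article, and the sketch assumes that a missing file shows up as an HTTP error such as 404. Only robots_url is exercised here, because has_robots_txt needs network access.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site_url):
    """Return the robots.txt URL for the site that serves site_url."""
    if "://" not in site_url:
        site_url = "http://" + site_url  # assume plain HTTP when no scheme is given
    parts = urlsplit(site_url)
    # robots.txt always lives in the root directory, whatever page we started from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(site_url, timeout=10):
    """Return True if the server answers the robots.txt request with HTTP 200."""
    try:
        with urlopen(robots_url(site_url), timeout=timeout) as response:
            return response.status == 200
    except HTTPError:
        return False  # for example 404: no robots.txt on this site
    # Connection problems (urllib.error.URLError) are left to propagate.


print(robots_url("www.yourdomain.com/products/list.html"))
```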