Sunday, 27 July 2014

Robots.txt: why and how? (Not for access control)

This post is all about how websites get indexed by search engines. Before I start talking about robots.txt, I need to do a bit of explaining around sitemaps.

Sitemaps are what they say they are: an easy-to-explore index of your site. A sitemap is an XML file containing the URLs that a search engine will use to index the contents of your website. There is potentially a whole load of stuff at the back end of your website, such as an online shop, a database, or stacks of WordPress plugins, themes and other demo stuff, so you want control over what gets indexed and what doesn't.
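To make the format concrete, here is a minimal sketch in Python that builds a sitemap by hand. The example.com URLs and the build_sitemap helper are made up for illustration; the namespace is the one defined by the sitemaps.org protocol. Online generators produce the same kind of file, usually with extra optional fields.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing the given page URLs.

    Hypothetical helper for illustration, not part of any library.
    """
    # Serialise with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages; swap in the URLs you actually want indexed.
pages = ["https://www.example.com/", "https://www.example.com/about/"]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Only the pages you pass in end up in the file, which is exactly the control you want: leave the test pages and demo content out of the list.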

I prefer to create a sitemap using an online tool such as http://www.web-site-map.com/
When you run this, you will notice that all URLs on your website are indexed, including test pages, orphaned links and all sorts of shizzle you don't want Google to know about.

This is where the robots.txt file comes in. The robots.txt file is used to give a set of instructions to web robots. Figure 1 below is an example of such a file, used to block certain parts of a WordPress site.

Fig. 1 Web robots.txt through Webmaster Tools

The User-agent: * line means that the rules that follow apply to all robots crawling the site. The Disallow: statements below it tell the robot not to crawl, and so not to index, anything under these URLs (each path starts with / and is relative to the root of the domain in question). The robots.txt file itself should sit right in the root of your website. An easy way to check that the file actually works is to upload the robots.txt file to your website and test it as per above, through Google Webmaster Tools.
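If you want to sanity-check a set of rules before uploading them, Python's standard library ships a robots.txt parser. The sketch below is only an illustration: the robots.txt content, the example.com domain and the allowed helper are assumptions of mine, not the file from the figure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a WordPress site; the paths are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /test/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the rules above let `agent` fetch `path`."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(allowed("/wp-admin/options.php"))  # False: under a Disallow rule
print(allowed("/blog/my-post/"))         # True: no rule matches
```

To test a live site instead, RobotFileParser also has set_url() and read(), which fetch the file from the address you give it.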

There are two important considerations when using robots.txt:
  • Robots can ignore your /robots.txt. Malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit.
So don't try to use /robots.txt to hide information.
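To see why, here is a rough sketch of how little effort it takes to read the "hidden" paths back out of a robots.txt file. The disallowed_paths helper and the sample file are hypothetical, written just for this illustration.

```python
def disallowed_paths(robots_txt):
    """Return every path named in a Disallow line, in file order.

    Hypothetical helper: a few lines of string handling, no special tools.
    """
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Made-up example file; any visitor can download the real one from a site.
sample = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private-drafts/
"""
print(disallowed_paths(sample))  # ['/wp-admin/', '/private-drafts/']
```

Anything that really needs protecting should sit behind proper authentication, not behind a Disallow line.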

Check these guys out for more examples:

http://www.robotstxt.org/robotstxt.html


Namaste!
