Search engines visit websites, blogs and other online portals to scan and
then store or cache their content, which is then used to rank pages by
relevance. On a large portal the entire site may be indexed according to
each search engine's criteria, and most engines index pages collectively
to rank the site as a whole. This means pages that are unimportant, or
that hold information not meant for the engines, get indexed too.
To tell the robots which pages to index and which to ignore, a common protocol called robots.txt is used. A plain-text file named robots.txt is uploaded to the root directory of the server along with the pages; robots visit this
file first and then index the pages accordingly. Keep in mind that some
robots, especially those with malicious intent, may not honour the
protocol, but all the popular search companies adhere to this standard,
which is public.
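For example, if a site lives at https://example.com (a placeholder domain used here only for illustration), crawlers will request https://example.com/robots.txt before fetching any other URL; the file must sit at the root of the site, not in a subdirectory.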
At its simplest the file contains a pair of commands like the following, which blocks all robots from the entire site:
User-agent: *
Disallow: /
More instructions are given below:
To block the entire site, use a forward slash.
Disallow: /
To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /junk-directory/
To block a page, list the page.
Disallow: /private_file.html
To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
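Putting several of these rules together, a complete robots.txt might look like the sketch below (the directory, file and image names are placeholders, not part of any real site):
User-agent: *
Disallow: /junk-directory/
Disallow: /private_file.html

User-agent: Googlebot-Image
Disallow: /images/

User-agent: Googlebot
Disallow: /*.gif$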
More information is available at robotstxt.org.
The same information can also be given in a robots meta tag, which must then
be present on every page. Another method is the X-Robots-Tag HTTP header,
which is set at the server level and therefore works for all pages; both are
shown below. Care should be taken that the directives are correct and that no
important page is barred. This also applies to CMS-driven portals.
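For illustration, a robots meta tag placed in the <head> of a page and an equivalent HTTP response header might look like this (noindex, nofollow is just one possible combination of values):
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow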
You also have to check the directives regularly so that, over time, no wrong
or outdated rule creeps in. There are a number of tools that can help you
keep things in line.
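As one lightweight check, Python's standard urllib.robotparser module can read a live robots.txt and report whether a given URL is allowed; the sketch below assumes a placeholder domain and paths:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (example.com is a placeholder).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether a generic crawler may fetch specific pages.
print(rp.can_fetch("*", "https://example.com/private_file.html"))  # False if the page is disallowed
print(rp.can_fetch("*", "https://example.com/index.html"))         # True if the page is not blocked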