What is the robots.txt file

Search engines collect web pages on the Internet automatically through spider programs (also known as robots or crawlers) and retrieve information from them.

In view of network security and privacy, search engines follow the robots.txt protocol. By creating a plain-text file named robots.txt in the root directory of a website, you can declare which parts of the site robots should not visit. Each site can thereby control whether it is indexed by search engines at all, or specify that only certain content be indexed. When a search-engine crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If the file does not exist, the crawler follows links and crawls the site freely; if it does exist, the crawler determines its access scope according to the contents of the file.

Whether a site can be indexed by a search engine therefore depends not only on whether it has been submitted to the engine or exchanges links with other sites, but also on whether the robots.txt file in its root directory blocks that engine. Below are some notes on writing the robots.txt file.

robots.txt file format

robots.txt must be placed in the root directory of the site, and the file name must be all lowercase. The file uses three directives:

User-agent: defines which search engine (crawler) the following rules apply to
Disallow: defines paths the search engine is forbidden to index
Allow: defines paths the search engine is allowed to index

Writing the robots.txt file

User-agent: *  (here * is a wildcard representing every search engine)

To address one engine instead of all of them, use its spider name. Common spider names by search engine:

Google spider: googlebot
Baidu spider: baiduspider
Yahoo spider: slurp
MSN spider: msnbot
Alexa spider: ia_archiver
Altavista spider: scooter
Lycos spider: lycos_spider_(T-Rex)
AlltheWeb spider: fast-webcrawler
Inktomi spider: slurp
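To make the directives concrete, here is a sketch of a robots.txt file that combines them; the paths are hypothetical, and baiduspider is used only as an example of targeting one crawler by name:

```
# Rules for Baidu's crawler only: block the whole site
User-agent: baiduspider
Disallow: /

# Rules for every other crawler: block one directory,
# but allow a single file inside it
User-agent: *
Disallow: /private/
Allow: /private/readme.html
```

A crawler reads the group whose User-agent line matches its own name, falling back to the `*` group if none does.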
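The access-scope check a crawler performs can be sketched with Python's standard urllib.robotparser module; the robots.txt contents and paths below are made up for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt contents for illustration.
ROBOTS_TXT = """\
User-agent: baiduspider
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The generic "*" group blocks only /private/.
print(parser.can_fetch("mycrawler", "/index.html"))      # True
print(parser.can_fetch("mycrawler", "/private/a.html"))  # False

# baiduspider has its own group that blocks everything.
print(parser.can_fetch("baiduspider", "/index.html"))    # False
```

A polite crawler would run such a check against the site's /robots.txt before fetching any other page; if the file does not exist, everything is treated as fetchable.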