The Web Robots FAQ
The original of this document is at http://info.webcrawler.com/mak/projects/robots/faq.html
Send suggestions and comments to
Martijn Koster.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.
So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.
Published by New Riders, ISBN 1-56205-463-5.
Williams' book 'Bots and other Internet Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together.
Published by Sams, ISBN 1-57521-016-9.
Although this is hosted at the site of one of the major robots, it is an unbiased and reasonably comprehensive collection of information, maintained by Martijn Koster <m.koster@webcrawler.com>.
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.
Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources for URLs are used, such as scans of USENET postings, published mailing list archives etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
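The select, visit, parse cycle described above can be sketched as a simple queue. This is only an illustration under stated assumptions: the `crawl` function, the `LinkExtractor` class, and the in-memory `PAGES` table standing in for real HTTP retrieval are all hypothetical names, and a real robot would add politeness delays and /robots.txt checks.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in one document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_urls, fetch, limit=100):
    """Visit URLs breadth-first, using each page as a source of new URLs.

    `fetch` is any callable returning the HTML for a URL, or None.
    """
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" used instead of real network requests.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

The `seen` set is what keeps the robot from revisiting a document it has already queued.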
We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...
Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
If you think you have discovered a new robot (i.e. one that is not on the list of active robots) and it makes more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
First of all, check whether it is a problem by checking the load on your server and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.
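The '/robots.txt' check mentioned above can be automated. The following is a minimal sketch, assuming the server writes its access log in the Common Log Format; the function name `robots_txt_requesters` and the regular expression are illustrative, not part of any server software.

```python
import re
from collections import Counter

# One Common Log Format line: host ident user [date] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def robots_txt_requesters(lines):
    """Count, per remote host, how often /robots.txt was requested."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            hits[match.group("host")] += 1
    return hits


SAMPLE = [
    'lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] '
    '"GET /robots.txt HTTP/1.0" 404 -',
    'lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] '
    '"GET / HTTP/1.0" 200 3270',
]

for host, count in robots_txt_requesters(SAMPLE).items():
    print(host, count)
```

Hosts that appear here repeatedly are good candidates for comparison against the list of active robots.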
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response etc; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.
If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)
To keep all robots off your entire server, put these two lines in /robots.txt:

User-agent: *
Disallow: /

but it's easy to be more selective than that.
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.
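The effect of the three paragraphs can be checked mechanically. This is a minimal sketch using Python's standard `urllib.robotparser` module; the robot name 'otherbot' and the test paths are made up for illustration. Note that this module matches robot names by case-insensitive substring, which may differ from what a particular robot does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# 'webcrawler' has nothing disallowed.
print(rules.can_fetch("webcrawler", "/tmp/scratch.html"))
# 'lycra' is shut out of the whole site.
print(rules.can_fetch("lycra", "/index.html"))
# Any other robot (hypothetical name) falls under the '*' record.
print(rules.can_fetch("otherbot", "/logs/access.log"))
print(rules.can_fetch("otherbot", "/index.html"))
```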
Two common errors:
The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed.

If you do:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.
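From the robot's side, honouring these tags means reading them before indexing a page or following its links. A minimal sketch with Python's standard `html.parser` follows; the class and function names are hypothetical, and treating a page without the tag as indexable and followable is an assumption of this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records NOINDEX / NOFOLLOW from <META NAME="ROBOTS"> tags."""

    def __init__(self):
        super().__init__()
        self.may_index = True    # assumed default when no tag is present
        self.may_follow = True   # assumed default when no tag is present

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for term in (attr.get("content") or "").upper().split(","):
            term = term.strip()
            if term == "NOINDEX":
                self.may_index = False
            elif term == "NOFOLLOW":
                self.may_follow = False


def robots_meta(html):
    """Return (may_index, may_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.may_index, parser.may_follow


print(robots_meta('<head><META NAME="ROBOTS" CONTENT="NOINDEX"></head>'))
print(robots_meta('<head><META NAME="ROBOTS" CONTENT="NOFOLLOW"></head>'))
```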
In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.
Alternatively check out the libwww-perl5 package, which has a simple example.
, ( know-how ). , , (wanderers, spiders, robots - , ) , .
, , , :
lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270
Lycos , , /robots.txt , , . , , .
, "" , . , . Standard for Robot Exclusion.
(Louis Monier, Altavista), 5% /robots.txt ( ) . , Lycos. (Charles P.Kollar, Lycos) , 6% /robots.txt 200. , :
/robots.txt (spiders) , , .. , /robots.txt. 0 , ( agent_id), . , /robots.txt, Product Token User-Agent, HTTP- . , Lycos User-Agent:
Lycos_Spider_(Rex)/1.0 libwww/3.1
Lycos /robots.txt - , . Lycos "" /robots.txt - , .
/robots.txt - . , , , /robots.txt . /robots.txt:
.
[ # comment string NL ]*
User-Agent: [ [ WS ]+ agent_id ]+ [ [ WS ]* # comment string ]? NL
[ # comment string NL ]*
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
[
# comment string NL
|
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
]*
[ NL ]+
, /robots.txt
[...]+ Brackets followed by + mean that the enclosed terms must appear one or more times.
, "User-Agent:" agent_id.
[...]* Brackets followed by * mean that the enclosed terms may appear zero or more times.
, .
[...]? Brackets followed by ? mean that the enclosed terms may appear zero or one time.
, "User-Agent: agent_id" .
..|.. The bar separates alternatives: either what precedes it or what follows it.
WS - whitespace: a tab (011) or a space (040)
NL - end of line: a carriage return (015) or a line feed (012) (Enter)
User-Agent: ( ).
agent_id .
Disallow: ( ).
# , comment string - .
agent_id , WS NL, agent_id . * .
path_root , WS NL, , .
User-Agent, . : Disallow. . (lines). . . # . User-Agent Disallow. # , , agent_id path_root . User-Agent agent_id, path_root Disallow . User-Agent Disallow . /robots.txt agent_id, /robots.txt.
, .
User-Agent: *
/robots.txt agent_id, .
URL /robots.txt. path_root .
Example 1:
User-Agent: *
Disallow: /
User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/
1 /robots.txt . . Lycos /cgi-bin/ /tmp/, - . Lycos.
Example 2:
User-Agent: Copernicus Fred
Disallow:
User-Agent: * Rex
Disallow: /t
2 /robots.txt . Copernicus Fred . - Rex , /tmp/, /tea-time/, /top-cat.txt, /traverse.this .. .
Example 3:
# This is for every spider!
User-Agent: *
# stay away from this
Disallow: /spiders/not/here/ #and everything in it
Disallow: # a little nothing
Disallow: #This could be habit forming!
# Don't comments make code much more readable!!!
3 - . /spiders/not/here/, /spiders/not/here/really/, /spiders/not/here/yes/even/me.html. /spiders/not/ /spiders/not/her ( '/spiders/not/').
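The grammar and the three examples can be exercised with a small parser. This is only a sketch under stated assumptions: it accepts several agent_id tokens on a User-Agent line and several path_root tokens on a Disallow line, as the grammar permits; it treats '#' as starting a comment; it matches path_root as a plain prefix; and it prefers a record that names the robot over the '*' record. The function names `parse_records` and `is_allowed` are hypothetical.

```python
def parse_records(text):
    """Return a list of (agent_ids, path_roots) records."""
    records = []
    agents, paths, seen_disallow = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        tokens = value.split()
        if field == "user-agent":
            if seen_disallow:  # a User-Agent after Disallow starts a new record
                records.append((agents, paths))
                agents, paths, seen_disallow = [], [], False
            agents.extend(tokens)
        elif field == "disallow":
            seen_disallow = True
            paths.extend(tokens)
    if agents:
        records.append((agents, paths))
    return records


def is_allowed(records, agent, path):
    """True if `agent` may retrieve `path` under the parsed records."""
    chosen = None
    for agents, paths in records:
        names = [a.lower() for a in agents]
        if agent.lower() in names:
            chosen = paths  # a record naming the robot wins
            break
        if "*" in names and chosen is None:
            chosen = paths  # otherwise fall back to the '*' record
    if chosen is None:
        return True
    return not any(path.startswith(root) for root in chosen)


EXAMPLE_1 = """\
User-Agent: *
Disallow: /
User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/
"""

EXAMPLE_2 = """\
User-Agent: Copernicus Fred
Disallow:
User-Agent: * Rex
Disallow: /t
"""

records_1 = parse_records(EXAMPLE_1)
records_2 = parse_records(EXAMPLE_2)
print(is_allowed(records_1, "Lycos", "/index.html"))
print(is_allowed(records_2, "Rex", "/tea-time/"))
```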
(Standard for Robot Exclusion).
, , , , .. , .
.
Internet, , . , /robots.txt , .
/robots.txt.
Altavista, Excite, Infoseek, Lycos, OpenText and WebCrawler.
, (Excite, Infoseek, Lycos, Opentext WebCrawler) Distributing Indexing Workshop (W3C) , .
- HTML , . :
, /robots.txt -. HTML-, ( /robots.txt).
<META NAME="ROBOTS" CONTENT="robot_terms">
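robot_terms - a comma-separated list of the following keywords: ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
NONE - robots should ignore this page (equivalent to NOINDEX, NOFOLLOW).
ALL - robots may index this page and follow its links (equivalent to INDEX, FOLLOW).
INDEX - this page may be indexed.
NOINDEX - this page must not be indexed.
FOLLOW - links on this page may be followed.
NOFOLLOW - links on this page must not be followed.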
- robot_terms, robot_terms= INDEX, FOLLOW (.. ALL). CONTENT ALL, , .. CONTENT , , FOLLOW, NOFOLLOW, ( FOLLOW).
robot_terms NOINDEX, . robot_terms NOFOLLOW, , , , .
<META NAME="KEYWORDS" CONTENT="phrases">
phrases - ( ), (.. ). , , .
<META NAME="DESCRIPTION" CONTENT="text">
text - , . - .
-, "" . Altavista KEYWORDS -, Infoseek KEYWORDS DESCRIPTION -.
"" bookmark , . URL, bookmark. /robots.txt, , .
- DOCUMENT-STATE . , - CONTENT=STATIC.
<META NAME="DOCUMENT-STATE" CONTENT="STATIC">
<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">
- , CGI-. , , . , , , . , - URL URL ( - ).
<META NAME="URL" CONTENT="absolute_url">
Martijn Koster , .
30 1994 robots-request@nexor.co.uk ( WebCrawler. . Robots pages at WebCrawler info.webcrawler.com/mak/projects/robots/) . Technical World Wide Web www-talk@info.cern.ch .
- , , . - .
info.webcrawler.com/mak/projects/robots/robots.html
(wanderers, spiders) - , - Internet.
1993 1994 , . , , . , , , "" , CGI-. .
, , . HTTP URL /robots.txt. . .
, , , . /robots.txt -.
URL :
/robots.txt :
(records), ( CR, CR/NL NL). (lines) :
"<field>:<optional_space><value><optional_space>".
<field> .
UNIX : # , - .
User-Agent, Disallow, . .
User-Agent