The Web Robots FAQ
Original of this document is here: http://info.webcrawler.com/mak/projects/robots/faq.html
These are frequently asked questions about Web robots.
Send suggestions and comments to Martijn Koster.

About WWW robots
Indexing robots
For Server Administrators
Robots exclusion standard
Availability

About Web Robots

What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
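
To make the definition concrete, here is a minimal sketch of such a traversal in Python. It walks a small in-memory link graph rather than the real Web; the LINKS table, the page names and the traverse function are all hypothetical. A real robot would fetch documents over HTTP, check /robots.txt first, and space out its requests.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it references.
LINKS = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}

def traverse(start, links):
    """Visit every document reachable from start, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()          # "retrieve" the document
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:      # never request the same document twice
                seen.add(target)
                queue.append(target)
    return order

print(traverse("/index.html", LINKS))
```

The queue gives a breadth-first order here, but as noted above any selection heuristic would still make it a robot.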

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.

What is an agent?

The word "agent" is used for lots of meanings in computing these days. Specifically:

Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do (e.g. General Magic's Telescript). These can only travel between special servers and are currently not widespread on the Internet.
Intelligent agents are programs that help users with things, such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.
User-agent is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator, Email User-agents like Qualcomm Eudora, etc.

What is a search engine?

A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

What other kinds of robots are there?

Robots can be used for a number of purposes:

Indexing
HTML validation
Link validation
"What's New" monitoring
Mirroring

See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list...

So what are Robots, Spiders, Web Crawlers, Worms, Ants?

They're all names for the same sort of thing, with slightly different connotations:

Robots: the generic name, see above.
Spiders: same as robots, but sounds cooler in the press.
Worms: same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers: same as robots, but note WebCrawler is a specific robot
WebAnts: distributed cooperating robots.

Aren't robots bad for the web?

There are a few reasons people believe robots are bad for the Web:

Certain robot implementations can (and have in the past) overloaded networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.

But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.

Are there any robot books?

Yes:

Internet Agents: Spiders, Wanderers, Brokers, and Bots by Fah-Chun Cheong.

This book covers Web robots, commerce transaction agents, MUD agents, and a few others. It includes source code for a simple Web robot built on top of libwww-perl4.

Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.

Published by New Riders, ISBN 1-56205-463-5.

Bots and Other Internet Beasties by Joseph Williams

I haven't seen this myself, but someone said: The Williams book 'Bots and Other Internet Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together.

Published by Sams, ISBN 1-57521-016-9.

Web Client Programming with Perl by Clinton Wong

This O'Reilly book is planned for Fall 1996, check the O'Reilly Web Site for the current status. It promises to be a practical book, but I haven't seen it yet.

A few others can be found in the Software Agents Mailing List FAQ.

Where do I find out more about robots?

There is a Web robots home page on: http://info.webcrawler.com/mak/projects/robots/robots.html

While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information, maintained by Martijn Koster <m.koster@webcrawler.com>.

Of course the latest version of this FAQ is there.

You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.

Indexing robots

How does a robot decide where to visit?

This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.

Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.

Sometimes other sources for URLs are used, such as scanners through USENET postings, published mailing list archives, etc.

Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.

How does an indexing robot decide what to index?

If an indexing robot knows about a document, it may decide to parse it and insert it into its database. How this is done depends on the robot: some robots index the HTML titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.

We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...

How do I register my page with a robot?

You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page.

Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.

For Server Administrators

How do I know if I've been visited by a robot?

You can check your server logs for sites that retrieve many documents, especially in a short time.

If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.

Finally, if you notice a site repeatedly checking for the file '/robots.txt', chances are that is a robot too.
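
As a rough illustration of these checks, the sketch below counts requests per host and notes which hosts asked for /robots.txt. It assumes your server writes Common Log Format lines; the sample entries and host names are made up, and the summarize function is only an example, not part of any server package.

```python
import re
from collections import Counter

# Assumes Common Log Format lines; the sample entries below are made up.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')

SAMPLE_LOG = """\
crawler.example.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
crawler.example.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270
crawler.example.com - - [01/Mar/1997:21:27:41 -0500] "GET /a.html HTTP/1.0" 200 512
desktop.example.org - - [01/Mar/1997:21:30:02 -0500] "GET / HTTP/1.0" 200 3270
"""

def summarize(log_text):
    """Return (requests per host, hosts that asked for /robots.txt)."""
    hits = Counter()
    robots_txt_hosts = set()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue                      # skip lines in an unexpected format
        host, _when, _method, path, _status = match.groups()
        hits[host] += 1
        if path == "/robots.txt":
            robots_txt_hosts.add(host)
    return hits, robots_txt_hosts

hits, suspects = summarize(SAMPLE_LOG)
for host, count in hits.most_common():
    note = "requested /robots.txt" if host in suspects else ""
    print(host, count, note)
```

A host that shows up with many requests and a /robots.txt fetch is a likely robot; a single request from a desktop machine is probably a person.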

I've been visited by a robot! Now what?

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.

If you think you have discovered a new robot (i.e. one that is not on the list of active robots) and it makes more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!

A robot is traversing my whole site too fast!

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.

First of all, check whether it is a problem by checking the load of your server and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.

However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.

If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.

How do I keep a robot off my server?

Read the next section...

Robots exclusion standard

Why do I find entries for /robots.txt in my log files?

They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below.

If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.

Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but it's easy to be more selective than that.

Where do I find out how /robots.txt files work?

You can read the whole standard specification but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', are comments.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.

Two common errors:

Regular expressions are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'.
You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec)
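
If you want to check how a file like the example above would be read, one option is the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below feeds it the example file directly; the robot name 'somebot' and the paths are made up, and other robots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())    # parse the text without any network access

def allowed(robot, path):
    """True if the named robot may fetch the given path on this server."""
    return parser.can_fetch(robot, "http://webcrawler.com" + path)

for robot, path in [
    ("webcrawler", "/tmp/scratch.html"),   # nothing is disallowed for webcrawler
    ("lycra", "/index.html"),              # everything is disallowed for lycra
    ("somebot", "/index.html"),            # falls under the '*' paragraph
    ("somebot", "/tmp/scratch.html"),
    ("somebot", "/logsheet.html"),         # '/logs' is a plain prefix, not a directory
]:
    print(robot, path, allowed(robot, path))
```

The last case shows why Disallow values are prefixes rather than patterns: '/logs' also covers '/logsheet.html'.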

Will the /robots.txt standard be extended?

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.

What if I can't make a /robots.txt file?

Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.

The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed.

If you do:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.
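
For robot authors, a minimal sketch of reading this tag with Python's standard html.parser module might look like the following. The class and function names are hypothetical, and the sample page is made up.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for term in attributes.get("content", "").split(","):
                if term.strip():
                    self.directives.add(term.strip().upper())

def robots_directives(html_text):
    """Return the set of ROBOTS directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives

PAGE = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body>Hi</body></html>'
print(robots_directives(PAGE))
```

Note that a robot has to retrieve the page before it can see the tag, so unlike /robots.txt this does not stop the request itself.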

Availability

Where can I use a robot?

If you mean a search service, check out the various directory pages on the Web, such as Netscape's Exploring the Net, or try one of the Meta search services such as MetaSearch.

Where can I get a robot?

Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly.

In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.

Where can I get the source code for a robot?

See above -- some may be willing to give out source code.

Alternatively, check out the libwww-perl5 package, which has a simple example.

I'm writing a robot, what do I need to be careful of?

Lots. First read through all the stuff on the robot page then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work :-)

I've written a robot, how do I list it?

Simply fill in a form you can find on The Web Robots Database and email it to me.

Martijn Koster

A few words about how search engine robots (spiders) work
Andrei Alikberov, Center for Information Technologies

Introduction
ROBOTS meta tags

Introduction

This article is by no means an attempt to explain how search engines work in general (that is their makers' know-how). However, in my opinion, it will help you understand how to control the behavior of search robots (wanderers, spiders, robots: programs with which a search system scours the network and indexes the documents it encounters) and how to organize the structure of a server and the documents on it so that your server is indexed easily and well.

The first reason I decided to write this article was an occasion when I was examining my server's access log and found the following two lines there:

lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270

That is, Lycos contacted my server, learned from its first request that there was no /robots.txt file, sniffed at the first page, and left. Naturally, I did not like this, and I set out to find out what was going on.

It turns out that all "smart" search engines first request this file, which should be present on every server. The file describes access rights for search robots, and it is possible to give different robots different rights. There is a standard for it, called the Standard for Robot Exclusion.

According to Louis Monier (AltaVista), only 5% of all sites currently have non-empty /robots.txt files, if the files exist there at all. This is confirmed by information gathered in a recent study of the Lycos robot's logs. Charles P. Kollar (Lycos) writes that only 6% of all requests for /robots.txt return result code 200. Here are a few reasons why this happens:

the people who install Web servers simply do not know about this standard or about the need for a /robots.txt file.
the person who installed the Web server is not necessarily the one who fills it with content, and whoever is the webmaster does not have proper contact with the administrator of the "hardware" itself.
this figure reflects the number of sites that really need to exclude unnecessary robot requests, since not every server has traffic so heavy that a visit by a search robot becomes noticeable to ordinary users.

The /robots.txt file format.

The /robots.txt file tells all search robots (spiders) to index information servers as defined in the file, that is, only those directories and files of the server that are NOT described in /robots.txt. The file must contain zero or more records, each associated with a particular robot (as determined by the value of the agent_id field), which state for each robot, or for all of them at once, exactly what they must NOT index. Whoever writes the /robots.txt file must specify the Product Token substring of the User-Agent field that each robot sends in its HTTP request to the server being indexed. For example, the current Lycos robot sends the following as its User-Agent field:

	Lycos_Spider_(Rex)/1.0 libwww/3.1

If the Lycos robot does not find a description for itself in /robots.txt, it acts as it sees fit. As soon as the Lycos robot "sees" a description for itself in /robots.txt, it does as it is told.

When creating a /robots.txt file, one more factor should be taken into account: the size of the file. Since every file that should not be indexed is described, and separately for many types of robots at that, a large number of excluded files makes /robots.txt too large. In that case, use one or more of the following ways of reducing the size of /robots.txt:

specify a directory that should not be indexed and place the files that must not be indexed in that directory
design the structure of the server so that exclusions are easy to describe in /robots.txt
specify a single indexing policy for all agent_id values
specify masks for directories and files

Records of the /robots.txt file

General description of the record format.

[ # comment string NL ]*
User-Agent: [ [ WS ]+ agent_id ]+ [ [ WS ]* # comment string ]? NL
[ # comment string NL ]*
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
[
  # comment string NL
  |
  Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
]*
[ NL ]+

Parameters

Description of the parameters used in /robots.txt records

[...]+ Square brackets followed by a + sign mean that one or more terms must be given as parameters.

For example, "User-Agent:" may be followed by one or more agent_id values, separated by spaces.

[...]* Square brackets followed by a * sign mean that zero or more terms may be given as parameters.

For example, you may or may not write comments.

[...]? Square brackets followed by a ? sign mean that zero or one term may be given as a parameter.

For example, "User-Agent: agent_id" may be followed by a comment.

..|.. means either what precedes the bar or what follows it.

WS one of the characters: space (octal 040) or tab (octal 011)

NL one of the characters: line feed (octal 012), carriage return (octal 015), or both of these characters (Enter)

User-Agent: keyword (upper and lower case do not matter).

Its parameters are the agent_id values of search robots.

Disallow: keyword (upper and lower case do not matter).

Its parameters are the full paths of the files or directories that must not be indexed

# starts a comment line; comment string is the body of the comment itself.

agent_id any number of characters, not including WS and NL, that identify the agent_id of a particular search robot. The * sign denotes all robots at once.

path_root any number of characters, not including WS and NL, that identify the files and directories that must not be indexed.

Extended comments on the format.

Each record begins with a User-Agent line, which states which search robot or robots the record is intended for. The next line is Disallow, which lists the paths and files that must not be indexed. EVERY record MUST have at least these two lines. All other lines are optional. A record may contain any number of comment lines. Each comment line must begin with the # character. Comments may also be placed at the end of User-Agent and Disallow lines. A # character is sometimes added at the end of these lines to show the search robot that a long agent_id or path_root string has ended. If several agent_id values are given on a User-Agent line, the path_root condition on the Disallow line applies equally to all of them. There is no limit on the length of User-Agent and Disallow lines. If a search robot does not find its agent_id in /robots.txt, it ignores /robots.txt.

If you do not need to account for the specifics of each search robot, you can specify exclusions for all robots at once. This is done with the line

	User-Agent: *

If a search robot finds several records in /robots.txt with an agent_id value that matches it, the robot is free to choose any one of them.

Each search robot determines the absolute URLs to read from the server using the /robots.txt records. Upper and lower case in path_root DOES matter.

Examples.

Example 1:

User-Agent: *
Disallow: /

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/

In Example 1 the /robots.txt file contains two records. The first applies to all search robots and forbids indexing any files. The second applies to the Lycos search robot and, when it indexes the server, forbids the directories /cgi-bin/ and /tmp/ and allows everything else. As a result, the server will be indexed only by Lycos.

Example 2:

User-Agent: Copernicus Fred
Disallow:

User-Agent: * Rex
Disallow: /t

In Example 2 the /robots.txt file contains two records. The first allows the search robots Copernicus and Fred to index the whole server. The second forbids all robots, and Rex in particular, to index directories and files such as /tmp/, /tea-time/, /top-cat.txt, /traverse.this, and so on. This is exactly the case of specifying a mask for directories and files.

Example 3:

# This is for every spider!
User-Agent: *
# stay away from this
Disallow: /spiders/not/here/ #and everything in it
Disallow: # a little nothing
Disallow: #This could be habit forming!
# Don't comments make code much more readable!!!

Example 3 contains one record. It forbids all robots to index the directory /spiders/not/here/, including paths and files such as /spiders/not/here/really/ and /spiders/not/here/yes/even/me.html. However, this does not cover /spiders/not/ or /spiders/not/her (in the directory '/spiders/not/').
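
The record format described above, with several agent_id values on a User-Agent line and several path_root values on a Disallow line, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, an agent_id is matched as a case-insensitive substring of the robot's name, and a record that names the robot is preferred over a '*' record (the text leaves that choice to the robot).

```python
def parse_records(text):
    """Parse records; each User-Agent line is taken to start a new record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments
        if ":" not in line:
            continue                              # blank or comment-only line
        field, value = line.split(":", 1)
        field = field.strip().lower()             # keywords ignore case
        if field == "user-agent":
            current = {"agents": value.split(), "disallow": []}
            records.append(current)
        elif field == "disallow" and current is not None:
            current["disallow"].extend(value.split())
    return records

def select_record(records, robot_name):
    """Assumption: prefer a record naming the robot, else the first '*' record."""
    name = robot_name.lower()
    fallback = None
    for record in records:
        for agent in record["agents"]:
            if agent == "*":
                if fallback is None:
                    fallback = record
            elif agent.lower() in name:
                return record
    return fallback

def may_index(records, robot_name, path):
    """True if no path_root of the chosen record is a prefix of path."""
    record = select_record(records, robot_name)
    if record is None:
        return True                               # no matching record: file is ignored
    return not any(path.startswith(root) for root in record["disallow"])

EXAMPLE_1 = "User-Agent: *\nDisallow: /\n\nUser-Agent: Lycos\nDisallow: /cgi-bin/ /tmp/\n"
EXAMPLE_2 = "User-Agent: Copernicus Fred\nDisallow:\n\nUser-Agent: * Rex\nDisallow: /t\n"

ex1 = parse_records(EXAMPLE_1)
ex2 = parse_records(EXAMPLE_2)
print(may_index(ex1, "Lycos_Spider_(Rex)/1.0", "/index.html"))
print(may_index(ex1, "OtherBot", "/index.html"))
print(may_index(ex2, "Fred", "/tmp/notes.txt"))
print(may_index(ex2, "Rex", "/tea-time/"))
```

Because path_root is compared as a plain prefix and case matters, '/t' in Example 2 excludes everything whose path begins with a lower-case t.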

Some problems associated with search robots.

The standard (Standard for Robot Exclusion) is unfinished.

Unfortunately, since search systems appeared not long ago, the standard for robots is still being developed, refined, and so on. This means that in the future search engines will not necessarily follow it.

Increased traffic.

This problem is not very pressing for the Russian sector of the Internet, since there are not many servers in Russia with traffic so heavy that a visit by a search robot would disturb ordinary users. Limiting what robots do is, after all, exactly what the /robots.txt file is for.

Not all search robots use /robots.txt.

At present, only the search robots of systems such as AltaVista, Excite, Infoseek, Lycos, OpenText, and WebCrawler are certain to request this file.

Using HTML meta tags.

Below is an initial draft produced by agreement among programmers from a number of commercial indexing organizations (Excite, Infoseek, Lycos, OpenText, and WebCrawler) at the recent Distributing Indexing Workshop (W3C).

The meeting discussed using HTML meta tags to control the behavior of search robots, but no final agreement was reached. The following issues were identified for future discussion:

Ambiguities in the /robots.txt file specification
A precise definition of the use of HTML meta tags, or additional fields in the /robots.txt file
"Please visit" information
Ongoing control information: the interval or the maximum number of open connections to the server at which indexing of the server may begin.

ROBOTS meta tags

This tag is intended for users who cannot control the /robots.txt file on their Web sites. The tag lets you define the behavior of a search robot for each HTML page, but it cannot completely prevent the robot from requesting the page (as can be specified in the /robots.txt file).

robot_terms (the value of the tag's CONTENT) is a comma-separated list of the following keywords (upper and lower case do not matter): ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.

NONE - tells all robots to ignore this page when indexing (equivalent to using the keywords NOINDEX, NOFOLLOW together).

ALL - allows this page and all links from it to be indexed (equivalent to using the keywords INDEX, FOLLOW together).

INDEX - allows this page to be indexed

NOINDEX - forbids indexing of this page

FOLLOW - allows all links from this page to be indexed

NOFOLLOW - forbids indexing of links from this page

If this meta tag is omitted or no robot_terms are given, the search robot by default acts as if robot_terms = INDEX, FOLLOW (that is, ALL) had been specified. If the keyword ALL is found in CONTENT, the robot acts accordingly and ignores any other keywords that may be present. If CONTENT contains keywords with opposite meanings, for example FOLLOW, NOFOLLOW, the robot acts at its own discretion (in this case, FOLLOW).

If robot_terms contains only NOINDEX, the page is not indexed, but links from it are still followed, since FOLLOW is the default. If robot_terms contains only NOFOLLOW, the page is indexed and its links are, accordingly, ignored.
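
The keyword rules for the ROBOTS tag can be summarized in a short sketch. The function name interpret_robot_terms is hypothetical, and the treatment of contradictory keywords (the permissive one wins) is just one reading of "at its own discretion", taken from the FOLLOW, NOFOLLOW example; a real robot may decide otherwise.

```python
def interpret_robot_terms(content):
    """Return (index, follow) booleans for a CONTENT value such as "NOINDEX, FOLLOW"."""
    terms = {term.strip().upper() for term in content.split(",") if term.strip()}
    if "ALL" in terms:                 # ALL overrides any other keywords
        return True, True
    if "NONE" in terms:                # NONE stands for NOINDEX plus NOFOLLOW
        return False, False
    # Defaults are INDEX and FOLLOW. For contradictory pairs this sketch
    # assumes the permissive keyword wins.
    index = "INDEX" in terms or "NOINDEX" not in terms
    follow = "FOLLOW" in terms or "NOFOLLOW" not in terms
    return index, follow

for content in ["", "NOINDEX", "NOFOLLOW", "noindex, nofollow", "FOLLOW, NOFOLLOW", "ALL, NOINDEX"]:
    print(repr(content), interpret_robot_terms(content))
```

An empty or missing value behaves like ALL, which matches the default described above.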

KEYWORDS meta tag.

phrases - a comma-separated list of words or phrases (upper and lower case do not matter) that help index the page (that is, they reflect the content of the page). Roughly speaking, these are the words in response to which a search system will return this document.

DESCRIPTION meta tag.

text - the text that will be shown in the results summary returned for a user's query to the search system. This text must not contain markup tags, and it makes the most sense to put the gist of the document into it in two or three lines.

Proposed ways of excluding repeat visits using HTML meta tags

Some commercial search robots already use meta tags that allow "communication" between the robot and the webmaster. AltaVista uses the KEYWORDS meta tag, and Infoseek uses the KEYWORDS and DESCRIPTION meta tags.

Index a document once, or do it regularly?

A webmaster can "tell" a search robot or a user's bookmark file that the content of a particular file is going to change. In that case the robot will not save the URL, and the user's browser may or may not add the file to the bookmarks. As long as this information is described only in the /robots.txt file, the user will not know that the page is going to change.

The DOCUMENT-STATE meta tag can be useful for this. By default, this meta tag is assumed to have CONTENT=STATIC.

How can indexing of generated pages, or duplication of documents when a server has mirrors, be excluded?

Generated pages are pages produced by CGI scripts. They almost certainly should not be indexed, because trying to reach them from a search system will produce an error. As for mirrors, it is undesirable for a search to return two different links to different servers with one and the same content. To avoid this, use the URL meta tag, giving the absolute URL of the document (for mirrors, the URL of the corresponding page on the main server).

Sources

Charles P.Kollar, John R.R. Leavitt, Michael Mauldin, Robot Exclusion Standard Revisited, www.kollar.com/robots.html
Martijn Koster, Standard for robot exclusion, info.webcrawler.com/mak/projects/robots/robots.html

Standard for robot exclusion

Martijn Koster, translated by A. Alikberov

Status of this document
Introduction
Purpose
Format
Examples
Translator's notes
Authors' addresses

Status of this document

This document was compiled on 30 July 1994 from discussions on the mailing list robots-request@nexor.co.uk (the list has since moved to WebCrawler; for details see the Robots pages at WebCrawler, info.webcrawler.com/mak/projects/robots/) among the majority of search robot authors and other interested people. The topic is also open for discussion on the Technical World Wide Web mailing list, www-talk@info.cern.ch. This document is based on an earlier working draft of the same name.

This document is not an official standard or the corporate standard of any company, and it does not guarantee that all current and future search robots will use it. In accordance with it, most robot authors offer a way to protect Web servers from unwanted visits by their search robots.

The latest version of this document can be found at info.webcrawler.com/mak/projects/robots/robots.html

Introduction

Search robots (wanderers, spiders) are programs that index Web pages on the Internet.

In 1993 and 1994 it became clear that robots sometimes index servers against the wishes of the servers' owners. In particular, robots sometimes make it harder for ordinary users to work with a server, and sometimes the same files are indexed several times. In other cases robots index the wrong things, for example very "deep" virtual directories, temporary information, or CGI scripts. This standard is intended to solve such problems.

Purpose

To keep a robot from visiting a server or parts of it, a file must be created on the server containing information that controls the behavior of the search robot. This file must be accessible via HTTP at the local URL /robots.txt. The contents of this file are described below.

This approach was chosen so that a search robot can find the rules describing what is required of it with just a simple request for a single file. In addition, a /robots.txt file is easy to create on any existing Web server.

The choice of this particular URL was motivated by several criteria:

The file name had to be the same on any operating system
The file name extension must not require any reconfiguration of the server
The file name had to be easy to remember and reflect its purpose
The likelihood of a clash with existing files had to be minimal

Format

The format and semantics of the /robots.txt file are as follows:

The file must contain one or more records, separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record must contain lines of the form:

"<field>:<optional_space><value><optional_space>".

The <field> name is case-insensitive.

Comments may be included in the file in the usual UNIX form: the # character marks the start of a comment, and the end of the line ends the comment.

A record must begin with one or more User-Agent lines, followed by one or more Disallow lines, whose format is given below. Unrecognized lines are ignored.
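
As an illustration of this line format, here is a minimal Python sketch that splits lines into field and value and groups them into records separated by blank lines. The function names and the sample text are hypothetical; only User-Agent and Disallow are treated as recognized fields, as in the description above.

```python
def parse_line(line):
    """Split one /robots.txt line into (field, value), or None if it has neither."""
    line = line.split("#", 1)[0]               # a '#' starts a comment
    if ":" not in line:
        return None                            # blank, comment-only or unrecognized
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()   # field names ignore case

def parse_robots_txt(text):
    """Group recognized lines into records separated by blank lines."""
    records = []
    current = {"user-agent": [], "disallow": []}
    for raw in text.splitlines() + [""]:       # the trailing "" flushes the last record
        if not raw.strip():
            if current["user-agent"]:
                records.append(current)
            current = {"user-agent": [], "disallow": []}
            continue
        parsed = parse_line(raw)
        if parsed and parsed[0] in current:    # unrecognized fields are ignored
            current[parsed[0]].append(parsed[1])
    return records

SAMPLE = "User-agent: a\nUser-agent: b\nDisallow: /tmp # scratch\nBogus: 1\n\nUser-agent: *\nDisallow:\n"
for record in parse_robots_txt(SAMPLE):
    print(record)
```

Comment-only lines are skipped without ending a record, while a truly blank line closes the current one.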

User-Agent

the <value> of this field must be the name of the search robot for which this record sets access rights.
if more than one robot name is given in a record, the access rights apply to all the names given.
upper and lower case do not matter
if, as the val