The Web Robots FAQ

The original of this document is at http://info.webcrawler.com/mak/projects/robots/faq.html

These are frequently asked questions about Web robots. Send suggestions and comments to Martijn Koster.

About WWW robots
Indexing robots
For Server Administrators
Robots exclusion standard
Availability

About Web Robots

What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
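
As a rough illustration of the definition above, here is a minimal sketch of the traversal in Python. It is not taken from any real robot: the `fetch` callback, the `PAGES` dictionary and its URLs are hypothetical stand-ins for real HTTP retrieval, and breadth-first order is just one possible choice of traversal.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def traverse(start, fetch):
    """Visit every document reachable from `start`, each one once.

    `fetch` is a caller-supplied function mapping a URL to its HTML text
    (or None if it cannot be retrieved); it stands in for real HTTP.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never request the same document twice
                seen.add(link)
                queue.append(link)
    return visited


# A hypothetical three-page "web" held in memory instead of on a network.
PAGES = {
    "/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "/b.html": "no links here",
}

print(traverse("/", PAGES.get))
```

A real robot would also resolve relative links, space out its requests, and honour /robots.txt before fetching anything.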

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.

What is an agent?

The word "agent" has many meanings in computing these days. Specifically:

Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do (e.g. General Magic's Telescript). These can only travel between special servers and are currently not widespread on the Internet.
Intelligent agents are programs that help users with things, such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.
User-agent is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Explorer, Email User-agents like Qualcomm Eudora, etc.

What is a search engine?

A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

What other kinds of robots are there?

Robots can be used for a number of purposes:

Indexing
HTML validation
Link validation
"What's New" monitoring
Mirroring

See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list...

So what are Robots, Spiders, Web Crawlers, Worms, Ants?

They're all names for the same sort of thing, with slightly different connotations:

Robots: the generic name, see above.
Spiders: same as robots, but sounds cooler in the press.
Worms: same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers: same as robots, but note WebCrawler is a specific robot
WebAnts: distributed cooperating robots.

Aren't robots bad for the web?

There are a few reasons people believe robots are bad for the Web:

Certain robot implementations can overload networks and servers (and have done so in the past). This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.

But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.

Are there any robot books?

Yes:

Internet Agents: Spiders, Wanderers, Brokers, and Bots by Fah-Chun Cheong.

This book covers Web robots, commerce transaction agents, Mud agents, and a few others. It includes source code for a simple Web robot based on top of libwww-perl4.

Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.

Published by New Riders, ISBN 1-56205-463-5.

Bots and Other Internet Beasties by Joseph Williams

I haven't seen this myself, but someone said: The Williams book 'Bots and other Internet Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together.

Published by Sams, ISBN: 1-57521-016-9

Web Client Programming with Perl by Clinton Wong

This O'Reilly book is planned for Fall 1996; check the O'Reilly Web Site for the current status. It promises to be a practical book, but I haven't seen it yet.

A few others can be found on The Software Agents Mailing List FAQ.

Where do I find out more about robots?

There is a Web robots home page on: http://info.webcrawler.com/mak/projects/robots/robots.html

While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information, maintained by Martijn Koster <m.koster@webcrawler.com>.

Of course the latest version of this FAQ is there.

You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.

Indexing robots

How does a robot decide where to visit?

This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.

Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.

Sometimes other sources for URLs are used, such as scanners through USENET postings, published mailing list archives, etc.

Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.

How does an indexing robot decide what to index?

If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: Some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.

We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...

How do I register my page with a robot?

You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page.

Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.

For Server Administrators

How do I know if I've been visited by a robot?

You can check your server logs for sites that retrieve many documents, especially in a short time.

If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.

Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
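
The checks above can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes the server writes Common Log Format lines, the host names in `SAMPLE` are made up, and `summarize` is a hypothetical helper rather than part of any server software.

```python
import re
from collections import Counter

# Matches the requesting host and the request path in a Common Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def summarize(log_lines):
    """Return (requests per host, set of hosts that asked for /robots.txt)."""
    hits = Counter()
    robots_txt_hosts = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        host, path = match.groups()
        hits[host] += 1
        if path == "/robots.txt":
            robots_txt_hosts.add(host)
    return hits, robots_txt_hosts


# Made-up sample lines; real logs would be read from the server's access log.
SAMPLE = [
    'crawler.example.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -',
    'crawler.example.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270',
    'pc17.example.org - - [01/Mar/1997:21:30:02 -0500] "GET /index.html HTTP/1.0" 200 3270',
]

hits, suspects = summarize(SAMPLE)
print(hits.most_common(1))  # the busiest host
print(suspects)             # hosts that requested /robots.txt
```

Hosts that appear in both results, with many requests and a fetch of /robots.txt, are the likeliest robots.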

I've been visited by a robot! Now what?

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.

If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots) and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!

A robot is traversing my whole site too fast!

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.

First of all check if it is a problem by checking the load of your server, and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.

However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.

If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.

How do I keep a robot off my server?

Read the next section...

Robots exclusion standard

Why do I find entries for /robots.txt in my log files?

They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below.

If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.

Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is put these two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but it's easy to be more selective than that.

Where do I find out how /robots.txt files work?

You can read the whole standard specification but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', are comments.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.

Two common errors:

Regular expressions are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'.
You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).
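
One way to check how a file like the example is interpreted is to feed it to a parser. The sketch below uses `urllib.robotparser` from the Python standard library, which is one implementation of these rules, not the one any particular robot uses; the robot name `somebot` and the helper `allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(robot, path):
    """Ask the parser whether `robot` may fetch `path` on this server."""
    return parser.can_fetch(robot, "http://webcrawler.com" + path)


print(allowed("webcrawler", "/tmp/x.html"))  # nothing is disallowed for webcrawler
print(allowed("lycra", "/index.html"))       # everything is disallowed for lycra
print(allowed("somebot", "/tmp/x.html"))     # falls under the '*' record
print(allowed("somebot", "/index.html"))     # not under /tmp or /logs
```

Because matching is by plain prefix, 'Disallow: /tmp' also covers a path such as /tmpfile.html.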

Will the /robots.txt standard be extended?

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.

What if I can't make a /robots.txt file?

Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.

The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed.

If you do:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.
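
From the robot's side, honouring these tags means looking for them while parsing the page. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and a real robot would fold this into its normal HTML parsing.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the keywords from every <META NAME="ROBOTS"> tag."""

    def __init__(self):
        super().__init__()
        self.terms = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lower-cases attribute names; values keep their case.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for term in attributes.get("content", "").split(","):
                if term.strip():
                    self.terms.add(term.strip().upper())


def robots_terms(html):
    """Return the set of ROBOTS keywords found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.terms


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body>Hi</body></html>'
terms = robots_terms(page)
print("index this page:", "NOINDEX" not in terms)
print("follow its links:", "NOFOLLOW" not in terms)
```

Note that the robot still has to retrieve the page to see the tag, so this keeps a document out of the index but does not prevent the visit.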

Availability

Where can I use a robot?

If you mean a search service, check out the various directory pages on the Web, such as Netscape's Exploring the Net, or try one of the Meta search services such as MetaSearch.

Where can I get a robot?

Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly.

In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.

Where can I get the source code for a robot?

See above -- some may be willing to give out source code.

Alternatively check out the libwww-perl5 package, which has a simple example.

I'm writing a robot, what do I need to be careful of?

Lots. First read through all the stuff on the robot page then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work :-)

I've written a robot, how do I list it?

Simply fill in a form you can find on The Web Robots Database and email it to me.

Martijn Koster

A Few Words on How Search Engine Robots (Spiders) Work
Andrey Alikberov, Center for Information Technologies

Introduction
ROBOTS meta tags

Introduction

This article is by no means an attempt to explain how search engines work in general (that is the know-how of their makers). However, in my opinion it will help you understand how the behaviour of search robots can be controlled (wanderers, spiders, robots: the programs with which a search system scours the network and indexes the documents it encounters), and how to structure your server and its documents so that your server is indexed easily and well.

What first prompted me to write this article was an occasion when I was examining my server's access log and found the following two lines:

lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270

That is, Lycos contacted my server, learned from its first request that there was no /robots.txt file, sniffed at the home page, and left. Naturally I did not like this, and I set out to find out what was going on.

It turns out that all "smart" search engines first request this file, which should be present on every server. The file describes access rights for search robots, and different rights can be given to different robots. It is governed by a standard called the Standard for Robot Exclusion.

According to Louis Monier (Altavista), only 5% of all sites currently have a non-empty /robots.txt file, if they have one at all. This is confirmed by data gathered in a recent study of the Lycos robot's logs. Charles P. Kollar (Lycos) writes that only 6% of all requests for /robots.txt return result code 200. Here are a few reasons why this happens:

People who install Web servers simply do not know about this standard or about the need for a /robots.txt file.
The person who installed the Web server is not necessarily the one who fills it with content, and the webmaster may not be in proper contact with the administrator of the machine itself.
The figure reflects the number of sites that really need to exclude unwanted robot requests, since not every server has traffic heavy enough for a robot's visit to become noticeable to ordinary users.

Format of the /robots.txt file

The /robots.txt file tells all search robots (spiders) to index the server as the file specifies, that is, only those directories and files that are NOT listed in /robots.txt. The file must contain zero or more records, each tied to a particular robot (as determined by the value of the agent_id field), stating for each robot, or for all of them at once, exactly what they must NOT index. Whoever writes the /robots.txt file must give the Product Token substring of the User-Agent field that each robot sends in its HTTP requests to the server being indexed. For example, the current Lycos robot sends the following User-Agent field:

	Lycos_Spider_(Rex)/1.0 libwww/3.1

If the Lycos robot does not find a record for itself in /robots.txt, it does as it sees fit. Once the Lycos robot has "seen" a record for itself in /robots.txt, it does as instructed.

When creating a /robots.txt file, consider one more factor: the size of the file. Because every file that should not be indexed is listed, possibly separately for many kinds of robots, a large number of excluded files makes /robots.txt too big. In that case, use one or more of the following ways to shrink it:

Name a directory that should not be indexed, and put the files to be excluded in it.
Design the server's structure so that exclusions are simple to describe in /robots.txt.
Specify a single indexing policy for all agent_id values.
Specify masks for directories and files.

Records of the /robots.txt file

General description of the record format.

[ # comment string NL ]*
User-Agent: [ [ WS ]+ agent_id ]+ [ [ WS ]* # comment string ]? NL
[ # comment string NL ]*
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
[
  # comment string NL
  |
  Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
]*
[ NL ]+

Parameters

Description of the parameters used in /robots.txt records

[...]+ Square brackets followed by a + sign mean that one or more terms must be given.

For example, "User-Agent:" may be followed by one or more agent_id values separated by spaces.

[...]* Square brackets followed by a * sign mean that zero or more terms may be given.

For example, you may write comments or leave them out.

[...]? Square brackets followed by a ? sign mean that zero or one term may be given.

For example, "User-Agent: agent_id" may be followed by a comment.

..|.. means either what precedes the bar or what follows it.

WS is one of the characters space (octal 040) or tab (octal 011).

NL is one of the characters line feed (octal 012) or carriage return (octal 015), or both together (Enter).

User-Agent: a keyword (case does not matter).

Its parameters are the agent_id values of search robots.

Disallow: a keyword (case does not matter).

Its parameters are the full paths of the files or directories that must not be indexed.

# starts a comment; comment string is the body of the comment.

agent_id is any number of characters, excluding WS and NL, identifying the agent_id of a particular search robot. The * sign stands for all robots at once.

path_root is any number of characters, excluding WS and NL, identifying the files and directories that must not be indexed.

Extended notes on the format.

Every record begins with a User-Agent line stating which search robot or robots the record is for. The next line is Disallow, which lists the paths and files that must not be indexed. EVERY record MUST have at least these two lines. All other lines are optional. A record may contain any number of comment lines. Every comment line must begin with the # character. Comments may also be placed at the end of User-Agent and Disallow lines. A # at the end of such a line is sometimes added to show the robot that a long agent_id or path_root string has ended. If a User-Agent line names several agent_id values, the path_root condition in the Disallow line applies to all of them equally. There is no limit on the length of User-Agent and Disallow lines. If a search robot does not find its agent_id in /robots.txt, it ignores /robots.txt.

If you do not need to account for the specifics of each search robot, you can state exclusions for all robots at once. This is done with the line

	User-Agent: *

If a search robot finds several records in /robots.txt whose agent_id matches it, the robot is free to choose any of them.

Each search robot determines the absolute URL to fetch from the server using the /robots.txt records. Upper and lower case in path_root ARE significant.

Examples.

Example 1:

User-Agent: *
Disallow: /

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/

In Example 1 the /robots.txt file contains two records. The first applies to all search robots and forbids indexing any file. The second applies to the Lycos robot: when it indexes the server, the directories /cgi-bin/ and /tmp/ are forbidden and everything else is allowed. As a result, the server will be indexed only by Lycos.

Example 2:

User-Agent: Copernicus Fred
Disallow:

User-Agent: * Rex
Disallow: /t

In Example 2 the /robots.txt file contains two records. The first allows the robots Copernicus and Fred to index the whole server. The second forbids all robots, and Rex in particular, from indexing directories and files such as /tmp/, /tea-time/, /top-cat.txt, /traverse.this and so on. This is exactly the case of specifying a mask for directories and files.

Example 3:

# This is for every spider!
User-Agent: *
# stay away from this
Disallow: /spiders/not/here/ #and everything in it
Disallow: # a little nothing
Disallow: #This could be habit forming!
# Don't comments make code much more readable!!!

Example 3 contains a single record. It forbids all robots from indexing the directory /spiders/not/here/, including paths and files such as /spiders/not/here/really/ and /spiders/not/here/yes/even/me.html. It does not, however, cover /spiders/not/ or /spiders/not/her (in the directory '/spiders/not/').
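
The matching rule behind Examples 2 and 3 is a plain, case-sensitive prefix test. A minimal sketch in Python, where `is_disallowed` and `ROOTS` are hypothetical names used only for illustration:

```python
def is_disallowed(path, path_roots):
    """True if `path` begins with any non-empty Disallow value."""
    return any(root and path.startswith(root) for root in path_roots)


# Disallow values from Example 3; the two empty Disallow lines exclude nothing.
ROOTS = ["/spiders/not/here/", "", ""]

for path in ("/spiders/not/here/really/",
             "/spiders/not/here/yes/even/me.html",
             "/spiders/not/",
             "/spiders/not/her"):
    print(path, is_disallowed(path, ROOTS))

# The mask from Example 2: every path beginning with /t is excluded.
print(is_disallowed("/tea-time/", ["/t"]))
```

The first two paths are excluded and the last two are not, because neither of them begins with the full string /spiders/not/here/.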

Some problems associated with search robots.

The standard (Standard for Robot Exclusion) is unfinished.

Unfortunately, because search systems appeared not long ago, the standard for robots is still being developed and revised. This means search engines will not necessarily follow it in the future.

Increased traffic.

This problem is not very pressing for the Russian sector of the Internet, since Russia does not have many servers with traffic so heavy that a search robot's visit would get in the way of ordinary users. After all, the purpose of the /robots.txt file is precisely to limit what robots do.

Not all search robots use /robots.txt.

At present this file is reliably requested only by the robots of systems such as Altavista, Excite, Infoseek, Lycos, OpenText and WebCrawler.

Using HTML meta tags.

What follows is an initial draft produced by agreement among programmers from several commercial indexing organizations (Excite, Infoseek, Lycos, Opentext and WebCrawler) at the recent Distributed Indexing Workshop (W3C).

At this meeting the use of HTML meta tags to control search robot behaviour was discussed, but no final agreement was reached. The following issues were identified for future discussion:

Ambiguities in the /robots.txt file specification
A precise definition of the use of HTML meta tags, or additional fields in the /robots.txt file
"Please visit" information
Current control information: the request interval or maximum number of open connections to the server at which indexing may begin.

ROBOTS meta tags

This tag is intended for users who cannot control the /robots.txt file on their Web sites. The tag lets you set search robot behaviour for each HTML page, although it cannot stop the robot from requesting the page altogether (as /robots.txt can).

robot_terms, the value of the tag's CONTENT attribute, is a comma-separated list of the following keywords (case does not matter): ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.

NONE - tells all robots to ignore this page when indexing (equivalent to using NOINDEX, NOFOLLOW together).

ALL - allows this page and all links from it to be indexed (equivalent to using INDEX, FOLLOW together).

INDEX - allows this page to be indexed

NOINDEX - forbids indexing this page

FOLLOW - allows all links from this page to be indexed

NOFOLLOW - forbids indexing links from this page

If this meta tag is omitted, or no robot_terms are given, the robot by default behaves as if robot_terms = INDEX, FOLLOW (that is, ALL) had been specified. If the keyword ALL is found in CONTENT, the robot acts accordingly and ignores any other keywords given. If CONTENT contains contradictory keywords, for example FOLLOW, NOFOLLOW, the robot acts at its own discretion (in this case, FOLLOW).

If robot_terms contains only NOINDEX, the page itself is not indexed. If robot_terms contains only NOFOLLOW, the page is indexed but its links are ignored.
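
These rules can be condensed into a small decision function. The sketch below is one possible reading, not a reference implementation: `resolve` is a hypothetical name, it takes NOINDEX to affect only the page and NOFOLLOW only its links (as in the keyword list), and it settles an INDEX/NOINDEX conflict the same way the text settles FOLLOW/NOFOLLOW, which is an assumption.

```python
def resolve(content):
    """Map a CONTENT value to an (index, follow) pair of booleans."""
    terms = {t.strip().upper() for t in (content or "").split(",") if t.strip()}
    if not terms or "ALL" in terms:
        return True, True    # the default, or ALL overriding other keywords
    if "NONE" in terms:
        return False, False  # same as NOINDEX, NOFOLLOW
    # On contradictory keywords the permissive one wins. The text states this
    # only for FOLLOW/NOFOLLOW; treating INDEX/NOINDEX alike is an assumption.
    index = "INDEX" in terms or "NOINDEX" not in terms
    follow = "FOLLOW" in terms or "NOFOLLOW" not in terms
    return index, follow


for content in (None, "NOINDEX", "nofollow", "NONE", "FOLLOW, NOFOLLOW", "ALL, NOINDEX"):
    print(repr(content), resolve(content))
```

A robot would call this once per page, after extracting the tag, and before deciding whether to store the page and queue its links.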

KEYWORDS meta tag.

phrases - a comma-separated list of words or phrases (case does not matter) that help index the page, that is, reflect its content. Roughly speaking, these are the words in response to which a search system will return this document.

DESCRIPTION meta tag.

text - the text shown in the summary returned for a user's query to the search system. It must not contain markup tags, and it is most sensible to capture the meaning of the document in two or three lines.

Proposed ways to avoid repeat visits using HTML meta tags

Some commercial search robots already use meta tags that allow "communication" between the robot and the webmaster. Altavista uses the KEYWORDS meta tag, and Infoseek uses the KEYWORDS and DESCRIPTION meta tags.

Index a document once, or do it regularly?

A webmaster can "tell" a search robot or a user's bookmark file that the content of a given file will change. In that case the robot will not store the URL, and the user's browser may or may not add the file to its bookmarks. As long as this information is described only in /robots.txt, the user will not know that the page is going to change.

The DOCUMENT-STATE meta tag may be useful for this. By default, this meta tag is taken to have CONTENT=STATIC.

How can indexing of generated pages, or duplication of documents on server mirrors, be excluded?

Generated pages are pages produced by CGI scripts. They almost certainly should not be indexed, because trying to reach them from a search system will produce an error. As for mirrors, it is no good when two different links to different servers are returned for the same content. To avoid this, use the URL meta tag giving the absolute URL of the document (for mirrors, the corresponding page on the main server).

Sources

Charles P.Kollar, John R.R. Leavitt, Michael Mauldin, Robot Exclusion Standard Revisited, www.kollar.com/robots.html
Martijn Koster, Standard for robot exclusion, info.webcrawler.com/mak/projects/robots/robots.html

Standard for robot exclusion

Martijn Koster, translated by A. Alikberov

Status of this document
Introduction
Purpose
Format
Examples
Translator's notes
Authors' addresses

Status of this document

This document was drawn up on 30 July 1994 from discussions on the robots mailing list robots-request@nexor.co.uk (the list has since moved to WebCrawler; for details see the Robots pages at WebCrawler, info.webcrawler.com/mak/projects/robots/) among the majority of search robot authors and other interested people. The topic is also open for discussion on the Technical World Wide Web mailing list, www-talk@info.cern.ch. This document is based on a previous working draft of the same name.

This document is not an official standard or anyone's corporate standard, and it does not guarantee that all current and future search robots will follow it. Through it, the majority of robot authors offer a way to protect Web servers from unwanted visits by their robots.

The latest version of this document can be found at info.webcrawler.com/mak/projects/robots/robots.html

Introduction

Search robots (wanderers, spiders) are programs that index Web pages on the Internet.

In 1993 and 1994 it became clear that robots sometimes index servers against the wishes of the servers' owners. In particular, robots sometimes make the server hard for ordinary users to work with, and sometimes the same files are indexed several times. In other cases robots index the wrong things, for example very "deep" virtual directories, temporary information, or CGI scripts. This standard is intended to solve such problems.

Purpose

To keep a robot away from a server or parts of it, create a file on the server containing information that controls the search robot's behaviour. This file must be accessible via HTTP at the local URL /robots.txt. The contents of the file are described below.

This approach was chosen so that a search robot can find the rules describing what is required of it with a simple request for a single file. In addition, a /robots.txt file is easy to create on any existing Web server.

The choice of this particular URL was motivated by several criteria:

The file name had to be the same on any operating system
The file's extension must not require any reconfiguration of the server
The file name had to be easy to remember and reflect its purpose
The likelihood of a clash with existing files had to be minimal

Format

The format and semantics of the /robots.txt file are as follows:

The file must contain one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record must contain lines of the form:

"<field>:<optional_space><value><optional_space>".

The <field> name is case-insensitive.

Comments may be included in the file in the usual UNIX form: the # character marks the start of a comment, and the end of the line ends it.

A record must begin with one or more User-Agent lines, followed by one or more Disallow lines, whose format is given below. Unrecognized lines are ignored.
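
To make the record structure concrete, here is a minimal hand-written parser sketch in Python. It follows only the rules stated so far (field:value lines, # comments, blank-line separators, unrecognized lines ignored); `parse_records` and the sample text, including its made-up `Crawl-thing` field, are hypothetical and not part of the standard.

```python
def parse_records(text):
    """Split /robots.txt text into records of User-Agent and Disallow values."""
    records, current = [], None
    for raw in text.splitlines():  # splitlines handles CR, CR/NL and NL endings
        line = raw.split("#", 1)[0].strip()  # '#' starts a comment
        if not line:
            if not raw.strip():  # a truly blank line ends the current record
                current = None
            continue             # a comment-only line does not end it
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep or field not in ("user-agent", "disallow"):
            continue             # unrecognized lines are ignored
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        current[field].append(value)
    return records


SAMPLE = """\
# sample file
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /tmp # temporary files
Disallow: /logs
Crawl-thing: 1
"""

for record in parse_records(SAMPLE):
    print(record)
```

Field names are lower-cased before comparison because the field is case-insensitive, while the values are kept exactly as written.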

User-Agent

The <value> of this field must be the name of the search robot whose access rights the record sets.
If the record names more than one robot, the access rights apply to all the names given.
Case does not matter.
If the val