Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.
So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.
Published by New Riders, ISBN 1-56205-463-5.
Williams' book 'Bots and other Internet Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together.
Published by Sams, ISBN 1-57521-016-9.
While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information, maintained by Martijn Koster <m.koster@webcrawler.com>.
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.
Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources for URLs are used, such as scans of USENET postings, published mailing list archives, etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...
Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
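That last hint can be checked mechanically. Here is a minimal sketch, assuming your server writes its access log in Common Log Format; the function name and the sample host names are made up for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')

def robots_txt_requesters(lines):
    """Count, per remote host, how many requests asked for /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(3) == "/robots.txt":
            counts[match.group(1)] += 1
    return counts

# Hypothetical log lines, for illustration only.
sample = [
    'crawler.example.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -',
    'crawler.example.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270',
    'desktop.example.org - - [01/Mar/1997:21:30:02 -0500] "GET /index.html HTTP/1.0" 200 512',
]
suspects = robots_txt_requesters(sample)
print(suspects.most_common())
```

A host that shows up here repeatedly is a candidate robot, not a proven one; compare it against the list of active robots before drawing conclusions.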
If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots) and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
First of all check if it is a problem by checking the load of your server, and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try and find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.
If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)
User-agent: *
Disallow: /

but it's easy to be more selective than that.
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.
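To see how a robot might apply these three paragraphs, here is a small sketch using Python's standard urllib.robotparser module. The robot name 'somebot' and the test paths are invented for illustration, and this parser's matching rules are one implementation's interpretation, not the only possible one.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Ask the parsed rules whether the named robot may fetch the path."""
    return parser.can_fetch(agent, path)

print(allowed("webcrawler", "/tmp/scratch.html"))  # nothing is disallowed for webcrawler
print(allowed("lycra", "/index.html"))             # lycra is shut out of the whole site
print(allowed("somebot", "/logs/access.log"))      # 'somebot' is hypothetical; falls under '*'
print(allowed("somebot", "/index.html"))
```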
Two common errors:
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.
In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.
Alternatively check out the libwww-perl5 package, which has a simple example.
Introduction
This article is by no means an attempt to explain how search engines work in general (that is the know-how of their makers). In my opinion, however, it will help you understand how to control the behavior of search robots (wanderers, spiders, robots: the programs a search system uses to scour the network and index the documents it finds), and how to structure a server and the documents on it so that your server is indexed easily and well.
The first thing that prompted me to write this article was finding the following two lines while examining the access log of my server:
lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270
That is, Lycos contacted my server, learned from its first request that there was no /robots.txt file, sniffed at the first page, and left. Naturally I did not like this, so I set out to find what was going on.
It turns out that all "smart" search engines first request this file, which should be present on every server. The file describes access rights for search robots, and different rights can be specified for different robots. There is a standard for it, called the Standard for Robot Exclusion.
According to Louis Monier (Altavista), only 5% of all sites currently have non-empty /robots.txt files, if the files exist at all. This is confirmed by information gathered in a recent study of the Lycos robot's logs. Charles P. Kollar (Lycos) writes that only 6% of all requests for /robots.txt return result code 200. Here are a few reasons why this happens:
The /robots.txt file tells all search robots (spiders) to index information servers as defined in the file, that is, only those directories and files of the server that are NOT described in /robots.txt. The file must contain zero or more records, each associated with a particular robot (as determined by the value of the agent_id field), stating for each robot, or for all robots at once, what they must NOT index. Whoever writes the /robots.txt file must specify a substring of the Product Token of the User-Agent field that each robot sends in its HTTP request to the server being indexed. For example, the current Lycos robot sends the following User-Agent field:
Lycos_Spider_(Rex)/1.0 libwww/3.1
If the Lycos robot does not find a description for itself in /robots.txt, it acts as it sees fit. As soon as the Lycos robot "sees" a description for itself in /robots.txt, it does as instructed.
When creating a /robots.txt file, one more factor should be considered: file size. Because every file that should not be indexed is listed, and separately for many types of robots at that, /robots.txt becomes too large when many files are excluded. In that case, apply one or more of the following ways of reducing the size of /robots.txt:
Records of the /robots.txt file
General description of the record format.
[ # comment string NL ]*
User-Agent: [ [ WS ]+ agent_id ]+ [ [ WS ]* # comment string ]? NL
[ # comment string NL ]*
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
[
# comment string NL
|
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
]*
[ NL ]+
Description of the parameters used in /robots.txt records
[...]+ Square brackets followed by a + sign mean that one or more terms must be given as parameters.
For example, "User-Agent:" may be followed by one or more agent_id values separated by spaces.
[...]* Square brackets followed by a * sign mean that zero or more terms may be given as parameters.
For example, you may or may not write comments.
[...]? Square brackets followed by a ? sign mean that zero or one term may be given as a parameter.
For example, "User-Agent: agent_id" may be followed by a comment.
..|.. means either what precedes the bar or what follows it.
WS one of the characters space (octal 040) or tab (octal 011)
NL one of: line feed (octal 012), carriage return (octal 015), or both characters (Enter)
User-Agent: keyword (upper and lower case do not matter).
Its parameters are the agent_id values of search robots.
Disallow: keyword (upper and lower case do not matter).
Its parameters are the full paths of the files or directories that are not to be indexed.
# start of a comment line; comment string is the body of the comment.
agent_id any number of characters, excluding WS and NL, that identify the agent_id of the various search robots. The * character denotes all robots at once.
path_root any number of characters, excluding WS and NL, that identify the files and directories not to be indexed.
Extended comments on the format.
Each record begins with a User-Agent line, which states which robot or robots the record is intended for. The next line is Disallow, which lists the paths and files not to be indexed. EVERY record MUST have at least these two lines. All other lines are optional. A record may contain any number of comment lines. Each comment line must begin with the # character. Comments may also be placed at the end of User-Agent and Disallow lines. A # at the end of these lines is sometimes added to signal to the robot that a long agent_id or path_root string has ended. If a User-Agent line lists several agent_id values, the path_root condition in the Disallow line applies equally to all of them. There is no limit on the length of User-Agent and Disallow lines. If a search robot does not find its agent_id in /robots.txt, it ignores /robots.txt.
If you do not need to account for the specifics of each search robot, you can specify exclusions for all robots at once. This is done with the line
User-Agent: *
If a search robot finds several records in /robots.txt whose agent_id value matches it, the robot is free to choose any of them.
Each search robot will determine the absolute URL to read from the server using the /robots.txt records. Upper and lower case characters in path_root ARE significant.
Example 1:
User-Agent: *
Disallow: /
User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/
In example 1 the /robots.txt file contains two records. The first applies to all search robots and forbids indexing any files. The second applies to the Lycos search robot: it forbids the directories /cgi-bin/ and /tmp/ and permits everything else. As a result, the server will be indexed only by Lycos.
Example 2:
User-Agent: Copernicus Fred
Disallow:
User-Agent: * Rex
Disallow: /t
In example 2 the /robots.txt file contains two records. The first allows the Copernicus and Fred robots to index the whole server. The second forbids all robots, and Rex in particular, to index directories and files such as /tmp/, /tea-time/, /top-cat.txt, /traverse.this and so on. This is precisely a case of specifying a mask for directories and files.
Example 3:
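The record format described above, with several agent_id values on one User-Agent line and several path_root values on one Disallow line, can be read with a few lines of code. This is only a sketch of one possible reader, not any robot's actual implementation; the function name parse_records is made up, and treating a new User-Agent line after a Disallow line as the start of a new record is an assumption for files that omit the blank separator line.

```python
def parse_records(text):
    """Parse /robots.txt text into a list of (agent_ids, path_roots) records."""
    records = []
    agents, paths, seen_disallow = [], [], False

    def flush():
        nonlocal agents, paths, seen_disallow
        # A record counts only if it has a User-Agent line and a Disallow line.
        if agents and seen_disallow:
            records.append((agents, paths))
        agents, paths, seen_disallow = [], [], False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop whole-line and trailing comments
        if not line:
            if not raw.strip():               # a truly blank line ends the record
                flush()
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()         # keywords are case-insensitive
        if field == "user-agent":
            if seen_disallow:                 # assumption: tolerate a missing blank line
                flush()
            agents.extend(value.split())      # agent_ids are separated by WS
        elif field == "disallow" and agents:
            paths.extend(value.split())       # an empty value adds no path_root
            seen_disallow = True
    flush()
    return records

EXAMPLE_2 = """\
User-Agent: Copernicus Fred
Disallow:

User-Agent: * Rex
Disallow: /t
"""
records = parse_records(EXAMPLE_2)
print(records)
```

An empty path_roots list means nothing is disallowed for the agents in that record.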
# This is for every spider!
User-Agent: *
# stay away from this
Disallow: /spiders/not/here/ #and everything in it
Disallow: # a little nothing
Disallow: #This could be habit forming!
# Don't comments make code much more readable!!!
Example 3 contains a single record. It forbids all robots to index the directory /spiders/not/here/, including paths and files such as /spiders/not/here/really/ and /spiders/not/here/yes/even/me.html. It does not, however, cover /spiders/not/ or /spiders/not/her (in the directory '/spiders/not/').
Some problems associated with search robots.
The standard (Standard for Robot Exclusion) is unfinished.
Unfortunately, because search systems appeared not long ago, the standard for robots is still being developed and refined. This means that search engines will not necessarily follow it in the future.
Increased traffic.
This problem is not very pressing for the Russian sector of the Internet, since few servers in Russia carry traffic so heavy that a visit from a search robot would get in the way of ordinary users. Indeed, limiting what robots do is exactly what /robots.txt is for.
Not all search robots use /robots.txt.
At present this file is reliably requested only by the robots of systems such as Altavista, Excite, Infoseek, Lycos, OpenText and WebCrawler.
Using HTML meta tags.
Below is the initial draft produced by agreement among programmers from a number of commercial indexing organizations (Excite, Infoseek, Lycos, Opentext and WebCrawler) at the recent Distributing Indexing Workshop (W3C).
The meeting discussed using HTML meta tags to control the behavior of search robots, but no final agreement was reached. The following issues were identified for future discussion:
This tag is intended for users who cannot control the /robots.txt file on their Web sites. The tag lets you specify robot behavior for each HTML page, but it cannot entirely prevent the robot from requesting the page (as /robots.txt can).
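The matching rule behind example 3 is a plain, case-sensitive prefix test. A minimal sketch, with a made-up function name:

```python
def is_disallowed(path, path_roots):
    """Return True if path begins with any disallowed path_root (case-sensitive)."""
    return any(path.startswith(root) for root in path_roots)

roots = ["/spiders/not/here/"]   # the only non-empty Disallow value in example 3
candidates = [
    "/spiders/not/here/really/",
    "/spiders/not/here/yes/even/me.html",
    "/spiders/not/",
    "/spiders/not/her",
    "/Spiders/Not/Here/",        # differs only in case, so it is not matched
]
for candidate in candidates:
    print(candidate, is_disallowed(candidate, roots))
```

The same test explains the mask in example 2: a path_root of /t matches /tmp/, /tea-time/ and /top-cat.txt alike.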
<META NAME="ROBOTS" CONTENT="robot_terms">
robot_terms is a comma-separated list of the following keywords (upper and lower case do not matter): ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
NONE - tells all robots to ignore this page when indexing (equivalent to using the keywords NOINDEX, NOFOLLOW together).
ALL - allows this page and all links from it to be indexed (equivalent to using the keywords INDEX, FOLLOW together).
INDEX - allows this page to be indexed
NOINDEX - forbids indexing this page
FOLLOW - allows all links from this page to be indexed
NOFOLLOW - forbids indexing links from this page
If this meta tag is omitted or no robot_terms are given, the search robot by default behaves as if robot_terms were INDEX, FOLLOW (i.e. ALL). If the keyword ALL is found in CONTENT, the robot acts accordingly and ignores any other keywords that may be present. If CONTENT contains keywords with opposite meanings, for example FOLLOW, NOFOLLOW, the robot acts at its own discretion (in this case FOLLOW).
If robot_terms contains only NOINDEX, the page is not indexed but its links are still followed. If robot_terms contains only NOFOLLOW, the page is indexed and its links are ignored.
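The keyword definitions above can be turned into a small decision function. This sketch uses Python's standard html.parser module; the class and function names are made up. Two choices are assumptions not settled by the text: NONE is taken to win over everything except ALL, and an INDEX, NOINDEX conflict is resolved permissively, by analogy with the FOLLOW, NOFOLLOW case.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the terms of every <META NAME="ROBOTS"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.terms = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.terms.update(t.strip().upper() for t in content.split(",") if t.strip())

def robots_policy(html):
    """Return (index, follow) booleans for a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    terms = parser.terms
    if "ALL" in terms:      # ALL overrides any other keyword
        return True, True
    if "NONE" in terms:     # assumption: NONE beats everything except ALL
        return False, False
    index = "NOINDEX" not in terms or "INDEX" in terms      # assumption for conflicts
    follow = "NOFOLLOW" not in terms or "FOLLOW" in terms   # conflict resolves to FOLLOW
    return index, follow

print(robots_policy('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>'))
print(robots_policy('<html><head><title>No tag at all</title></head></html>'))
```

A real robot is free to resolve conflicting keywords differently, as the text itself notes.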
<META NAME="KEYWORDS" CONTENT="phrases">
phrases - a comma-separated list of words or phrases (upper and lower case do not matter) that help index the page (i.e. reflect its content). Roughly speaking, these are the words in response to which a search system will return this document.
<META NAME="DESCRIPTION" CONTENT="text">
text - the text that will be shown in the summary of results returned for a user's query to the search system. This text must not contain markup tags, and it makes most sense to capture the gist of the document in two or three lines.
Proposed ways of avoiding repeat visits using HTML meta tags
Some commercial search robots already use meta tags that allow "communication" between the robot and the webmaster. Altavista uses the KEYWORDS meta tag, and Infoseek uses the KEYWORDS and DESCRIPTION meta tags.
Index a document once, or do so regularly?
The webmaster can "tell" a search robot, or a user's bookmark file, that the content of a given file will change. In that case the robot will not store the URL, and the user's browser may or may not add the file to the bookmarks. As long as this information is described only in /robots.txt, the user will not know that the page is going to change.
The DOCUMENT-STATE meta tag can be useful for this. By default this meta tag is assumed to have CONTENT=STATIC.
<META NAME="DOCUMENT-STATE" CONTENT="STATIC">
<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">
How can indexing of generated pages, or duplication of documents when the server has mirrors, be avoided?
Generated pages are pages produced by CGI scripts. They almost certainly should not be indexed, because trying to reach them from a search system will produce an error. As for mirrors, it is bad form to return two different links to different servers with the same content. To avoid this, use the URL meta tag with the absolute URL of the document (for mirrors, pointing to the corresponding page on the main server).
<META NAME="URL" CONTENT="absolute_url">
Martijn Koster, translated by A. Alikberov
This document was drawn up on 30 July 1994 from discussions on the mailing list robots-request@nexor.co.uk (the list has since moved to WebCrawler; for details see the Robots pages at WebCrawler, info.webcrawler.com/mak/projects/robots/) between most robot authors and other interested people. The topic is also open for discussion on the Technical World Wide Web mailing list, www-talk@info.cern.ch. This document is based on an earlier working draft of the same name.
This document is not an official or corporate standard, and does not guarantee that all current and future search robots will use it. In accordance with it, most robot authors offer a way to protect Web servers from unwanted visits by their search robots.
The latest version of this document can be found at info.webcrawler.com/mak/projects/robots/robots.html
Search robots (wanderers, spiders) are programs that index Web pages on the Internet.
In 1993 and 1994 it became clear that robots sometimes index servers against the wishes of the servers' owners. In particular, robot activity sometimes makes the server harder for ordinary users to use, and sometimes the same files are indexed several times. In other cases robots index the wrong things, for example very "deep" virtual directories, temporary information, or CGI scripts. This standard is intended to solve such problems.
To keep a robot away from a server or parts of it, create a file on the server containing information that controls the robot's behavior. The file must be accessible via HTTP at the local URL /robots.txt. Its contents are described below.
This approach was chosen so that a search robot can find the rules describing what is required of it with a simple request for a single file. In addition, a /robots.txt file is easy to create on any existing Web server.
The choice of this particular URL was motivated by several criteria:
The format and semantics of the /robots.txt file are as follows:
The file must contain one or more records, separated by one or more blank lines (terminated by CR, CR/NL or NL). Each record must contain lines of the form:
"<field>:<optional_space><value><optional_space>".
The <field> name is case-insensitive.
Comments may be included in the file in the usual UNIX form: the # character marks the start of a comment, and the end of the line ends it.
A record must begin with one or more User-Agent lines, followed by one or more Disallow lines, whose format is given below. Unrecognized lines are ignored.
User-Agent
Disallow
Every record must consist of at least one User-Agent line and one Disallow line.
If the /robots.txt file is empty, does not conform to the given format and semantics, or does not exist, any search robot will work according to its own algorithm.
Example 1:
# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Example 1 closes the contents of the directories /cyberworld/map/ and /tmp/ to indexing.
Example 2:
# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
Example 2 closes the contents of the directory /cyberworld/map/ to indexing, but the cybermapper search robot is allowed everything.
Example 3:
# robots.txt for http://www.site.com
User-Agent: *
Disallow: /
In example 3 every search robot is forbidden to index the server.
The standard has since changed somewhat; for example, a User-Agent line may now list several robot names separated by spaces or tabs.
Martijn Koster, m.koster@webcrawler.com
Translation: Andrej Alikberov, info@citmgu.ru
Last-modified: Thu, 28 May 1998 14:14:26 GMT