
Using robots.txt

User-agent directive

You can control Yandex robot's access to your site using robots.txt file that must reside in the root directory. Yandex robot supports the specification with extensions described below.

The Yandex robot works in sessions. In each session, it generates a pool of pages that it plans to download. The session starts with downloading the site's robots.txt file. If the file is missing, or the response to the robot's request is anything other than HTTP code 200, the robot assumes that access is not restricted in any way. The robot checks the robots.txt file for entries starting with 'User-agent:', looking for the substrings 'Yandex' or '*' (case insensitive). If 'User-agent: Yandex' is found, the directives for 'User-agent: *' are disregarded. If both the 'User-agent: Yandex' and 'User-agent: *' entries are missing, the robot assumes that access is not restricted in any way.

You can target specific Yandex robots by using the following names in the 'User-agent' directive:

  • 'YandexBot' — the main indexing robot;

  • 'YandexMedia' — the robot that indexes multimedia data;

  • 'YandexImages' — indexer of Yandex.Images;

  • 'YandexCatalog' — Yandex.Catalog robot;

  • 'YaDirectFetcher' — Yandex.Direct robot; interprets robots.txt differently;

  • 'YandexBlogs' — the blog search robot that indexes comments to the posts;

  • 'YandexNews' — Yandex.News robot;

  • 'YandexPagechecker' — a robot that accesses the page when microformats are validated using the Microformats validator form;

  • 'YandexMetrika' — Yandex.Metrica robot;

  • 'YandexMarket' — Yandex.Market robot;

  • 'YandexCalendar' — Yandex.Calendar robot.

The following rule applies to all of these robots: if directives for a specific robot are found, the 'User-agent: Yandex' and 'User-agent: *' directives are disregarded.

Example:

User-agent: * # this directive will not be used by Yandex robots

Disallow: /cgi-bin 



User-agent: Yandex # this directive will be used by all Yandex robots

Disallow: /*sid= # except for the main indexing one



User-agent: YandexBot # will be used only by the main indexing robot

Disallow: /*id=
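The selection logic above can be modeled in a few lines of code. The following Python sketch is only an illustration of ours (the function name select_group is hypothetical, and the substring matching of robot names is simplified to exact, case-insensitive token matching); it is not the robot's actual parser.

def select_group(robots_txt: str, robot_name: str = "YandexBot") -> list[str]:
    """Return the rule lines of the most specific matching 'User-agent' group."""
    groups: dict[str, list[str]] = {}   # user-agent token -> its rule lines
    current: list[str] = []             # tokens of the group being read
    reading_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # '#' starts a comment
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not reading_agents:            # a new group starts here
                current = []
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
            reading_agents = True
        else:
            reading_agents = False
            for token in current:
                groups[token].append(line)
    # Precedence described above: the specific robot name wins over
    # 'Yandex', which in turn wins over '*'.
    for token in (robot_name.lower(), "yandex", "*"):
        if token in groups:
            return groups[token]
    return []   # no matching entries: access is treated as unrestricted

Applied to the example above, this returns only 'Disallow: /*id=' for the main indexing robot and 'Disallow: /*sid=' for any other Yandex robot.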

Using Disallow and Allow directives

Use the 'Disallow' directive to restrict the robot's access to specific parts of the site or to the entire site. Examples:

User-agent: Yandex

Disallow: / # blocks access to the entire site



User-agent: Yandex

Disallow: /cgi-bin # blocks access to pages 
                   # with paths starting with '/cgi-bin'

Note:

Empty lines are not allowed between the 'User-agent' directive and the 'Disallow' ('Allow') directives, or between individual 'Disallow' ('Allow') directives.

In addition, the standard recommends inserting an empty line before each 'User-agent' directive.

The '#' character is used for comments. Everything following this character, up to the first line break, is disregarded.

Use the 'Allow' directive to allow the robot access to specific parts of the site or to the entire site. Examples:

User-agent: Yandex

Allow: /cgi-bin

Disallow: /

# disallows downloading anything except for pages 
# with paths that begin with '/cgi-bin'

Using directives jointly

If several directives match a particular page of the site, the first one that appears in the selected 'User-agent' block is applied. Examples:

User-agent: Yandex

Allow: /cgi-bin

Disallow: /

# disallows downloading anything except for pages 
# with paths that begin with '/cgi-bin'

User-agent: Yandex

Disallow: /

Allow: /cgi-bin

# disallows downloading anything at all from the site
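The "first matching directive wins" behaviour shown above can be sketched as follows. This is our own illustration, not the robot's code: rules are given as (directive, path prefix) pairs in the order they appear in the block, plain prefix matching only (the '*' and '$' special characters are covered further below), and the path '/cgi-bin/foo' is a hypothetical example.

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, prefix) pairs in the order they appear in the block."""
    for directive, prefix in rules:
        if prefix == "":
            # An empty parameter inverts the directive (see the section on
            # parameterless Disallow/Allow below): 'Disallow:' acts as
            # 'Allow: /', and 'Allow:' acts as 'Disallow: /'.
            directive = "allow" if directive == "disallow" else "disallow"
            prefix = "/"
        if path.startswith(prefix):
            return directive == "allow"
    return True                                # nothing matched: access is allowed

# The two blocks above give opposite results for the same path:
print(is_allowed([("allow", "/cgi-bin"), ("disallow", "/")], "/cgi-bin/foo"))   # True
print(is_allowed([("disallow", "/"), ("allow", "/cgi-bin")], "/cgi-bin/foo"))   # False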

Using Disallow and Allow directives without parameters

If there are no parameters specified for a directive, this is interpreted as follows:

User-agent: Yandex

Disallow: # same as Allow: /



User-agent: Yandex

Allow: # same as Disallow: /

Using special characters "*" and "$"

You can use the special characters '*' and '$' when specifying paths in the Allow and Disallow directives, thereby defining certain regular expressions. The '*' special character stands for any (including empty) sequence of characters. Examples:

User-agent: Yandex

Disallow: /cgi-bin/*.aspx # disallows access to '/cgi-bin/example.aspx'
                          # and '/cgi-bin/private/test.aspx'

Disallow: /*private # disallows not only  '/private',
                    # but also '/cgi-bin/private'

'$' special character

By default '*' is appended to the end of each rule contained in robots.txt, for example:

User-agent: Yandex

Disallow: /cgi-bin* # blocks access to the pages 
                    # with paths that begin with '/cgi-bin'

Disallow: /cgi-bin # the same

To cancel the '*' implied at the end of a rule, use the '$' special character, for example:

User-agent: Yandex

Disallow: /example$ # disallows '/example', 
                    # but does not disallow '/example.html'

User-agent: Yandex

Disallow: /example # disallows both '/example' 
                   # and '/example.html'

User-agent: Yandex

Disallow: /example$ # only disallows '/example'

Disallow: /example*$ # similar to 'Disallow: /example':
                     # disallows both '/example.html' and '/example'
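One way to reason about '*' and '$' is to translate a rule into a regular expression: '*' becomes "match anything", and a trailing '*' is implied unless the rule ends with '$'. The Python sketch below is our own assumption about equivalent matching behaviour, not the robot's implementation; pattern_to_regex is a hypothetical helper name.

import re

def pattern_to_regex(rule: str) -> re.Pattern:
    anchored = rule.endswith("$")        # '$' cancels the implicit trailing '*'
    if anchored:
        rule = rule[:-1]
    parts = [re.escape(part) for part in rule.split("*")]
    regex = ".*".join(parts) + ("$" if anchored else ".*")
    return re.compile(regex)

print(bool(pattern_to_regex("/example$").match("/example")))         # True
print(bool(pattern_to_regex("/example$").match("/example.html")))    # False
print(bool(pattern_to_regex("/cgi-bin/*.aspx").match("/cgi-bin/private/test.aspx")))  # True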

Sitemap directive

If you use a Sitemap XML file to describe the URLs on your site that are available for indexing, and would like to share this information with our indexing robot, provide the location of the sitemap file (list all the files if you have more than one) in the 'Sitemap' directive of your robots.txt:

User-agent: Yandex 

Allow: / 

Sitemap: http://mysite.ru/site_structure/my_sitemaps1.xml 
Sitemap: http://mysite.ru/site_structure/my_sitemaps2.xml

or

User-agent: Yandex 

Allow: /

User-agent: *

Disallow: / 

Sitemap: http://mysite.ru/site_structure/my_sitemaps1.xml 
Sitemap: http://mysite.ru/site_structure/my_sitemaps2.xml

Our indexing robot will remember the location of your sitemap file(s) and process their contents each time it visits your site.
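If you want to check which sitemap locations a given robots.txt advertises, a small script is enough. The sketch below is only an illustration of ours (the function name sitemap_urls is hypothetical), assuming each entry sits on its own 'Sitemap:' line as in the examples above.

def sitemap_urls(robots_txt: str) -> list[str]:
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())   # keep everything after the first ':'
    return urls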

Host directive

If your site has mirrors, a special mirroring robot will locate them and form a group of mirrors for your site. Only the main mirror will participate in the search. You can specify it for all the mirrors in the robots.txt file using the 'Host' directive, giving the name of the main mirror as the directive's parameter. The 'Host' directive does not guarantee that the specified main mirror will be selected; however, the decision-making algorithm takes it into account with high priority. Example:

# Let's assume that www.main-mirror.com is the main mirror of the site. Then  
# robots.txt for all the sites from the mirror group will look as follows: 

User-Agent: *

Disallow: /forum

Disallow: /cgi-bin

Host: www.main-mirror.com

THIS IS IMPORTANT: To achieve compatibility with robots that somewhat deviate from standard behaviour when processing robots.txt, the 'Host' directive must be added to the group that starts with the 'User-Agent' entry, right after the 'Disallow' ('Allow') directive(s). The 'Host' directive takes as its argument a domain name followed, after a colon, by a port number (80 by default).

# Example of a well-formed robots.txt whose  
# Host directive will be taken into account during parsing

User-Agent: *

Disallow:

Host: www.myhost.ru

However, the Host directive is an intersectional one, so it will be used by the robot regardless of its location in robots.txt.

THIS IS IMPORTANT: Only one Host directive is allowed in robots.txt. If several directives are specified, only one of them will be used.

Example:

Host: myhost.ru # used

User-agent: *

Disallow: /cgi-bin

User-agent: Yandex

Disallow: /cgi-bin

Host: www.myhost.ru # not used

THIS IS IMPORTANT: The parameter of the Host directive must contain one well-formed host name (i.e. one compliant with RFC 952, not an IP address) and a valid port number. Incorrectly formed 'Host:' lines are ignored.

# Examples of Host directives that will be ignored

Host: www.myhost-.ru

Host: www.-myhost.ru

Host: www.myhost.ru:100000

Host: www.my_host.ru

Host: .my-host.ru:8000

Host: my-host.ru.

Host: my..host.ru

Host: www.myhost.ru/

Host: www.myhost.ru:8080/

Host: http://www.myhost.ru

Host: 213.180.194.129

Host: www.firsthost.ru,www.secondhost.ru

Host: www.firsthost.ru www.secondhost.ru
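A rough validation of a Host value, loosely following the ignored examples above, might look like the sketch below. This is our assumption, not the mirroring robot's actual checks; in particular, the 65535 port limit and the exact label rules are illustrative.

import re

HOST_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")

def host_value_is_valid(value: str) -> bool:
    if "://" in value or "/" in value or "," in value or " " in value:
        return False                      # scheme, path or several hosts listed
    host, _, port = value.partition(":")
    if port and not (port.isdigit() and 1 <= int(port) <= 65535):
        return False                      # e.g. 'www.myhost.ru:100000'
    labels = host.split(".")
    if "" in labels:
        return False                      # leading, trailing or double dots
    if all(label.isdigit() for label in labels):
        return False                      # bare IP addresses are ignored
    return all(HOST_LABEL.match(label) for label in labels)

print(host_value_is_valid("www.myhost.ru"))     # True
print(host_value_is_valid("www.my_host.ru"))    # False (underscore)
print(host_value_is_valid("213.180.194.129"))   # False (IP address)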

Examples of using the Host directive:

# if domain.myhost.ru is the main mirror for 
# www.domain.myhost.ru, then the correct usage of the  
# Host directive is as follows:

User-Agent: *

Disallow:

Host: domain.myhost.ru

# if domain.myhost.ru is the main mirror for 
# www.domain.myhost.ru, then an incorrect usage of the  
# Host directive would be as follows:

User-Agent: *

Disallow:

Host: myhost.ru

Additional information

The Yandex robot does not support any robots.txt directives that are not mentioned in this document.

Please take into account that the result of using the robots.txt format extensions may differ from the result obtained without them, for example:

User-agent: Yandex 

Allow: /

Disallow: /

# when extensions are not used, this disallows everything because 'Allow: /' was ignored, 
# while, when the extensions are supported, everything is allowed

User-agent: Yandex

Disallow: /private*html

# when extensions are not used, this disallows '/private*html', 
# and when extensions are supported, this also disallows '/private*html', 
# '/private/test.html', '/private/html/test.aspx' etc.

User-agent: Yandex

Disallow: /private$

# when extensions are not used, this disallows '/private$', '/private$test', etc. 
# and when extensions are supported, this only disallows '/private'

User-agent: *

Disallow: /

User-agent: Yandex

Allow: /

# when extensions are not supported, then, because there is no line break, 
# 'User-agent: Yandex' would be ignored and  
# the result would be 'Disallow: /', but Yandex robot  
# identifies entries because of the  'User-agent:' substring, 
# and the result for Yandex robot in this particular case is 'Allow: /'

Examples of usage for robots.txt extended format:

User-agent: Yandex

Allow: /archive

Disallow: /

# allows everything in '/archive' and disallows all the rest

User-agent: Yandex

Allow: /obsolete/private/*.html$ # allows html files 
                                 # with the path of  '/obsolete/private/...'

Disallow: /*.php$  # disallows all '*.php' on this site

Disallow: /*/private/ # disallows all subpaths containing
                      # '/private/', but the Allow directive above overrides 
                      # part of this restriction

Disallow: /*/old/*.zip$ # disallows all '*.zip' files whose path contains  
                        # '/old/'

User-agent: Yandex

Disallow: /add.php?*user= 

# disallows all 'add.php?' scripts with a parameter of  'user'

When creating your robots.txt file, remember that there is a limit on the size of the file the robot can process. robots.txt files that are too big (over 32 KB) are interpreted as unrestricting, i.e. they are regarded as equivalent to the following:

User-agent: Yandex

Disallow:

robots.txt files that the robot was unable to download (for example, due to incorrect HTTP headers) or that return a 404 error are also regarded as unrestricting.
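The download rules above (anything other than a successful, reasonably sized response is treated as "no restrictions") can be sketched as follows. This is our own illustration, not the robot's fetching code; the helper name fetch_robots_txt and the 10-second timeout are assumptions.

import urllib.error
import urllib.request

MAX_ROBOTS_SIZE = 32 * 1024   # the 32 KB limit mentioned above

def fetch_robots_txt(site: str) -> str:
    """Return the robots.txt body, or '' (unrestricting) when it cannot be used."""
    try:
        with urllib.request.urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
            if resp.status != 200:
                return ""                             # non-200 response: unrestricting
            body = resp.read(MAX_ROBOTS_SIZE + 1)
            if len(body) > MAX_ROBOTS_SIZE:
                return ""                             # too big: unrestricting
            return body.decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        return ""                                     # could not download: unrestricting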

To validate your robots.txt file, you can use a dedicated online analyzer. See the description of the analyzer.

Crawl-delay directive

If the server is overloaded and does not have enough time to process download requests, use the Crawl-delay directive. It lets you specify the minimum interval (in seconds) a search robot must wait after downloading one page before starting to download the next. To achieve compatibility with robots that somewhat deviate from standard behaviour when processing robots.txt, the Crawl-delay directive must be added to the group that starts with the 'User-Agent' entry, right after the 'Disallow' ('Allow') directive(s).

The Yandex search robot supports fractional values for Crawl-delay, e.g. 0.5. This does not mean that the search robot will access your site every half second, but it gives the robot more freedom and may speed up site processing.

Examples:

User-agent: Yandex

Crawl-delay: 2 # specifies a delay of 2 seconds 

User-agent: *

Disallow: /search

Crawl-delay: 4.5 # specifies a delay of 4.5 seconds 
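On the crawler side, honouring Crawl-delay simply means pausing between downloads. The Python sketch below is our own illustration (the function polite_fetch and its parameters are hypothetical), not how the Yandex robot is implemented.

import time
import urllib.request

def polite_fetch(urls: list[str], crawl_delay: float = 2.0) -> list[bytes]:
    """Download the given URLs, waiting at least crawl_delay seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        with urllib.request.urlopen(url, timeout=10) as resp:
            pages.append(resp.read())
        if i + 1 < len(urls):
            time.sleep(crawl_delay)   # fractional values such as 0.5 work as well
    return pages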

Clean-param directive

If your site's page addresses contain dynamic parameters that do not affect the content (e.g. session, user or referrer identifiers), you can describe them using the 'Clean-param' directive. Using this information, the Yandex robot will avoid repeatedly reloading duplicate information. This improves the efficiency with which the robot processes your site and reduces the server load.

For example, your site has the following pages:

www.site.ru/some_dir/get_book.pl?ref=site_1&book_id=123

www.site.ru/some_dir/get_book.pl?ref=site_2&book_id=123

www.site.ru/some_dir/get_book.pl?ref=site_3&book_id=123

The 'ref=' parameter is only used to track the resource from which the request was sent; it does not change the content, and the same book, 'book_id=123', is displayed at all three addresses. Then, if you specify the following in robots.txt:

Clean-param: ref /some_dir/get_book.pl

for example:

User-agent: Yandex

Disallow:

Clean-param: ref /some_dir/get_book.pl

the Yandex robot will consolidate all the page addresses into one:

www.site.ru/some_dir/get_book.pl?ref=site_1&book_id=123

If a parameterless page is available on the site,

www.site.ru/some_dir/get_book.pl?book_id=123

then, after the robot indexes it, all other pages will be merged with it. Other pages of your site will be traversed more often, because there will be no need to traverse the following pages:

www.site.ru/some_dir/get_book.pl?ref=site_2&book_id=123

www.site.ru/some_dir/get_book.pl?ref=site_3&book_id=123

Syntax for using the directive:

Clean-param: p0[&p1&p2&..&pn] [path]

In the first field you list the parameters that must be disregarded, delimited with '&'. In the second field, specify the path prefix for the pages to which the rule must be applied.

THIS IS IMPORTANT: The Clean-param directive is an intersectional one, so it will be used by the robot regardless of its location in robots.txt. If several directives are specified, all of them will be taken into account by the robot.

Note:

The prefix may contain a regular expression in a format similar to the one used in robots.txt, with some restrictions: the only characters allowed are A-Za-z0-9.-/*_. The '*' character is interpreted in the same way as in robots.txt, and a '*' is always implicitly appended to the end of the prefix, i.e.:

Clean-param: s /forum/showthread.php

means that the 's' parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages of the site. Everything is case sensitive. A rule cannot exceed 500 characters in length. For example:

Clean-param: abc /forum/showthread.php

Clean-param: sid&sort /forumt/*.php

Clean-param: someTrash&otherTrash

Additional examples:

# for addresses of the following type:

www.site1.ru/forum/showthread.php?s=681498b9648949605&t=8243

www.site1.ru/forum/showthread.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain:

User-agent: Yandex

Disallow:

Clean-param: s /forum/showthread.php

# for addresses of the following type:

www.site2.ru/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df

www.site2.ru/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae

#robots.txt will contain:

User-agent: Yandex

Disallow:

Clean-param: sid /index.php

# if there is more than one such parameter:

www.site1.ru/forum_old/showthread.php?s=681498605&t=8243&ref=1311

www.site1.ru/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896

#robots.txt will contain:

User-agent: Yandex

Disallow:

Clean-param: s&ref /forum*/showthread.php

# if a parameter is used in more than one script:

www.site1.ru/forum/showthread.php?s=681498b9648949605&t=8243

www.site1.ru/forum/index.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain:

User-agent: Yandex

Disallow:

Clean-param: s /forum/index.php

Clean-param: s /forum/showthread.php
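The effect of Clean-param is easiest to see as URL normalisation: parameters named in a rule are dropped from addresses whose path matches the rule's prefix, so duplicates collapse to one address. The sketch below is our own interpretation, not the robot's code; clean_url is a hypothetical helper, and the rule format (a set of parameter names plus a prefix) is our simplification of the directive's two fields.

import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def clean_url(url: str, rules: list[tuple[set[str], str]]) -> str:
    """rules: (parameter names, path prefix) pairs; '*' in the prefix is a wildcard."""
    parts = urlsplit(url)
    drop: set[str] = set()
    for params, prefix in rules:
        # An implicit '*' is appended to the end of the prefix, as described above.
        regex = ".*".join(re.escape(p) for p in prefix.split("*")) + ".*"
        if re.match(regex, parts.path):
            drop |= params
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(query)))

# With 'Clean-param: ref /some_dir/get_book.pl' the book URLs from the
# earlier example collapse to the same address:
rule = [({"ref"}, "/some_dir/get_book.pl")]
print(clean_url("http://www.site.ru/some_dir/get_book.pl?ref=site_1&book_id=123", rule))
print(clean_url("http://www.site.ru/some_dir/get_book.pl?ref=site_2&book_id=123", rule))
# both print http://www.site.ru/some_dir/get_book.pl?book_id=123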

What is a robots.txt file?

Robots.txt is a text file that resides on your site and is intended for search engine robots. In this file, the webmaster can specify site indexing parameters for all robots at once or for each search engine individually.

How to create robots.txt

Using any text editor (such as Notepad or WordPad), create a file named robots.txt and fill it in according to the rules described above. Then place the file in the root directory of your site.

To make sure that your robots.txt file will be processed correctly, use the robots.txt analyzer.

Exceptions

A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, these robots are not subject to the generic robots.txt restrictions (User-agent: *). Some robots.txt restrictions on certain sites may also be ignored if an agreement has been reached between Yandex and the site owners.

Important: if one of these robots downloads a webpage not normally accessible by the Yandex indexing robot, this webpage will never be indexed and will not appear in our search results.

List of Yandex robots not subject to standard robots.txt restrictions:

  • YaDirectFetcher downloads ad landing pages to check their availability and content. This is compulsory for placing ads in Yandex search results and on YAN partner sites;

  • YandexCalendar regularly downloads calendar files requested by users, even if they are located in directories that are blocked from indexing.

To prevent this behavior, you can restrict these robots' access to some or all of your site with robots.txt directives addressed to them specifically, for example:

User-agent: YaDirectFetcher
Disallow: /

User-agent: YandexCalendar
Disallow: /*.ics$