During the creation and subsequent monitoring of websites, the robots.txt instructions are always taken into account.

About the robots.txt file. The Robots Exclusion Standard (robots.txt) is a file that restricts robots' access to content on an HTTP server. The file must reside in the root of the website (i.e., it must be available at the path /robots.txt relative to the website's name). If the website has subdomains, the file must be located in the root directory of each of them. This file complements the Sitemaps standard. More information is available on wikipedia.org.
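For illustration, a minimal robots.txt placed at the site root might look like this (the domain and paths are hypothetical examples, not taken from any real site):

```text
# Served at http://example.com/robots.txt
User-agent: *
Disallow: /admin/
Allow: /admin/help.html

# Complements the Sitemaps standard:
Sitemap: http://example.com/sitemap.xml
```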

Taking the robots.txt instructions into account enables Sorge to build a more accurate map of the actual website, ignoring the sections that the website owners consider insignificant.

Only instructions clearly addressed to the major search engines (Google, Yandex, Bing and others) or to all robots (User-agent: *) are considered. Instructions addressed to unpopular search engines, such as Rambler, are ignored.
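The user-agent matching described above can be reproduced with Python's standard urllib.robotparser module. This sketch illustrates the general Robots Exclusion Standard behavior, not Sorge's internal parser; the rules and URLs are made up for the example:

```python
from urllib import robotparser

# A ban addressed only to Rambler, with everything open to all other robots.
rules = """\
User-agent: Rambler
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The Rambler-only Disallow does not affect other robots:
print(rp.can_fetch("Googlebot", "http://example.com/page"))  # True
print(rp.can_fetch("Rambler", "http://example.com/page"))    # False
```

This is why a Disallow aimed at a niche crawler does not hide the section from everyone else: each robot follows only the group addressed to it, falling back to the User-agent: * group.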

Of these instructions, only the Allow (allow monitoring) and Disallow (forbid monitoring) commands are considered. If robots.txt contains no instructions, or there is no such file on the website, the project will monitor all pages.
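The interplay of Allow and Disallow, and the "no rules means everything is allowed" fallback, can be sketched with the same standard-library parser (all URLs are illustrative; note that urllib.robotparser applies rules in file order, so the more specific Allow line is listed first):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
""".splitlines())

# One page inside a disallowed section is explicitly allowed:
print(rp.can_fetch("*", "http://example.com/private/public-report.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/secret.html"))         # False

# With no robots.txt rules at all, every page is allowed:
empty = robotparser.RobotFileParser()
empty.parse([])
print(empty.can_fetch("*", "http://example.com/anything"))  # True
```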

An important point when creating Projects for websites whose robots.txt blocks the entire site.

When you create a Project, the Service automatically checks whether a robots.txt file exists on the website and whether it contains the command Disallow: / (which blocks indexing of the website's entire content). If this command is found, the wizard shows a warning that the website contains a content-blocking instruction. You can cancel creating the Project, or click "Ignore": in that case the Project will be created, but the robots.txt instructions will be ignored completely for this Project.
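Sorge's exact check is internal, but the idea of detecting a full-site block can be sketched as follows (the function name and URLs are hypothetical, introduced only for this example):

```python
from urllib import robotparser

def blocks_entire_site(robots_lines):
    """Rough check in the spirit of the wizard's warning:
    does robots.txt forbid all robots from fetching any page?"""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    # If even the site root is forbidden for every robot,
    # the whole site is effectively blocked.
    return not rp.can_fetch("*", "http://example.com/")

print(blocks_entire_site(["User-agent: *", "Disallow: /"]))         # True
print(blocks_entire_site(["User-agent: *", "Disallow: /private/"])) # False
```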


Robots.txt command in the Sorge system
The warning window indicating that robots.txt contains an instruction restricting indexing of the website.

An important point when uploading links manually.

All manually uploaded links ignore the rules in robots.txt, because you have explicitly specified them for monitoring.

If during monitoring you want to disable robots.txt processing for a specific website, please contact the support team.

Read more: Changelog
