By default, SEO projects use the live robots.txt file for each unique hostname found while crawling pages and their resources. Each hostname's robots.txt is fetched separately at the start of the crawl and refreshed periodically during the crawl, so the latest restrictions are obeyed even if the file is updated mid-crawl.
Accessibility crawls ignore robots.txt for both pages and resources by default, because the crawl is intended to simulate a human using a browser, which does not respect robots.txt.
You may want to modify the crawler's behavior for the following use cases:
Testing a new robots.txt before it's pushed live
Applying a new robots.txt as an overwrite allows you to see the impact on the site architecture, rather than just testing a list of existing URLs. A few URLs changing from disallowed to allowed can have a large impact on the total number of URLs found during the crawl, something that cannot be simulated by testing a fixed list of known URLs.
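For example, a modified file might remove a single Disallow rule so the next crawl shows how many newly allowed URLs appear. This is a minimal sketch; the paths are hypothetical and not taken from any real site:

    # Live robots.txt (hypothetical)
    User-agent: *
    Disallow: /category-filters/
    Disallow: /search/

    # Overwrite to test (hypothetical): the /category-filters/ rule is removed
    User-agent: *
    Disallow: /search/

Comparing a crawl run with the overwrite against a previous crawl shows how many additional URLs the relaxed rule would expose.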
Finding new pages that might be crawled
Running a crawl with a less restrictive robots.txt file than the one on the live site may reveal new pages that you would like search engines to discover, but which can only be reached via disallowed URLs. This can only be seen with a full crawl that does not obey the live robots.txt.
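For instance, if a hub page is disallowed, pages linked only from that hub are never discovered. In this minimal, hypothetical sketch, removing the rule in the overwrite lets the crawler reach the hub and the pages behind it:

    # Hypothetical live rule blocking a hub page that is the only path to new pages
    User-agent: *
    Disallow: /new-arrivals/

    # Hypothetical overwrite with the rule removed, allowing everything
    User-agent: *
    Disallow:

With the overwrite in place, the crawl can follow links on /new-arrivals/ and report the pages that would otherwise remain undiscovered.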
Crawling a disallowed staging site
Staging sites are often prevented from being indexed by search engines, with every staging URL disallowed in the domain's robots.txt file. Using the Robots.txt Overwrite function in Lumar, you can replace that restrictive file and crawl your staging site just like a live site.
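A minimal sketch of what this looks like in practice (the rules are typical examples, not taken from a real staging site):

    # Live staging robots.txt, blocking everything
    User-agent: *
    Disallow: /

    # Overwrite pasted into Lumar so the crawl behaves like production
    User-agent: *
    Disallow:

The empty Disallow directive in the overwrite allows all URLs, so the staging crawl matches what a crawl of the live site would find.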
Matching Robots.txt Rules
Robots.txt rule blocks are matched against the project's user-agent token, which is visible in the project's user-agent settings.
An appropriate user-agent token is provided for each of the predefined user-agent options, and it can be customized using the custom user-agent setting.
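As an illustration of how the token selects a rule block (the paths are hypothetical, and 'examplebot' stands in for a custom token), a project whose user-agent token is 'googlebot' would obey the first block, a project with the token 'examplebot' would obey the second, and any other token would fall back to the wildcard block:

    User-agent: googlebot
    Disallow: /private/

    User-agent: examplebot
    Disallow: /internal/

    User-agent: *
    Disallow: /tmp/

Under the Robots Exclusion Protocol, only the most specific matching block applies; rules are not combined across different user-agent blocks.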
The Robots.txt Overwrite
The Robots.txt Overwrite function allows you to crawl a site using a different robots.txt file to the live version. Add your modified robots.txt into the Robots Overwrite field in Advanced Settings and make sure the 'Use Robots Overwrite' checkbox is checked before you crawl.
The modified robots.txt will be used for all subsequent crawls, and any URLs matching a disallowed directive for the project's user-agent token will be reported as disallowed.
Crawling Disallowed URLs
Disallowed URLs are not crawled by default, and a sample of them is reported in the Disallowed URLs report. It is possible to crawl disallowed URLs in order to gather additional metrics, while keeping them reported separately as Disallowed Pages.
To do this, you must enable the 'Check disallowed links' setting found under Advanced Settings > Link Validation.
Robots.txt for Resources
The robots.txt can be ignored for resources fetched during rendering by enabling the 'Ignore robots.txt for resources' setting found under Spider Settings > JavaScript Rendering.