Lumar's Advanced Settings allow you to customize the crawl to meet the exact parameters you need. In this article, we'll give you an overview of the settings available. In step 4 of the crawl setup process you'll see an 'Advanced Settings' button; click this to open up the options available.
Scope
Domain Scope
This setting allows you to set the primary domain, choose whether sub-domains and both HTTP and HTTPS will be crawled, and add any secondary domains that will be crawled. These may have been set in steps 1 and 2 of the crawl setup, but can be changed here if needed.
URL Scope
Here, you can choose to include only specific URL paths or exclude specific URL paths.
It also allows you to create page groups. Add a name and a regular expression in the 'Page URL Match' column, and select the percentage of matching URLs that you would like to crawl. URLs matching the designated path are counted, and once the limit has been reached, all further matching URLs go into the 'Page Group Restrictions' report and are not crawled.
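As a rough illustration of how a page-group limit behaves, the sketch below matches URLs against a 'Page URL Match' regex and splits them into crawled and restricted sets. The logic is hypothetical (Lumar applies limits internally as URLs are discovered), but the idea is the same:

```python
import re

def apply_page_group(urls, pattern, percent):
    """Hypothetical sketch: crawl only the first `percent`% of URLs
    matching a page-group regex; the rest are restricted."""
    matching = [u for u in urls if re.search(pattern, u)]
    limit = round(len(matching) * percent / 100)
    crawled = matching[:limit]
    restricted = matching[limit:]  # would appear in 'Page Group Restrictions'
    return crawled, restricted

urls = [
    "https://example.com/blog/a",
    "https://example.com/blog/b",
    "https://example.com/blog/c",
    "https://example.com/blog/d",
    "https://example.com/shop/x",
]
crawled, restricted = apply_page_group(urls, r"/blog/", 50)
print(len(crawled), len(restricted))  # → 2 2
```

With a 50% limit, two of the four matching blog URLs are crawled and two are restricted; the shop URL is unaffected because it never matches the page-group expression.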
Resource Restrictions
Here you can define which types of URLs you want Lumar to crawl (e.g. non-HTML, internal and/or external CSS or JS resources, images, etc.). You can also set Lumar to ignore an invalid SSL certificate.
Link Restrictions
This setting allows you to define which links you want Lumar to crawl (e.g. follow anchor links, pagination links, etc.).
Redirect Settings
Here you can choose whether to follow internal or external redirects.
Link Validation
Here you can choose which links are crawled to see if they are responding correctly.
Spider Settings
Start URLs
By default, the crawl will start from your primary domain, but you can set it to start from a different point, or multiple points, here. This would have been set in step 2 of the crawl setup process, but can be changed here if needed.
JavaScript Rendering
Here you can enable or disable JavaScript rendering. You can also add any custom rejections, any additional custom JavaScript, and any external JavaScript resources.
Crawler IP Settings
Here you can select regional IPs if required. If your crawl is blocked, or you need to crawl behind a firewall (e.g. a staging environment), you will need to ask your web team to whitelist 52.5.118.182 and 52.86.188.211.
User Agent
This is where you can set the user agent for the crawl, and change the viewport dimensions if required. This would have been set in step 1, but you can change this here if needed.
Mobile Site
If your website has a separate mobile site, you can enter settings here to help Lumar use a mobile user-agent when crawling the mobile URLs.
Robots Overwrite
This allows you to exclude additional URLs using a custom robots.txt file, letting you test the impact of pushing a new file to a live environment. You can also choose to ignore robots.txt for navigation requests and/or for resources. As mentioned above, site speed crawls are set to ignore robots.txt by default; if required, you can change these settings here.
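Before uploading a draft robots.txt, you can sanity-check its rules yourself with Python's standard `urllib.robotparser`. The rules below are hypothetical examples, not recommendations:

```python
from urllib import robotparser

# Hypothetical draft rules to test before pushing a new file live
custom_robots = """\
User-agent: *
Disallow: /checkout/
Disallow: /internal-search
"""

rp = robotparser.RobotFileParser()
rp.parse(custom_robots.splitlines())

print(rp.can_fetch("*", "https://example.com/checkout/basket"))  # → False
print(rp.can_fetch("*", "https://example.com/products/widget"))  # → True
```

This is a quick way to confirm a draft file blocks (or allows) the paths you expect before testing it in a crawl.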
Stealth Mode Crawl
This allows you to run a crawl as if a set of real users were performing it. It runs at 1 URL every 3 seconds, and the user-agent and IP address are randomized for each request.
Custom Request Header
This is where you can add any custom request headers that will be sent with every request.
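A custom header travels with every request exactly as it would in a hand-built HTTP request. The header names and values below are hypothetical placeholders; use whatever your server or web team expects:

```python
import urllib.request

# Hypothetical header names -- substitute the ones your server expects
req = urllib.request.Request(
    "https://example.com/",
    headers={
        "X-Crawl-Token": "secret-token",  # e.g. a shared secret for a firewall rule
        "Accept-Language": "en-GB",
    },
)

for name, value in req.header_items():
    print(f"{name}: {value}")
```

Each request the crawler sends would carry these name/value pairs alongside its normal headers.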
Cookies
This setting is mostly used for accessibility crawls, to ensure any cookie popup is cleared so the crawl can progress. This is not generally required for tech SEO or site speed crawls, but you can see how to configure cookie details here if you need to use it.
Extraction
Custom Extraction
Here you can use regular expressions to extract custom information from pages when they are crawled.
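A custom extraction rule is essentially a regular expression with a capture group applied to the page source. The markup and patterns below are hypothetical; adapt them to your own pages:

```python
import re

# Hypothetical page markup
html = ('<div class="product">'
        '<span class="price">£24.99</span>'
        '<span class="sku">AB-1234</span>'
        '</div>')

# Hypothetical extraction rules -- one capture group per value
price = re.search(r'class="price">([^<]+)<', html)
sku = re.search(r'class="sku">([^<]+)<', html)

print(price.group(1))  # → £24.99
print(sku.group(1))    # → AB-1234
```

Each rule's captured value would then appear against the crawled URL in your reports.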
Test Settings
Test Site Domain
This setting allows you to enter your test environment domain to allow comparisons with your live site.
Custom DNS
This allows custom DNS entries to be configured if your website does not have public DNS records (e.g. a staging environment).
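Conceptually, a custom DNS entry maps a hostname to an IP address before public DNS is consulted, much like a local hosts file. A minimal sketch (the hostname and address are hypothetical):

```python
import socket

# Hypothetical staging entry: the hostname has no public DNS record
CUSTOM_DNS = {"staging.example.com": "10.0.0.42"}

def resolve(hostname):
    """Check custom entries first, then fall back to normal DNS resolution."""
    if hostname in CUSTOM_DNS:
        return CUSTOM_DNS[hostname]
    return socket.gethostbyname(hostname)

print(resolve("staging.example.com"))  # → 10.0.0.42
```

Requests to the staging hostname resolve to the private address, while every other hostname is looked up as usual.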
Authentication
If you need to include authentication credentials in all requests using basic authentication, you can enter the details here.
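Basic authentication works by base64-encoding `username:password` into an `Authorization` header (per RFC 7617), which is then attached to every request. The credentials below are placeholders:

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("crawler", "s3cret"))  # placeholder credentials
```

Whatever details you enter in this setting end up as a header of this shape on each request.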
Remove URL Parameters
If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these here.
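The effect of this setting is to normalize URLs by dropping the listed parameters before pages are crawled. A sketch with a hypothetical parameter list:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

REMOVE = {"utm_source", "utm_medium", "sessionid"}  # hypothetical parameter list

def strip_params(url):
    """Drop the excluded query parameters, keeping everything else intact."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in REMOVE]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("https://example.com/p?id=7&utm_source=mail&sessionid=abc"))
# → https://example.com/p?id=7
```

This prevents the same page being crawled repeatedly under different tracking or session parameters.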
URL Rewriting
Here you can add a regular expression to match a URL and add an output expression.
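A match expression and output expression work like a regex search-and-replace. The rule below is hypothetical (it strips a trailing slash) and is expressed in Python's `re.sub` terms to show the idea:

```python
import re

# Hypothetical rewrite rule: strip a trailing slash from URLs
match_expr = r"^(https?://[^?]+)/$"  # the URL match expression
output_expr = r"\1"                  # the output expression (group 1 = URL minus the slash)

print(re.sub(match_expr, output_expr, "https://example.com/blog/"))
# → https://example.com/blog
```

Any URL matching the expression is rewritten to the output form before it is crawled and reported.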
Report Setup
Save HTML & Screenshots
Lumar enables you to save the static and rendered HTML and screenshots during the crawl. Find out more about storing HTML and screenshots.
API Callback
This is where you can specify a URL to be called once your crawl has been completed to trigger an external application.
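On the receiving side, the callback is an ordinary HTTP request to the URL you specify, so any small web handler can act on it. A minimal sketch of a receiver follows; the payload shape is hypothetical, so check the callback documentation for the real fields:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlCallbackHandler(BaseHTTPRequestHandler):
    """Accept a crawl-complete callback and kick off downstream work."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print("Crawl finished:", payload)  # e.g. trigger a report export here
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the example quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CrawlCallbackHandler)
# server.serve_forever()  # uncomment to listen until stopped
```

Pointing the callback at an endpoint like this lets you start an external job, such as a report export, the moment the crawl completes.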
Crawl Email Alerts
Set whether to receive email notifications on the progress of your crawl, and specify the email addresses that will receive notifications.
Report Settings
This last advanced setting allows you to change some of the specific parameters for Lumar reports.