Running frequent and targeted crawls of your website is a key part of improving its technical health and its rankings in organic search. In this guide, you'll learn how to crawl a website efficiently and effectively with Lumar.
Getting Started
Once you've logged into Lumar Analyze, click on the New Project button in the top right corner of the screen.
You can then choose between an SEO project, which generates reports and metrics to optimize for search engines, or a Site Speed project, which generates reports and metrics to optimize for speed performance. Select the relevant option to move to Step 1 in the crawl setup process.
Step 1 - Basic Information
Before starting a crawl, it’s a good idea to get an understanding of your site’s domain structure. Enter your domain name into the ‘Domain’ field and click ‘Check’. You’ll then see a purple thumbs-up for the relevant domain.
If you’d like the crawl to include any sub-domains it finds, check the ‘Crawl sub-domains’ option. Just underneath that, you can also choose whether to crawl both HTTP and HTTPS.
The project name will automatically populate with the domain name, but you can change this to anything you like to help you identify the project.
If you have subscribed to our Impact functionality, you will also be able to see benchmarks against the health scores. You can choose the industry to benchmark against in the ‘Industry’ drop-down (or choose ‘all industries’).
At the bottom of the screen, in the 'SEO Settings' or 'Site Speed Settings' section you can choose which user agent you want to use for the crawl.
Previewing Project Setup
You can access the preview in step 1 of the crawl setup process, where you’ll see an option to ‘Save & Preview’.
When the preview opens, you can choose to view a screenshot, the rendered HTML, static HTML, and response headers. Once you’ve reviewed the preview and everything looks OK, you can go back to the settings using the button in the bottom left corner, or the x in the top right.
You can also access the ‘Save and Preview’ option in step 4 of the crawl setup process. Just click on Advanced Settings, and you’ll see the option just underneath.
Step 2 - Sources
There are seven different types of URL sources you can include in your Lumar projects.
Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics data, as well as other data types. Check the box next to the relevant source to include the different elements in your crawl.
- Website. Crawl only the site by following its links to deeper levels. The crawl will start from your primary domain by default, but if you need it to start from a different point or multiple points, you can also specify those by expanding the website option.
- Sitemaps. Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled. When you expand the options here, you can also manually add sitemaps, select or deselect different sitemaps, and choose whether to discover and crawl new sitemaps in robots.txt or upload XML or TXT sitemaps.
- Backlinks. Upload backlink source data, and crawl the URLs to discover additional URLs with backlinks on your site. This can also be automatically brought in via integration with Majestic.
- Google Search Console. Use our Google Search Console integration to enrich your reports with data such as impressions, positions on a page, devices used, etc. You can also discover additional pages on your site which may not be linked. To use the integration, you will need to connect your Google Account to Lumar. See our ‘How to set up Google Search Console’ guide for more details.
- Analytics. Similarly, you can use our Google Analytics or Adobe Analytics integration, or upload analytics source data to discover additional landing pages on your site which may not be linked. Again, to use this you will need to connect your Google Account. See our ‘How to set up Google Analytics’ guide for more details.
- Log Summary. Upload log file summary data from log analyzer tools such as Logz.io or Splunk to get a view of how bots interact with your site. You can also upload log file data manually.
- URL lists. Crawl a fixed list of URLs, by uploading a list in a text file or CSV. Links on these pages will not be followed or crawled. This can be particularly useful for crawling a specific set of pages for accessibility issues, such as those that feature key templates used across the site.
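A URL list upload is typically a plain-text or CSV file with one URL per line. The file below is a hypothetical example (the domain and paths are placeholders) of the kind of list you might upload to crawl a handful of key templates:

```
https://www.example.com/
https://www.example.com/category/shoes
https://www.example.com/product/blue-running-shoe
https://www.example.com/blog/size-guide
https://www.example.com/contact
```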
Step 3 - Limits
In step 3 you can set the relevant limits for your crawl. We recommend starting with a small ‘website’ crawl to look for any signs that your site may be uncrawlable. The default options are to crawl 100 levels deep from the starting page, or a maximum of 100,000 URLs, whichever is reached first. For the first crawl, we recommend changing the second option to a maximum of 100 URLs. You can then choose whether to be notified, or to finish the crawl anyway, if the limit turns out not to be enough. For the initial, small crawl, you can set this to finish anyway.
The other option you have in this step is to set the crawl speed. Lumar’s crawler is capable of crawling as fast as your infrastructure allows (up to 350 URLs per second for JavaScript-rendered crawls in testing). However, crawling at too fast a rate means your server may not be able to keep up, leading to performance issues on your site. To avoid this, Lumar sets a low maximum crawl speed for your account. We can increase this, but it is essential to consult with your dev ops team to identify the crawl rate your infrastructure can handle.
You can set the crawl speed using the slider. You can also add restrictions to lower or raise the speed of crawls at particular times. For example, you may decide that you want a slower crawl rate during peak times for your site, but a faster rate in the early hours of the morning when traffic is low.
Step 4 - Settings
In step 4 you can start your crawl. Once you have completed your initial small crawl to make sure everything is correct, you’ll also be able to set a schedule to run crawls at regular intervals if required.
To set a schedule, choose your required frequency from the drop down and then choose your starting date and time. You can also choose ‘One Time’ to schedule a single, non-recurring crawl at a future date or time.
Underneath the Schedule options, you’ll also see a button for Advanced Settings. Clicking this will open up a range of additional options you can set as required. You’ll see a check on any elements that have settings applied (which are likely to have been added during the steps above) and can open up each section to add new settings or amend existing ones.
Note: Because all pages impact the overall user experience of a site, Site Speed crawls are set to ignore robots.txt by default. If required, you can change these settings in the Robots.txt section below before starting your crawl.
The advanced options are:
- Scope:
- Domain scope. Detailing the primary domain, whether sub-domains and both HTTP and HTTPS will be crawled, and any secondary domains that will be crawled. These may have been set in steps 1 and 2 above.
- URL scope.
- Here, you can choose to include only specific URL paths or exclude specific URL paths.
- For page grouping, you can create a new group, add a name for the page group, and add a regular expression in the 'Page URL Match' column, then select the percentage of matching URLs that you would like to crawl. URLs matching the designated path are counted, and once the limit has been reached, all further matching URLs go into the 'Page Group Restrictions' report and are not crawled.
- Resource restrictions. To define which types of URLs you want Lumar to crawl (e.g. non-HTML, CSS resources, images, etc.). You can also set Lumar to ignore an invalid SSL certificate.
- Link restrictions. To define which links you want Lumar to crawl (e.g. follow anchor links, pagination links, etc.).
- Redirect settings. To choose whether to follow internal or external redirects.
- Link validation. Where you can choose which links are crawled to see if they are responding correctly.
- Spider settings:
- Start URLs. This was set in Step 2 above but can be accessed and changed here.
- JavaScript rendering. Here you can enable or disable JavaScript rendering. You can also add any custom rejections, any additional custom JavaScript, and any external JavaScript resources.
- Crawler IP settings. Where you can select regional IPs if required. If your crawl is blocked, or you need to crawl behind a firewall (e.g. a staging environment), you will need to ask your web team to whitelist 52.5.118.182 and 52.86.188.211.
- User agent. By default, the crawler will use the Googlebot Smartphone user agent, but you can change this here if needed.
- Robots.txt. This allows you to exclude additional URLs using a custom robots.txt file, so you can test the impact of a new file before pushing it to the live environment. You can also select to ignore robots.txt for navigation requests and/or for resources. As mentioned above, Site Speed crawls are set to ignore robots.txt by default. If required, you can change these settings here.
- Mobile site. If your website has a separate mobile site, you can enter settings here to help Lumar use a mobile user agent when crawling the mobile URLs.
- Stealth mode crawl. Allows you to run a crawl as if it were performed by a set of real users.
- Custom request header. Where you can add any custom request headers that will be sent with every request.
- Cookies. This setting is mostly used for accessibility crawls, to ensure any cookie popup is cleared so the crawl can progress. This is not generally required for tech SEO crawls, but you can see how to configure cookie details here if you need to use it.
- Extraction:
- Custom extraction. Where you can use regular expressions to extract custom information from pages when they are crawled (see the sketch after this list).
- Test settings:
- Test site domain. Here you can enter your test environment domain to allow comparisons with your live site.
- Custom DNS. This allows custom DNS entries to be configured if your website does not have public DNS records (e.g. a staging environment).
- Authentication. To include authentication credentials in all requests using basic authentication.
- Remove URL parameters. If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the 'Remove URL Parameters' field under 'Advanced Settings.'
- URL rewriting. Add a regular expression to match a URL and add an output expression.
- Report setup:
- API callback. Where you can specify a URL to be called once your crawl has completed, to trigger an external application (a minimal receiver sketch follows this list).
- Crawl email alerts. To set whether to receive email notifications on the progress of your crawl, and specify the email addresses that will receive notifications.
- Report settings. Here you can specify additional settings for your reports.
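Several of the settings above (page grouping, custom extraction, and URL rewriting) are driven by regular expressions that you supply. The snippet below is not part of Lumar; it is a minimal sketch, using hypothetical patterns and URLs, that you could run locally to sanity-check your expressions before entering them in the crawl settings.

```python
import re

# Hypothetical custom extraction pattern: pull a product price out of a
# page's HTML. The class name and markup are examples only.
price_pattern = re.compile(r'<span class="price">([^<]+)</span>')
sample_html = '<div><span class="price">£49.99</span></div>'
match = price_pattern.search(sample_html)
print(match.group(1) if match else "no price found")   # £49.99

# Hypothetical URL rewriting rule: strip a tracking parameter (here,
# 'sessionid') so duplicate URLs are consolidated before crawling.
rewrite_pattern = re.compile(r'([?&])sessionid=[^&]*&?')
url = "https://www.example.com/product/blue-running-shoe?sessionid=abc123&colour=blue"
print(rewrite_pattern.sub(r'\1', url).rstrip('?&'))
# https://www.example.com/product/blue-running-shoe?colour=blue
```

Likewise, if you use the API callback setting, the URL you enter needs to point at an endpoint you control. The receiver below is a rough sketch that assumes the callback arrives as an HTTP POST and simply logs whatever it receives; check Lumar's API documentation for the exact request format and payload.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlCallbackHandler(BaseHTTPRequestHandler):
    """Logs the callback sent when a crawl completes, then returns 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length).decode("utf-8", errors="replace")
        print("Crawl finished, callback payload:", payload)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CrawlCallbackHandler).serve_forever()
```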
The Final Step
Once your first, smaller crawl has completed, take a look at the results to see if everything looks OK.
First, check the number of URLs crawled in the project summary. If you selected to crawl a maximum of 100 URLs in step 3, then the URL count should be around that number. If the URL count is 0, then it suggests your site has blocked the crawler, and the IP addresses mentioned above need to be whitelisted.
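Before asking your web team to make changes, you can run a quick, rough check of your own. The snippet below is not a Lumar feature, just a hypothetical sketch: it requests your homepage using the Googlebot Smartphone user agent (the crawler's default, as noted in the settings above) and prints the response status, so you can see whether bot-like traffic gets a 200 or something like a 403. Note that this only catches user-agent-based blocking; if the block is IP-based, the IP addresses above will still need to be whitelisted.

```python
import urllib.error
import urllib.request

# Googlebot Smartphone user agent (the Chrome version portion varies).
UA = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36 "
      "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

# Replace with your own homepage.
req = urllib.request.Request("https://www.example.com/", headers={"User-Agent": UA})
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("Status:", resp.status)
except urllib.error.HTTPError as err:
    print("Blocked or error:", err.code)
except urllib.error.URLError as err:
    print("Request failed:", err.reason)
```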
Secondly, check the domains of the URLs that are returned in the reports. Select the project from the project list and click on Accessibility Overview. You can then click on the top Accessibility Health Score Error on the right-hand side to get into a report.
Once the report opens, check the ‘Example URL’ that appears in the URL details column and check that the domain or subdomain is correct.
If everything looks OK, you can then return to step 3 of the crawl setup to increase the limits and run a full crawl.
Compare and Download Crawl Settings
Once your crawl is set up, you can also compare settings for the current and previous crawl, and download the settings in CSV format. To do this, select your project and then click on the crawl comparison button in the top right of the overview dashboard. From there you can then click the download button to download as CSV.
Handy Tips
Settings for Specific Requirements
Changing Crawl Rate
Analyze Outbound Links
Change User Agent