Running frequent and targeted crawls of your website is a key part of improving its technical health and its rankings in organic search. In this guide, you'll learn how to crawl a website efficiently and effectively with Lumar.
Getting Started
Once you've logged into Lumar Analyze, click on the New Project button in the top right corner of the screen.
You can then choose between an SEO project, which generates reports and metrics to optimize for search engines, or a Site Speed project, which generates reports and metrics to optimize for speed performance. Select the relevant option to move to Step 1 in the crawl setup process.
Step 1 - Basic Information
Before starting a crawl, it’s a good idea to get an understanding of your site’s domain structure. Enter your domain name into the ‘Domain’ field and click ‘Check’. You’ll then see a purple thumbs-up for the relevant domain.
If you’d like the crawl to include any sub-domains it finds, check the ‘Crawl sub-domains’ option. Just underneath that, you can also choose whether to crawl both HTTP and HTTPS.
The project name will automatically populate with the domain name, but you can change this to anything you like to help you identify the project.
If you have subscribed to our Impact functionality, you will also be able to see benchmarks against the health scores. You can choose the industry to benchmark against in the ‘Industry’ drop-down (or choose ‘all industries’).
At the bottom of the screen, in the 'SEO Settings' or 'Site Speed Settings' section you can choose which user agent you want to use for the crawl.
Previewing Project Setup
You can access the preview in step 1 of the crawl setup process, where you’ll see an option to ‘Save & Preview’.
When the preview opens, you can choose to view a screenshot, the rendered HTML, static HTML, and response headers. Once you’ve reviewed the preview and everything looks OK, you can go back to the settings using the button in the bottom left corner, or the x in the top right.
You can also access the ‘Save and Preview’ option in step 4 of the crawl setup process. Just click on Advanced Settings, and you’ll see the option just underneath.
Step 2 - Sources
There are seven different types of URL sources you can include in your Lumar projects.
Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics data, as well as other data types. Check the box next to the relevant source to include the different elements in your crawl.
- Website. Crawl only the site by following its links to deeper levels. The crawl will start from your primary domain by default, but if you need it to start from a different point or multiple points, you can also specify those by expanding the website option.
- Sitemaps. Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled. When you expand the options here, you can also manually add sitemaps, select or deselect different sitemaps, and choose whether to discover and crawl new sitemaps in robots.txt or upload XML or TXT sitemaps.
- Backlinks. Upload backlink source data, and crawl the URLs to discover additional URLs with backlinks on your site. This can also be automatically brought in via integration with Majestic.
- Google Search Console. Use our Google Search Console integration to enrich your reports with data such as impressions, positions on a page, devices used, etc. You can also discover additional pages on your site which may not be linked. To use the integration, you will need to connect your Google Account to Lumar. See our ‘How to set up Google Search Console’ guide for more details.
- Analytics. Similarly, you can use our Google Analytics or Adobe Analytics integration, or upload analytics source data to discover additional landing pages on your site which may not be linked. Again, to use this you will need to connect your Google Account. See our ‘How to set up Google Analytics’ guide for more details.
- Log Summary. Upload log file summary data from log analyzer tools such as Logz.io or Splunk to get a view of how bots interact with your site. You can also upload log file data manually.
- URL lists. Crawl a fixed list of URLs, by uploading a list in a text file or CSV. Links on these pages will not be followed or crawled. This can be particularly useful for crawling a specific set of pages for accessibility issues, such as those that feature key templates used across the site.
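A URL list upload is typically a plain-text or CSV file with one URL per line. The file below is a hypothetical example (the domain and paths are placeholders) of the kind of list you might upload to crawl a handful of key templates:

```
https://www.example.com/
https://www.example.com/category/shoes
https://www.example.com/product/blue-running-shoe
https://www.example.com/blog/size-guide
https://www.example.com/contact
```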
Step 3 - Limits
In step 3 you can set the relevant limits for your crawl. We recommend starting with a small ‘website’ crawl to look for any signs that your site may be uncrawlable. The default options are to crawl 100 levels deep from the starting page, or a maximum of 100,000 URLs, whichever is reached first. For the first crawl, we recommend changing the second option to a maximum of 100 URLs. You can then choose whether to be notified, or to finish the crawl anyway, if the limit turns out not to be enough. For the initial, small crawl, you can set this to finish anyway.
The other option you have in this step is to set the crawl speed. Lumar’s crawler is capable of crawling as fast as your infrastructure allows (up to 350 URLs per second for JavaScript-rendered crawls in testing). However, crawling at too fast a rate means your server may not be able to keep up, leading to performance issues on your site. To avoid this, Lumar sets a low maximum crawl speed for your account. We can increase this, but it is essential to consult with your dev ops team to identify the crawl rate your infrastructure can handle.
You can set the crawl speed using the slider. You can also add restrictions to lower or raise the speed of crawls at particular times. For example, you may decide that you want a slower crawl rate during peak times for your site, but a faster rate in the early hours of the morning when traffic is low.
Step 4 - Settings
In step 4 you can start your crawl. Once you have completed your initial small crawl to make sure everything is correct, you’ll also be able to set a schedule to run crawls at regular intervals if required.
To set a schedule, choose your required frequency from the drop down and then choose your starting date and time. You can also choose ‘One Time’ to schedule a single, non-recurring crawl at a future date or time.
Underneath the Schedule options, you’ll also see a button for Advanced Settings. Clicking this will open up a range of additional options you can set as required. You’ll see a check on any elements that have settings applied (which are likely to have been added during the steps above) and can open up each section to add new settings or amend existing ones.
Note: Because all pages impact the overall user experience of a site, Site Speed crawls are set to ignore robots.txt by default. If required, you can change these settings in the Robots.txt section below before starting your crawl.
The advanced options are:
- Scope:
- Domain scope. Detailing the primary domain, whether sub-domains and both HTTP and HTTPS will be crawled, and any secondary domains that will be crawled. These may have been set in steps 1 and 2 above.
- URL scope.
- Here, you can choose to include only specific URL paths or exclude specific URL paths.
- For page grouping, you can create a new group, add a name for the page group, and add a regular expression in the 'Page URL Match' column, then select the percentage of matching URLs that you would like to crawl. URLs matching the designated path are counted, and once the limit has been reached, all further matching URLs go into the 'Page Group Restrictions' report and are not crawled.
- Resource restrictions. To define which types of URLs you want Lumar to crawl (e.g. non-HTML, CSS resources, images, etc.). You can also set Lumar to ignore an invalid SSL certificate.
- Link restrictions. To define which links you want Lumar to crawl (e.g. follow anchor links, pagination links, etc.).
- Redirect settings. To choose whether to follow internal or external redirects.
- Link validation. Where you can choose which links are crawled to see if they are responding correctly.
- Spider settings:
- Start URLs. This was set in Step 2 above but can be accessed and changed here.
- JavaScript rendering. Here you can enable or disable JavaScript rendering. You can also add any custom rejections, any additional custom JavaScript, and any external JavaScript resources.
- Crawler IP settings. Where you can select regional IPs if required. If your crawl is blocked, or you need to crawl behind a firewall (e.g. a staging environment), you will need to ask your web team to whitelist 52.5.118.182 and 52.86.188.211.
- User agent. By default, the crawler will use the Googlebot Smartphone user agent, but you can change this here if needed.
- Robots.txt. This allows you to exclude additional URLs using a custom robots.txt file, so you can test the impact of a new file before pushing it to the live environment. You can also select to ignore robots.txt for navigation requests and/or for resources. As mentioned above, Site Speed crawls are set to ignore robots.txt by default. If required, you can change these settings here.
- Mobile site. If your website has a separate mobile site, you can enter settings here to help Lumar use a mobile user agent when crawling the mobile URLs.
- Stealth mode crawl. Allows you to run a crawl as if it were performed by a set of real users.
- Custom request header. Where you can add any custom request headers that will be sent with every request.
- Cookies. This setting is mostly used for accessibility crawls, to ensure any cookie popup is cleared so the crawl can progress. This is not generally required for tech SEO crawls, but you can see how to configure cookie details here if you need to use it.
- Extraction:
- Custom extraction. Where you can use regular expressions to extract custom information from pages when they are crawled (see the sketch after this list).
- Test settings:
- Test site domain. Here you can enter your test environment domain to allow comparisons with your live site.
- Custom DNS. This allows custom DNS entries to be configured if your website does not have public DNS records (e.g. a staging environment).
- Authentication. To include authentication credentials in all requests using basic authentication.
- Remove URL parameters. If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the 'Remove URL Parameters' field under 'Advanced Settings.'
- URL rewriting. Add a regular expression to match a URL and add an output expression.
- Report setup:
- API callback. Where you can specify a URL to be called once your crawl has completed, to trigger an external application (a minimal receiver sketch follows this list).
- Crawl email alerts. To set whether to receive email notifications on the progress of your crawl, and specify the email addresses that will receive notifications.
- Report settings. Here you can specify additional settings for your reports.
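Several of the settings above (page grouping, custom extraction, and URL rewriting) are driven by regular expressions that you supply. The snippet below is not part of Lumar; it is a minimal sketch, using hypothetical patterns and URLs, that you could run locally to sanity-check your expressions before entering them in the crawl settings.

```python
import re

# Hypothetical custom extraction pattern: pull a product price out of a
# page's HTML. The class name and markup are examples only.
price_pattern = re.compile(r'<span class="price">([^<]+)</span>')
sample_html = '<div><span class="price">£49.99</span></div>'
match = price_pattern.search(sample_html)
print(match.group(1) if match else "no price found")   # £49.99

# Hypothetical URL rewriting rule: strip a tracking parameter (here,
# 'sessionid') so duplicate URLs are consolidated before crawling.
rewrite_pattern = re.compile(r'([?&])sessionid=[^&]*&?')
url = "https://www.example.com/product/blue-running-shoe?sessionid=abc123&colour=blue"
print(rewrite_pattern.sub(r'\1', url).rstrip('?&'))
# https://www.example.com/product/blue-running-shoe?colour=blue
```

Likewise, if you use the API callback setting, the URL you enter needs to point at an endpoint you control. The receiver below is a rough sketch that assumes the callback arrives as an HTTP POST and simply logs whatever it receives; check Lumar's API documentation for the exact request format and payload.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlCallbackHandler(BaseHTTPRequestHandler):
    """Logs the callback sent when a crawl completes, then returns 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length).decode("utf-8", errors="replace")
        print("Crawl finished, callback payload:", payload)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CrawlCallbackHandler).serve_forever()
```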
The Final Step
Once your first, smaller crawl has completed, take a look at the results to see if everything looks OK.
First, check the number of URLs crawled in the project summary. If you selected to crawl a maximum of 100 URLs in step 3, then the URL count should be around that number. If the URL count is 0, then it suggests your site has blocked the crawler, and the IP addresses mentioned above need to be whitelisted.
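Before asking your web team to make changes, you can run a quick, rough check of your own. The snippet below is not a Lumar feature, just a hypothetical sketch: it requests your homepage using the Googlebot Smartphone user agent (the crawler's default, as noted in the settings above) and prints the response status, so you can see whether bot-like traffic gets a 200 or something like a 403. Note that this only catches user-agent-based blocking; if the block is IP-based, the IP addresses above will still need to be whitelisted.

```python
import urllib.error
import urllib.request

# Googlebot Smartphone user agent (the Chrome version portion varies).
UA = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36 "
      "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

# Replace with your own homepage.
req = urllib.request.Request("https://www.example.com/", headers={"User-Agent": UA})
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("Status:", resp.status)
except urllib.error.HTTPError as err:
    print("Blocked or error:", err.code)
except urllib.error.URLError as err:
    print("Request failed:", err.reason)
```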
Secondly, check the domains of the URLs that are returned in the reports. Select the project from the project list and click on Accessibility Overview. You can then click on the top Accessibility Health Score Error on the right-hand side to get into a report.
Once the report opens, check the ‘Example URL’ that appears in the URL details column and check that the domain or subdomain is correct.
If everything looks OK, you can then return to step 3 of the crawl setup to increase the limits and run a full crawl.
Compare and Download Crawl Settings
Once your crawl is set up, you can also compare settings for the current and previous crawl, and download the settings in CSV format. To do this, select your project and then click on the crawl comparison button in the top right of the overview dashboard. From there you can then click the download button to download as CSV.
Handy Tips
Settings for Specific Requirements
Changing Crawl Rate
Analyze Outbound Links
Change User Agent