January 2022 - Crawler enhancements

We’re pleased to announce some improvements to our crawler! 

We’ve aligned more closely with Google’s behaviors through data improvements and by using the latest version of Google’s industry-leading rendering and parsing engine, Chromium. These changes also provide a foundation for future speed and scale improvements, and increase the breadth of data we will be able to extract from pages.

IMPORTANT NOTE: The crawler improvements are being rolled out to customers over the coming months, based on crawl frequency and URL volume. You’ll see an in-app notification when they have been enabled on your account. At this stage the improvements will be applied automatically to web crawls only, with additional sources following later in the year.

Alignment with Google Behaviors

Our crawler now uses Google’s industry-leading rendering and parsing engine, Chromium, to ensure we align closely with Google’s behaviors.

  • We now use Chromium to request pages, build the DOM and parse metrics for both JS-enabled and JS-disabled crawls. The actual HTML used to extract metrics is what you see in Chrome (Inspect > Elements), rather than in View Source. To compare the pre-rendered DOM like for like, turn off JS rendering in the browser (see the sketch after this list). 
  • Some characters are now stored and displayed differently, replicating what’s seen in Google Search Results more closely and making data easier to interpret. You may see some changes in metrics, including h1_tag, page_title and url_to_title. 
  • To align with Google, Noindex in robots.txt is no longer supported. Pages that are noindexed in robots.txt will now be reported as indexed. 
  • We now have a limit of 5,000 links per page in line with the rough ceiling estimate provided by Google. Links exceeding the limit will no longer be crawled.
  • The meta_noodp, meta_noydir, noodp and noydir metrics have been deprecated, reflecting changes made by Google. These metrics still exist, but now default to FALSE. 
  • All acceptable fallback values are reflected in the valid_twitter_card metric calculation. You may therefore see some differences in this metric. 
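
To illustrate the first point above, here is a minimal sketch of how the pre-rendered source and the Chromium-rendered DOM can be compared outside the platform. It assumes the requests and playwright Python packages (with Chromium installed via playwright install chromium); the URL is a placeholder and this is not Lumar’s own code.

```python
# Minimal sketch: compare the raw source (what "View Source" shows) with the
# Chromium-rendered DOM (what "Inspect > Elements" shows).
# Assumption: the URL below is a placeholder, not a Lumar endpoint.
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com/"

# Pre-rendered HTML, as served by the origin (no JavaScript executed).
raw_html = requests.get(URL, timeout=30).text

# Rendered DOM after Chromium has executed JavaScript.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    rendered_html = page.content()
    browser.close()

print("raw source length:", len(raw_html))
print("rendered DOM length:", len(rendered_html))
```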

Data Improvements

We’ve also made data improvements to reflect Google practices and enhance the insights you get from Lumar, including:

  • word_count and content_size are now calculated from the body only, improving these metrics (illustrated in the sketch after this list). You may see an increase in Thin Pages, as the reported content_size metric will be slightly lower. Duplicate body/page reports may also be impacted, as slightly different content is used to determine duplicate content. Some other metrics may also be affected, such as content_html_ratio and rendered_word_count_difference.
  • We’ve cleaned up whitespace to remove the risk of unique links caused by whitespace differences. You may see a number of added/removed unique links on the first crawl, due to different digests for anchor texts. 
  • The calculation of the meta_title_length_px metric has been improved. You may see slightly higher values for this metric and an increase in the Max Title Length report. 
  • Whitespace characters (\s, \n and \t) are now stripped from meta descriptions to ensure consistency in max length reports (also covered in the sketch after this list). You may see differences in the meta_description and meta_description_length metrics, and fewer pages in the Max Description Length report. 
  • Previously, HTTP status codes 300, 304, 305 and 306 were treated as redirects if a Location header was provided. These are now always treated as non-redirects, which may result in a drop in reported redirects. 
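
As a rough illustration of the first and fourth points above (body-only word counts and whitespace-normalized meta descriptions), here is a minimal Python sketch. It uses BeautifulSoup purely for demonstration; it is not Lumar’s own implementation, and the HTML snippet is made up.

```python
# Rough sketch of body-only word_count/content_size and whitespace-normalized
# meta descriptions. Assumes the beautifulsoup4 package; the HTML is made up.
import re
from bs4 import BeautifulSoup

html = """<html><head>
  <title>Example</title>
  <meta name="description" content="  An example
     description\twith   stray whitespace.  ">
</head><body><h1>Hello</h1><p>Two short paragraphs of body text.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# word_count / content_size derived from the <body> only, ignoring the <head>.
body_text = soup.body.get_text(" ", strip=True)
word_count = len(body_text.split())
content_size = len(body_text)

# Collapse \s, \n and \t runs in the meta description before measuring length.
meta = soup.find("meta", attrs={"name": "description"})["content"]
meta_description = re.sub(r"\s+", " ", meta).strip()
meta_description_length = len(meta_description)

print(word_count, content_size, meta_description_length)
```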

Additional Improvements

We’ve made a few other improvements, including:

  • Structured data metrics are now available on non-JS crawls. 
  • You can now run crawls on regional IPs and custom proxies with JS rendering on (including for stealth crawls). 
  • We have new metrics for AMP pages, which may mean you see more pages appearing in AMP page metric reports. The new metrics are: mobile_desktop_content_difference, mobile_desktop_content_mismatch, mobile_desktop_links_in_difference, mobile_desktop_links_out_difference and mobile_desktop_wordcount_difference. 
  • The maximum page size limit has been removed, allowing reporting on whatever Chromium can process within the timeout period. This may result in new pages (and additional pages linked from those pages) being crawled. 
  • We now report on unsupported resource types, providing additional data including HTTP status, content type and size. Non-HTML pages found in the crawl are also included in the crawl report. 
  • We’ve increased the number of redirects we follow from 10 to 15, so you’ll now be able to see longer redirect chains in reports (e.g. Redirect Chain); a sketch of following a chain with a hop cap appears after this list.
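
As an illustration of the last point, here is a small sketch of following a redirect chain with a hop cap. It uses the requests library directly rather than Lumar’s crawler, and the start URL is a placeholder.

```python
# Illustrative sketch: follow a redirect chain manually, capped at 15 hops
# (the enhanced crawler's new limit). Not Lumar's crawler code; the URL is
# a placeholder.
import requests
from urllib.parse import urljoin

MAX_REDIRECTS = 15
REDIRECT_STATUSES = {301, 302, 303, 307, 308}  # 300/304/305/306 are not redirects

def redirect_chain(url: str) -> list[str]:
    chain = [url]
    for _ in range(MAX_REDIRECTS):
        resp = requests.get(url, allow_redirects=False, timeout=30)
        location = resp.headers.get("Location")
        if resp.status_code in REDIRECT_STATUSES and location:
            url = urljoin(url, location)  # resolve relative Location headers
            chain.append(url)
        else:
            break
    return chain

print(redirect_chain("https://example.com/old-page"))
```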

Other Changes

A few final things to be aware of:

  • The order in which URLs are crawled during a web crawl has changed. When crawling partial levels, the URLs included may change the first time a crawl is run with the enhanced crawler. 
  • content_size is now calculated as character length rather than byte size (see the sketch after this list). This can affect content containing non-ASCII characters, which will now be counted as 1 character rather than 2 bytes. 
  • We’ve fixed a bug that meant meta refresh redirects may have previously been reported as primary pages. You may see a small drop in the number of primary pages as a result. 
  • The crawl rate is actually the processed URL rate, and a large number of URL fetches can complete simultaneously, resulting in a higher number of URLs processed. While the displayed crawl rate may appear higher than the desired setting, the actual crawl rate setting will not be exceeded. 
  • Headers now have capitalized keys, rather than lower-case keys.
  • Malformed URLs are no longer included in link counts. You may see a lower link_out_count metric value in reports. 
  • Any URLs that cannot be handled by the Chromium parser are now considered malformed, which may result in some differences in the Uncrawled URLs report. 
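
As a tiny illustration of the content_size change above, character length and UTF-8 byte size differ for non-ASCII text. This is just a Python demonstration of the arithmetic, not Lumar’s code.

```python
# Character length vs UTF-8 byte size: "é" is 1 character but 2 bytes.
text = "Café"

char_length = len(text)                # 4 characters (new behavior)
byte_size = len(text.encode("utf-8"))  # 5 bytes (old behavior)

print(char_length, byte_size)  # -> 4 5
```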

We’ll be providing more information on additional crawler improvements over the coming months. In the meantime, if you have any questions, please contact our support team, who will be happy to help.