Have you ever wondered how Google finds and catalogs all the pages on the web? The answer: It uses site crawlers. Search engines use crawlers to read the web and populate an internal database. Marketers who understand them can use that knowledge to improve their SEO performance and get their brands in front of more people. Learn how site crawlers work, and how you can use them to help optimize your website for search engine indexing.
What is a site crawler?
A site crawler, also known as a web crawler, is a program that automatically scans, downloads and extracts web data from the vast ocean of information on the internet. “Crawling” is the term for the way the program reviews each webpage and follows links on the page to discover other webpages.
Search engines use their own private web crawlers to understand websites and determine when to show them in search results. Publicly available crawlers, such as Screaming Frog, mimic this behavior to give website owners insight into how private crawlers evaluate their site.
How do site crawlers work?
If you think of the internet as a library, a web crawler is like a machine that automatically goes through and scans each page of every book, noting the results in its database. It reviews the content of a site, including the metadata, and indexes the page for search.
When a web crawler reads a site, it takes in the site’s HTML (the markup language used to structure and display web pages), paying special attention to the links on each page. It uses these links to build its understanding of how different pages and websites relate to each other.
Search engines use this information—the HTML and the link relationships—to determine which sites to display and how to rank them in search engine results.
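To make the process concrete, here is a minimal crawler sketch in Python: it fetches a page, records its title tag, and follows same-domain links to discover more pages. The start URL is a placeholder, and real crawlers add politeness delays, robots.txt checks, and far more robust error handling.

```python
# Minimal crawler sketch: fetch pages, record their <title> tags, and
# follow same-domain links to discover more pages.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    titles = {}  # URL -> page title

    while queue and len(titles) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue  # only parse HTML pages

        soup = BeautifulSoup(response.text, "html.parser")
        titles[url] = soup.title.string.strip() if soup.title and soup.title.string else ""

        # Follow each link, staying on the same domain
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles

if __name__ == "__main__":
    # "https://example.com" is a placeholder start URL
    for url, title in crawl("https://example.com").items():
        print(url, "->", title)
```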
What are the benefits of using site crawling tools?
- Makes your site accessible to search engines
- Reveals outdated pages on your site
- Finds redundant content
- Fixes broken links
- Understands your site structure
A web crawler mimics the settings and methods that search engines use in their own crawlers to give you a picture of how search engine bots perceive your site. Web crawlers are an audit tool, and if you can learn to interpret the results, you can create a powerful roadmap to improve your site’s search engine optimization (SEO) and user experience (UX). Here are some of the ways site crawlers can help you:
Makes your site accessible to search engines
Your site might not get crawled for a number of reasons. These include misconfigurations in your site’s files, such as restrictive robots.txt directives, incorrect canonical URLs, or security protocols that block bots. They can also be due to technical setup errors. For example, a crawl will detect a redirect loop, in which one URL redirects to another that then redirects back to the original, creating an endless chain of redirects. A web crawler will identify pages that aren’t accessible due to one of these issues.
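As an illustration, here is a rough sketch of how a crawler can trace a redirect chain or loop by following Location headers one hop at a time. The URL is a placeholder, and a production crawler would run this kind of check across every redirecting URL it finds.

```python
# Sketch: follow Location headers one hop at a time to expose redirect
# chains and loops. "https://example.com/old-page" is a placeholder URL.
from urllib.parse import urljoin

import requests

def trace_redirects(url, max_hops=10):
    hops = []
    while len(hops) < max_hops:
        if url in hops:
            return hops + [url], "redirect loop"  # came back to an earlier URL
        hops.append(url)
        response = requests.get(url, allow_redirects=False, timeout=10)
        if response.status_code not in (301, 302, 303, 307, 308):
            return hops, "resolved"  # reached a page that doesn't redirect
        url = urljoin(url, response.headers.get("Location", ""))
    return hops, "redirect chain too long"

hops, outcome = trace_redirects("https://example.com/old-page")
print(outcome, ":", " -> ".join(hops))
```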
Reveals outdated pages on your site
A web crawler will give you a full list of every crawlable page on your site. This will often lead to surprises. For example, your landing page showcasing your Black Friday promotion from 2021 is still up—and people are finding it on Google.
This is a great opportunity for a site cleanup. Web crawlers do a better job of this than tools like Google Analytics because they can pick up pages even if the page has no visits or analytics tags installed. You can address old pages by unpublishing them and redirecting the old, irrelevant URLs to the most relevant live one.
Finds redundant content
Duplicate content can confuse users and search engine crawlers, leading to lower rankings and conversion rates. A web crawler will highlight key signals of duplicate content, such as similar headers or identical title tags. These pages often arise when someone duplicates a product collection or blog post, publishes it, and forgets to update the copy.
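Here is a minimal sketch of how you might flag duplicate title tags yourself. It assumes a crawl export CSV; the filename and the “Address” and “Title 1” column names are placeholders, so adjust them to match whatever your crawling tool produces.

```python
# Sketch: group pages by title tag and flag titles used on more than one
# page. "crawl_export.csv" and the column names are placeholders; match
# them to your crawling tool's export format.
import csv
from collections import defaultdict

pages_by_title = defaultdict(list)
with open("crawl_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        title = (row.get("Title 1") or "").strip()
        if title:
            pages_by_title[title].append(row["Address"])

for title, urls in sorted(pages_by_title.items()):
    if len(urls) > 1:
        print(f"Duplicate title used on {len(urls)} pages: {title!r}")
        for url in urls:
            print("   ", url)
```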
Fixes broken links
Crawlers provide a report of every link on your site and its HTTP status code (such as 200, 301, or 404), which shows whether the link resolves to a live page. By filtering to links that return a 404, you can identify broken links on your site. Broken links, whether internal or external, are a poor SEO signal to Google and hurt the user experience.
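For example, a small script like the sketch below can check a handful of links and flag any that don’t resolve to a live page. The URLs are placeholders standing in for the link list a crawl report would give you.

```python
# Sketch: check a list of links and flag any that don't return a 200
# status. The URLs are placeholders for the link list from a crawl report.
import requests

links = [
    "https://example.com/",
    "https://example.com/products/widget",
    "https://example.com/blog/old-post",
]

for url in links:
    try:
        # HEAD is lighter than GET; follow redirects to get the final status
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        status = None  # DNS failure, timeout, etc.
    if status != 200:
        print(f"Broken or problematic link: {url} (status: {status})")
```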
Understands your site structure
Site crawlers can provide reports on the number of pages per folder (e.g., pages within /products/ or /collections/) on your site. They can also report on the internal links pointing to different sections of your site and your pages’ crawl depth (how many clicks a page is from the start of the crawl, typically the homepage). Screaming Frog, for example, offers automated visualizations of these reports. These provide insight into the breadth and depth of your site. You could learn, for example, how many blog posts each category contains or your ratio of collection pages to product pages.
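As a simple illustration, the sketch below summarizes a list of crawled URLs by top-level folder and by URL depth (the number of subfolders in the path). The URLs are placeholders standing in for a crawl export.

```python
# Sketch: summarize crawled URLs by top-level folder and by URL depth
# (number of subfolders in the path). The URLs are placeholders.
from collections import Counter
from urllib.parse import urlparse

urls = [
    "https://example.com/products/widget",
    "https://example.com/products/gadget",
    "https://example.com/collections/sale",
    "https://example.com/blog/2021/black-friday-promo",
]

pages_per_folder = Counter()
pages_per_depth = Counter()
for url in urls:
    segments = [s for s in urlparse(url).path.split("/") if s]
    pages_per_folder[segments[0] if segments else "(root)"] += 1
    pages_per_depth[len(segments)] += 1

print("Pages per top-level folder:", dict(pages_per_folder))
print("Pages per URL depth:", dict(pages_per_depth))
```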
How do you use site crawling tools?
There are many web crawlers available to marketers, including Screaming Frog’s SEO Spider and Semrush’s Site Audit. All crawlers operate on the same principles and usually follow these four steps:
1. Determine your crawl set
You can tell a web crawler to scan every page on your domain, but that might not be the best dataset. You might ask it to crawl only the pages in your XML sitemap, only pages within the /products/ subdirectory, or only web pages in a list from Google Sheets. Most crawlers can accept these types of starting instructions.
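For instance, if you want to limit the crawl set to the pages in your XML sitemap, the sketch below shows one way to pull that list. The sitemap URL is a placeholder.

```python
# Sketch: build a crawl set from an XML sitemap instead of discovering
# pages by following links. The sitemap URL is a placeholder.
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

response = requests.get("https://example.com/sitemap.xml", timeout=10)
root = ET.fromstring(response.content)
crawl_set = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

print(f"{len(crawl_set)} URLs in the crawl set")
for url in crawl_set[:5]:
    print(url)
```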
2. Configure crawl settings
There are many settings within a crawler. Here are examples of some of the most important ones:
Bot mimicking
You can set your crawler to act like the Google crawler, Bing crawler, or other search engine crawlers.
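Under the hood, this usually means sending the chosen bot’s user-agent string with each request. The sketch below shows the idea using Googlebot’s published user-agent string; the URL is a placeholder.

```python
# Sketch: fetch a page while identifying as Googlebot via the User-Agent
# header. The URL is a placeholder; the UA string follows Googlebot's
# published format.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code, len(response.content), "bytes of HTML returned")
```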
Follow directives
A robots.txt file serves as a guide, instructing search engines on which pages to crawl and which ones to avoid. You can tell your crawler to follow the directives in your robots.txt file or ignore them. The right setting depends on whether you want to validate your robots.txt directives or are already confident in them.
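If you want to spot-check your own robots.txt directives the way a crawler would, Python’s standard-library parser offers a quick way to do it. The site, bot name, and URLs below are placeholders.

```python
# Sketch: check which URLs a given bot may crawl according to robots.txt,
# using Python's standard-library parser. The site, bot name, and URLs
# are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

for url in ["https://example.com/products/widget", "https://example.com/cart"]:
    allowed = parser.can_fetch("Googlebot", url)
    print(url, "-> allowed" if allowed else "-> blocked by robots.txt")
```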
JavaScript rendering
You can tell your crawler whether to render JavaScript, so it can crawl sites that load content with JavaScript, or to read only the raw HTML.
3. Review crawl data
When a site crawl is complete, you will receive a series of reports, typically one large spreadsheet or several. These reports list the pages that match the crawler’s error filters: for example, broken links, pages with duplicate title tags, or pages with no internal links. The marketer’s job is to review these reports and make an action plan to address the errors.
4. Prioritize action items
The hardest part is determining which reports from the crawl are most pressing. For example, duplicate title tags are a larger issue than duplicate meta descriptions, since search engines weigh title tags more heavily when evaluating the content of a web page. If you’re unsure how to prioritize, consider hiring an SEO expert.
Site crawler FAQ
What are examples of site crawlers?
An example of a search engine crawler is Googlebot, the crawler Google uses to populate its search results. An example of a site crawler that’s free for up to 500 URLs per crawl (with a charge beyond that) is SEO Spider by Screaming Frog.
Are site crawlers legal?
Yes, site crawlers are legal. However, what a business does with information from crawlers may or may not be legal. For example, in Argentina, users have a right to be forgotten, and it is illegal for search engines to show web pages that an individual has asked to be taken down.
What are the two types of site crawlers?
In practice, there are two types of site crawlers: automated programs that search engines use to crawl and index the entire web, and tools that crawl individual websites and generate website audit reports on them.
What do site crawlers do?
Site crawlers download and categorize the information on the internet. They do this to inform ranking algorithms or generate reports.