What is a Crawler?
A crawler (also known as a robot or spider) is an internet program that systematically browses the web. Search engines primarily use crawlers to discover, process, and index web pages, allowing them to appear in search results.
Beyond processing HTML, some specialised crawlers are also used to index images and videos. The most important web crawlers to be aware of are those used by the world’s leading search engines: Googlebot, Bingbot, Yandex Bot, and Baidu Spider.
Generally, a search engine crawler’s purpose is to find out what is on your website and add that information to the search index. Your site’s crawlability matters because if it can’t be crawled, your pages and content cannot appear in Google’s search results.
Types of Crawlers
There are two main types of crawlers:
- On-Demand Bots: These bots crawl a restricted number of pages, and only upon request; an example is Ahrefs’ site audit bot.
- Constant Crawling Bots: These bots continuously crawl new and existing pages without needing a request – an example is Googlebot.
Good vs Bad Crawlers
A good crawler can benefit your site by adding your content to a search index or assisting with website audits. Key features of a good crawler include identifying itself, following your directives, and adjusting its crawling rate to avoid overloading your server.
In contrast, a bad crawler provides no value to website owners and may have malicious intent. Web properties use robots.txt files and on-page directives to indicate which pages should be crawled and indexed, but bad crawlers may ignore these, fail to identify themselves, overload servers, and even steal content and data.
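To make this concrete, here is a minimal sketch using Python’s standard-library robots.txt parser, showing the directive check a good crawler performs before every fetch. The rules and the “GoodBot” name below are invented for illustration.

```python
# A minimal sketch of robots.txt directive checking, using Python's
# standard-library parser. The rules and the "GoodBot" name are invented
# for illustration; a well-behaved crawler runs a check like can_fetch()
# before requesting any URL.
from urllib import robotparser

rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GoodBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("GoodBot", "https://example.com/blog/post"))     # True
print(rp.crawl_delay("GoodBot"))  # 5 – the polite delay between requests
```

A bad crawler, by contrast, would simply ignore these answers and fetch the disallowed URLs anyway.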
How Do Web Crawlers Work?
Now that we’ve covered what crawlers are and why they’re important, let’s take a closer look at how search engine crawlers actually work.
In essence, a web crawler like Googlebot discovers URLs on your website through sitemaps, internal links, and manual submissions via Google Search Console. It follows the “allowed” links on those pages while adhering to the rules set in your robots.txt file and respecting any “nofollow” attributes on links and pages.
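As a rough illustration of that loop, the sketch below uses only the Python standard library: it reads the site’s robots.txt once, fetches pages, extracts links while skipping any marked rel="nofollow", and queues same-site URLs for later visits. The bot name and overall structure are hypothetical simplifications, not how Googlebot is actually implemented.

```python
# A simplified crawl loop: check robots.txt, fetch a page, collect its
# followable links, and queue them. Standard library only; "ExampleBot"
# is a hypothetical crawler identity.
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

USER_AGENT = "ExampleBot/1.0"  # hypothetical crawler identifier

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, skipping rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "nofollow" in (attrs.get("rel") or ""):
            return  # respect the nofollow hint: do not queue this link
        if attrs.get("href"):
            self.links.append(attrs["href"])

def crawl(start_url, max_pages=10):
    site = urlparse(start_url)
    robots = robotparser.RobotFileParser(
        f"{site.scheme}://{site.netloc}/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt once up front

    frontier, seen = [start_url], set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen or not robots.can_fetch(USER_AGENT, url):
            continue  # skip already-crawled or disallowed URLs
        seen.add(url)
        request = Request(url, headers={"User-Agent": USER_AGENT})
        html = urlopen(request).read().decode("utf-8", errors="replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == site.netloc:
                frontier.append(absolute)  # stay on the same site
    return seen
```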
It’s also worth noting that some websites—particularly those with over 1 million pages that are regularly updated or with 10,000 pages of frequently changing content—may have a limited crawl budget. A crawl budget refers to the amount of time and resources a bot can allocate to a website in a single session. While the concept of crawl budgets generates a lot of discussion in SEO communities, the majority of website owners won’t need to worry about it.
Crawl Priorities
Due to the limited capacity of crawl budgets, crawlers operate based on a set of priorities. For example, Googlebot considers factors like:
- The PageRank of the URL
- How often the page is updated
- Whether the page is new
This allows the crawler to focus on the most important pages on your site first.
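Google doesn’t publish its scheduling algorithm, but a toy priority queue conveys the general idea: each discovered URL receives a score from signals like those above, and the crawler always pops the highest-scoring URL next. The weights and values below are invented purely for illustration.

```python
# A toy sketch (not Google's actual algorithm) of a priority-ordered crawl
# frontier. The scoring weights and signal values are invented.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class CrawlTask:
    priority: float               # lower value = crawled sooner (min-heap)
    url: str = field(compare=False)

def score(pagerank: float, update_freq: float, is_new: bool) -> float:
    # Combine the signals; negate so more important URLs sort first.
    return -(pagerank + 0.5 * update_freq + (1.0 if is_new else 0.0))

frontier = []
heapq.heappush(frontier, CrawlTask(score(0.9, 0.2, False), "https://example.com/"))
heapq.heappush(frontier, CrawlTask(score(0.3, 0.8, True), "https://example.com/new-post"))

print(heapq.heappop(frontier).url)  # the higher-priority URL is crawled first
```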
Mobile vs. Desktop Crawler Versions
Googlebot, for instance, has two main versions: Googlebot Desktop and Googlebot Smartphone. With Google’s shift to mobile-first indexing, its smartphone agent is now the primary bot used for crawling and indexing pages.
It’s important to understand that different versions of a website may be served to these different crawlers. Technically, a bot identifies itself to the web server through the User-Agent HTTP request header, which carries a unique identifier string.
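As an illustration, a server-side check might branch on that header. The logic below is a deliberate simplification; the abbreviated User-Agent string is modelled on Googlebot’s published format, and since headers can be spoofed, real verification typically adds a reverse DNS lookup.

```python
# A simplified sketch of telling Googlebot variants apart via the User-Agent
# header. Do not trust this header alone in production: it is trivially
# spoofed, so pair it with a reverse DNS check.
def classify_crawler(user_agent: str) -> str:
    if "Googlebot" not in user_agent:
        return "not Googlebot"
    # The smartphone agent advertises a mobile browser in its token list.
    return "Googlebot Smartphone" if "Mobile" in user_agent else "Googlebot Desktop"

# Abbreviated User-Agent modelled on Googlebot's published format.
smartphone_ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile "
    "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
print(classify_crawler(smartphone_ua))  # -> Googlebot Smartphone
```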