What is Googlebot?
Googlebot is the name for Google’s web crawlers, or “spiders,” responsible for discovering new and updated content across the internet and collecting the information needed to index it. Google uses this information to build and update its search index.
While Google uses many different crawlers, the name Googlebot usually refers to two in particular:
- Googlebot Smartphone, which simulates a user on a mobile device and is the primary crawler for Google’s search index.
- Googlebot Desktop, which simulates a user on a desktop computer.
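In server logs, the two crawlers identify themselves with different user-agent strings. The examples below follow the format Google documents, with Chrome/W.X.Y.Z standing in for a browser version token that changes over time, so treat them as illustrative rather than exact:

```text
# Googlebot Smartphone
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

# Googlebot Desktop
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36
```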
Why Is Googlebot Important?
Google’s search index forms the basis for all Google search results. For a website to appear in those results, Googlebot needs to find, crawl, and index its pages. Without Googlebot, a site’s content would essentially be invisible to Google, no matter how relevant or valuable it might be.
Are Crawling and Indexing the Same?
No. Crawling refers to finding information and discovering pages on the web. Indexing, on the other hand, refers to storing, analysing and organising the information that is found while crawling.
How Does Googlebot Work?
The crawling process can be broken down into two stages:
- URL Discovery: This is the process by which Googlebot finds new and existing URLs on the internet. It does so by revisiting URLs it has crawled before, following the links on those pages, and reading the URLs listed in submitted sitemaps.
- Fetching: Googlebot does not crawl every discovered URL at once. A URL it has never crawled before is queued for a first crawl; for URLs it has already visited, it reviews various signals to determine whether the page has changed since its last visit and whether it is worth crawling again (see the sketch after this list).
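As a mental model only, and not Google’s actual implementation, the two stages can be sketched as a simple loop over a queue of URLs: fetching pulls a URL off the queue and downloads it, and discovery adds any new links found on that page back to the queue. The seed URL below is a placeholder.

```python
# A toy sketch of the discovery + fetching loop, for illustration only.
# The real Googlebot is far more sophisticated (scheduling, politeness,
# robots.txt handling, rendering, deduplication, and so on).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # URL discovery: known + newly found URLs
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            # Fetching: download the page, capping the read size
            raw = urlopen(url, timeout=10).read(15 * 1024 * 1024)
        except OSError:
            continue              # skip URLs that fail to fetch
        html = raw.decode("utf-8", "replace")
        pages[url] = html         # content handed off for indexing
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:  # discovery: follow links on the fetched page
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

if __name__ == "__main__":
    crawled = crawl(["https://example.com/"])
    print(f"Fetched {len(crawled)} pages")
```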
Note: Googlebot only crawls the first 15 MB of each HTML, CSS, and JavaScript file on a page. If a file is any larger, it stops crawling it and sends only the first 15 MB for indexing.
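If you want a rough sense of whether a page comes anywhere near that limit, you can measure the size of its raw HTML response. This is only a quick check on the HTML file itself, not on CSS or JavaScript fetched separately, and the URL is a placeholder:

```python
# Rough size check for a single HTML file
from urllib.request import urlopen

LIMIT = 15 * 1024 * 1024  # Googlebot's 15 MB per-file limit

html_bytes = urlopen("https://www.example.com/").read()
print(f"{len(html_bytes)} bytes fetched; over the limit: {len(html_bytes) > LIMIT}")
```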
Googlebot Best Practices
Crawling is not a one-time event; it happens periodically. To ensure that your webpages are crawled correctly, there are several best practices to follow and maintain. These include:
- Check your robots.txt file: Ensure your robots.txt file is correctly configured to allow Googlebot to crawl the pages you want indexed, and be careful not to accidentally block important pages (see the examples after this list).
- Submit your sitemap: Regularly submit your XML sitemap through Google Search Console. This helps Googlebot discover all the important pages on your site, especially new or updated content.
- Use crawler directives properly: Apply robots meta tags such as noindex and nofollow, along with canonical tags, wisely to guide Googlebot on which pages to crawl, index, or ignore.
- Build an internal linking strategy: Use internal links to help Googlebot navigate your site. Proper internal linking can signal the importance of specific pages, making them more likely to be crawled and indexed.
- Identify and fix crawlability and indexability issues: Regularly audit your website using tools like Google Search Console to spot and resolve issues that may prevent Googlebot from crawling or indexing your content.
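To make the robots.txt, sitemap, and crawler-directive items concrete, here is roughly what they look like in practice. The domain and paths are placeholders, so adjust them to your own site:

```text
# robots.txt - let crawlers reach everything except a private area,
# and point them at the XML sitemap
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```

```html
<!-- Crawler directives in a page's <head> (illustrative values) -->
<meta name="robots" content="noindex, nofollow">
<link rel="canonical" href="https://www.example.com/preferred-page/">
```

Keep in mind that a noindex directive only takes effect if Googlebot can actually crawl the page; if the page is blocked in robots.txt, the directive will never be seen.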
Verifying Googlebot
Sometimes, webmasters and developers create crawlers that pretend to be Googlebot in order to access websites that might otherwise block them.
Previously, website owners had to verify Googlebot with a reverse DNS lookup: resolving the requesting IP address back to a hostname, checking that the hostname belongs to Google, and then confirming that the hostname resolves back to the same IP address.
However, Google has also published a list of the public IP ranges Googlebot crawls from, which makes verification simpler: if a request claiming to be Googlebot comes from an IP address outside those ranges, it is not the real Googlebot.
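Here is a minimal sketch of that IP-based check, assuming the JSON feed of Googlebot IP ranges published on Google’s developer site (the URL below is correct at the time of writing; check Google’s documentation if it moves). The reverse DNS method described above still works as well.

```python
# Minimal sketch: verify an IP against Google's published Googlebot ranges
import ipaddress
import json
from urllib.request import urlopen

RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    """Download the published list and parse each prefix into a network object."""
    data = json.load(urlopen(RANGES_URL, timeout=10))
    networks = []
    for prefix in data["prefixes"]:
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks

def is_googlebot(ip_string, networks):
    """True if the requesting IP falls inside one of the published ranges."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in network for network in networks)

if __name__ == "__main__":
    networks = load_googlebot_networks()
    # Replace with an IP address taken from your own server logs
    print(is_googlebot("66.249.66.1", networks))
```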