Robots.txt

What is Robots.txt? 

Sometimes, you don’t want web crawlers to access specific URLs on your website. In that case, you can use a robots.txt file to block certain pages or sections of your site. This file gives instructions to search engine bots about which parts of your site they should or shouldn’t crawl. 

How Does Robots.txt Work? 

The robots.txt file is a simple text file located in the root directory of your website. When a crawler visits your website, it checks the robots.txt file for any guidelines before it starts crawling. 
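
For example, a site’s robots.txt is served from a URL like https://example.com/robots.txt (example.com is just a placeholder domain here). A minimal file that allows every crawler to access everything looks like this:

  User-agent: *
  Disallow:

An empty Disallow value means nothing is blocked, while Disallow: / would block the entire site.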

Note: Not all crawlers are “good”, and some may ignore these instructions entirely. These are usually scrapers, which extract data from your site without permission. 

You can tell crawlers which pages they are allowed to visit and which ones to avoid by specifying directives like “allow” or “disallow.” For example, if you have pages you don’t want crawlers spending time on, such as admin sections or duplicate content, you can block them by adding the appropriate instructions to your robots.txt file. 

However, it’s important to note that while most crawlers will adhere to the rules specified in robots.txt, the file should only be used to manage and optimise crawler resources, not to control the indexing of pages. This is because a disallowed URL can still be indexed if it is discovered through an external link.  

Robots.txt Features: 

  • user-agent: The crawler the rules will apply to 
  • disallow: A path that must not be crawled or accessed 
  • allow: An optional field that specifies a path that can be crawled  
  • sitemap: An optional field that describes the location of the sitemap file 
  • crawl-delay: An optional field that controls the crawling speed. This is, however, not supported by Googlebot.  

For Example:  

  User-agent: Googlebot 
  Disallow: /admin/ 
  Allow: /admin/allowed-page/ 

So, this example blocks Googlebot from crawling the /admin/ directory but allows it to crawl /admin/allowed-page/.  
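
The optional fields follow the same pattern. As a rough sketch (the sitemap URL and delay value here are only placeholders), a file that also lists a sitemap and asks crawlers to slow down could look like this:

  User-agent: * 
  Crawl-delay: 10 
  Disallow: /admin/ 
  Allow: /admin/allowed-page/ 

  Sitemap: https://example.com/sitemap.xml 

Crawl-delay asks supporting crawlers to wait (roughly ten seconds in this case) between requests, and Sitemap points to the full URL of your sitemap file.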

Why Is Robots.txt Important? 

The primary benefit of a robots.txt file is that it allows you to optimise the crawl budget of your website. Search engines have a limited amount of time to crawl each site, so by using robots.txt, you can direct them away from pages that aren’t important or aren’t meant for public access. For a CMS like WordPress, the admin pages are automatically blocked from crawlers in the default robots.txt.  
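
As a reference point, a stock WordPress install typically serves a virtual robots.txt along these lines (the exact output can vary with your version and plugins):

  User-agent: * 
  Disallow: /wp-admin/ 
  Allow: /wp-admin/admin-ajax.php 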

Important note: It is not recommended to rely solely on the robots.txt file to control the indexing of pages. As we’ve explained, robots.txt is more of a guideline than a rule.  

Does My Website Need One? 

Not every website absolutely requires a robots.txt file. If you have a smaller website with just a few pages, or if you don’t have any specific content that you need to block from search crawlers, it may not be necessary. 

In many cases, search engines can effectively crawl your site without it. However, if you have large volumes of content, duplicate pages, or areas you’d prefer to keep crawlers out of (like admin panels or staging environments), implementing a robots.txt file can direct search engines to focus on your more important pages and content.   
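
As a rough sketch, a robots.txt covering those cases might look like the following, where /staging/ and /search/ are hypothetical paths for a staging area and duplicate internal search results:

  User-agent: * 
  Disallow: /staging/ 
  Disallow: /search/ 

Keep in mind that robots.txt is publicly readable and only keeps well-behaved crawlers out, so anything genuinely sensitive should also be protected with proper authentication.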
