
Google open sources its robots.txt parser to make Robots Exclusion Protocol an official internet standard

  • 3 min read
  • 02 Jul 2019


Yesterday, Google announced that it has teamed up with Martijn Koster, the creator of the Robots Exclusion Protocol (REP), and other webmasters to make the 25-year-old protocol an internet standard. The REP, better known as robots.txt, has now been submitted to the Internet Engineering Task Force (IETF). Google has also open sourced its robots.txt parser and matcher as a C++ library.

https://twitter.com/googlewmc/status/1145634145261051906

REP was created back in 1994 by Martijn Koster, a software engineer known for his contributions to internet search. Since its inception, it has been widely adopted by websites to indicate whether web crawlers and other automatic clients are allowed to access a site.

When an automatic client wants to visit a website, it first checks the site's robots.txt file, which may contain rules like this:

User-agent: *

Disallow: /

The User-agent: * line means that these rules apply to all robots, and Disallow: / means that robots are not allowed to visit any page of the site.
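
As a rough illustration, the snippet below uses Google's newly open-sourced C++ parser to test a URL against the rules above. The robots.h header, googlebot namespace, and OneAgentAllowedByRobots method follow the library's README at the time of writing; treat the exact API as an assumption and check the repository before relying on it.

// Minimal sketch: checking a URL against the robots.txt rules above with
// Google's open-sourced parser. Class and method names follow the library's
// README and may change; verify against the repository before building.
#include <iostream>
#include <string>

#include "robots.h"  // provided by the google/robotstxt library

int main() {
  // The rules shown above: every robot is disallowed everywhere.
  const std::string robots_txt =
      "User-agent: *\n"
      "Disallow: /\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robots_txt, "ExampleBot", "https://example.com/some/page");

  std::cout << (allowed ? "allowed" : "disallowed") << "\n";  // prints "disallowed"
  return 0;
}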

Despite being widely used on the web, REP has never been an internet standard. With no rules set in stone, developers have interpreted the “ambiguous de-facto protocol” differently over the years, and it has not been updated since its creation to address modern corner cases. The proposed draft is a standardized and extended version of REP that gives publishers fine-grained control over what they would like to be crawled on their site and potentially shown to interested users.

The following are some of the important updates in the proposed REP:

  • It is no longer limited to HTTP and can be used by any URI-based transfer protocol, for instance, FTP or CoAP.
  • Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not kept open for too long, avoiding unnecessary strain on servers.
  • It defines a maximum caching time of 24 hours, after which crawlers must fetch a fresh copy of robots.txt. This gives website owners the flexibility to update their robots.txt whenever they want, while keeping crawlers from overloading servers with robots.txt requests (a rough crawler-side sketch of these limits follows this list).
  • It also defines a provision for cases when a previously accessible robots.txt file becomes inaccessible because of server failures. In such cases, pages that are known to be disallowed are not crawled for a reasonably long period of time.
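
To make the crawler-side implications concrete, here is a minimal, hypothetical sketch of how a crawler might apply two of the draft's rules: parsing at most the first 500 kibibytes of the file and refreshing a cached copy after 24 hours. The helper names (CachedRobots, TruncateToParseLimit, ShouldRefetch) are illustrative only and are not part of Google's library or the draft.

// Illustrative sketch only: applying the draft's 500 KiB parse limit and
// 24-hour cache lifetime on the crawler side. All names here are hypothetical.
#include <chrono>
#include <cstddef>
#include <string>

constexpr std::size_t kMaxParsedBytes = 500 * 1024;  // parse at most 500 kibibytes
constexpr std::chrono::hours kMaxCacheAge{24};       // refresh the cache after 24 hours

struct CachedRobots {
  std::string body;                                  // the parsed portion of robots.txt
  std::chrono::system_clock::time_point fetched_at;  // when it was fetched
};

// Keep only the first 500 KiB; anything beyond that need not be parsed.
std::string TruncateToParseLimit(const std::string& raw_body) {
  return raw_body.substr(0, kMaxParsedBytes);
}

// After 24 hours the cached copy should be refreshed before it is reused.
bool ShouldRefetch(const CachedRobots& cached,
                   std::chrono::system_clock::time_point now) {
  return (now - cached.fetched_at) > kMaxCacheAge;
}

int main() {
  // A copy fetched 30 hours ago is stale under the 24-hour rule.
  const CachedRobots cached{
      TruncateToParseLimit("User-agent: *\nDisallow: /\n"),
      std::chrono::system_clock::now() - std::chrono::hours(30)};
  return ShouldRefetch(cached, std::chrono::system_clock::now()) ? 0 : 1;
}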


The updated REP is currently in the draft stage, and Google is seeking feedback from developers. The company wrote, “we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.”

To learn more, check out the official announcement by Google and the proposed REP draft.
