Multiple AI Companies Ignoring Robots.txt, Disregarding Publisher Requests
ICARO Media Group
Multiple artificial intelligence (AI) companies are bypassing the instructions publishers use to block scraping of their web content, according to a warning issued by content licensing start-up TollBit. The start-up, which aims to facilitate licensing agreements between content-hungry AI companies and publishers, has observed several AI agents disregarding the standard protocol known as robots.txt.
TollBit, an early-stage start-up with 50 websites live as of May, has identified Perplexity as one of the offenders flouting the robots.txt guidelines. According to TollBit's analytics, however, Perplexity is not alone: AI agents from multiple sources are choosing to override the protocol when retrieving content from publishers' sites.
The robots.txt protocol is a widely used tool that lets publishers tell automated crawlers, including those run by AI systems, which parts of their websites may be crawled. Compliance is voluntary: the file states the publisher's wishes but does not technically block access. By deliberately bypassing the protocol, AI agents obtain content that publishers may not want used without proper authorization or compensation.
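As an illustration of how the protocol is meant to work, the following minimal Python sketch shows the check a compliant crawler would run before fetching a page, using the standard library's `urllib.robotparser`. The crawler name `ExampleAIBot`, the `example.com` URLs, and the rules themselves are invented for this example and do not come from TollBit or any publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is an invented crawler
# name; the rules are illustrative, not taken from any real site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The hypothetical AI crawler is asked to stay away from the whole site.
    print(is_allowed("ExampleAIBot", "https://example.com/news/story"))
    # Other crawlers are only asked to avoid /private/.
    print(is_allowed("OtherBot", "https://example.com/news/story"))
```

Nothing forces a crawler to run this check or to honor its result, which is why the behavior TollBit describes is possible.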
TollBit expressed its concern in a letter stating, "What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites. The more publisher logs we ingest, the more this pattern emerges."
The trend has alarmed the News Media Alliance, a trade group representing more than 2,200 publishers based in the United States. The alliance's president warned of the potential ramifications for the industry: "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry."
Beyond this disregard for publisher guidelines, news sites face a second challenge: news summaries generated by AI. Publishers have raised concerns since Google introduced a product last year that uses AI to create summaries in response to certain search queries. To keep their content out of those summaries, publishers must use the same tool that would remove it from Google search results entirely, rendering them nearly invisible on the web.
The impact of these developments on the publishing industry is significant, as publishers continue to struggle to monetize their content and support journalism in an era when AI technologies play an increasingly central role. TollBit's warning highlights a widespread problem that both AI companies and publishers will need to address if they are to establish a fair and sustainable system for content usage online.