AI Companies Bypassing Robots.txt Instructions, Raising Concerns over Content Scraping

ICARO Media Group | News | 22/06/2024 17h41

Several AI companies have come under scrutiny for bypassing robots.txt instructions, according to a report by Reuters. The controversy surrounding Perplexity, a self-described "free AI search engine," intensified after Forbes accused the company of republishing its stories without authorization. Wired further revealed that Perplexity had been scraping content from its website, as well as from other Condé Nast publications, despite directives published under the Robots Exclusion Protocol.
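
For context, the Robots Exclusion Protocol works through a plain-text robots.txt file served from a site's root, in which a publisher lists which crawlers may fetch which paths. A minimal sketch of the kind of directives a publisher might use to opt out of AI crawling is shown below; GPTBot, ClaudeBot, and PerplexityBot are the crawler names the respective companies have published, but any real site's file will differ:

    # Block named AI crawlers from the whole site
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    # All other crawlers may fetch everything
    User-agent: *
    Disallow:

The protocol's central weakness, and the crux of this story, is that nothing technically prevents a crawler from simply ignoring these lines.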

The Shortcut, a technology website, also accused Perplexity of scraping its articles, further highlighting the issue. Now, Reuters reports that other AI companies are also disregarding robots.txt files and engaging in content scraping for training their technologies. TollBit, a startup connecting publishers with AI firms, sent a letter warning publishers of this trend, stating that "AI agents from multiple sources" were bypassing the protocol to retrieve content from websites.

While TollBit did not name specific companies in the letter, Business Insider claims to have learned that OpenAI and Anthropic, the creators of ChatGPT and Claude chatbots respectively, are among those bypassing robots.txt signals. Both companies had previously emphasized their commitment to respecting the instructions provided in the files.
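
Honoring those instructions is technically trivial, which is part of why the reported bypassing has drawn such criticism. As a rough illustration, here is a minimal sketch of a compliant check using Python's standard-library urllib.robotparser; the site and user-agent string are illustrative placeholders, not the crawlers named in the report:

    import urllib.robotparser

    # Hypothetical well-behaved crawler: consult robots.txt before fetching.
    # "ExampleBot" and example.com are placeholders for illustration only.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # download and parse the site's robots.txt

    url = "https://example.com/articles/story.html"
    if parser.can_fetch("ExampleBot", url):
        print("Allowed: the crawler may fetch this page.")
    else:
        print("Disallowed: a compliant crawler must skip this page.")

A crawler that bypasses the protocol simply omits this check, or ignores its result, and fetches the page regardless.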

During Wired's investigation, a machine on an Amazon server attributed to Perplexity was found to be bypassing the robots.txt instructions on its website. To test this, Wired fed the company's tool article headlines and prompts, which reportedly produced closely paraphrased content with minimal attribution. In some instances the tool generated inaccurate summaries, including a false claim that a California police officer had committed a crime.

In response to these allegations, Perplexity CEO Aravind Srinivas denied that the company ignores the Robots Exclusion Protocol or has lied about doing so. He acknowledged, however, that Perplexity relies on third-party web crawlers, including the one identified by Wired, which may not adhere to the protocol. Srinivas defended the company's practices and noted that the Robots Exclusion Protocol is not a legal framework, suggesting that publishers and AI companies need to reach a new understanding.

Regarding the inaccurate summaries generated by Perplexity's tool, Srinivas acknowledged that such mistakes occur, stating, "We have never said that we have never hallucinated." The admission reinforces concerns about the reliability and accountability of AI-generated content.

As the debate continues, it underscores that compliance with the Robots Exclusion Protocol is entirely voluntary: robots.txt is an honor-system convention, not an enforcement mechanism. With AI companies bypassing its instructions, there is a growing need for clearer guidelines and standards to protect content creators and ensure responsible use of AI technologies.

The implications of these actions by AI companies extend beyond intellectual property concerns, impacting the trust and credibility of AI-driven content. As technology rapidly advances, it becomes increasingly crucial to strike a balance between innovation and ethical practices in the AI industry.

The views expressed in this article do not reflect the opinion of ICARO, or any of its affiliates.
