Amazon Web Services Investigates Perplexity AI Over Alleged Robots Exclusion Protocol Violations

ICARO Media Group
28/06/2024 22h36

In response to recent allegations, Amazon Web Services (AWS) has launched an investigation into Perplexity AI to determine whether the company's crawler is bypassing the Robots Exclusion Protocol, as reported by Wired.

The Robots Exclusion Protocol is a web standard in which site operators place a robots.txt file on a domain to tell bots which pages they may or may not access. While compliance with these instructions is voluntary, reputable companies generally respect them. However, Wired previously reported that Perplexity's crawler was bypassing these instructions.
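As a rough illustration of how a well-behaved crawler might consult these instructions, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, the "ExampleBot" user agent, and the may_fetch helper are hypothetical and chosen only for illustration; they are not taken from any site or crawler named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Parse the rules from text instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("PerplexityBot", "https://example.com/articles/story"),
        ("ExampleBot", "https://example.com/articles/story"),
        ("ExampleBot", "https://example.com/private/draft"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

Nothing in the protocol enforces the answer: a crawler that never performs this check, or ignores its result, can still request the page, which is why compliance is described as voluntary.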

Wired also found that a virtual machine hosted on an AWS server using the IP address 44.221.181.252, which Wired identified as "certainly operated by Perplexity," bypassed the robots.txt instructions on multiple websites, including Condé Nast properties, The Guardian, Forbes, and The New York Times.

To verify Perplexity's actions, Wired tested the company's chatbot by entering article headlines or descriptions, and found that it produced results closely paraphrasing their articles "with minimal attribution."

Reuters also reported that Perplexity is not the only AI company bypassing robots.txt files to gather content for training language models. However, AWS's investigation appears to be focused specifically on Perplexity AI.

An AWS spokesperson stated that customers must comply with robots.txt instructions when crawling websites. They highlighted that AWS's terms of service strictly prohibit illegal activities, and customers are responsible for adhering to these terms and applicable laws.

In response to Amazon's investigation, Perplexity spokesperson Sara Platnick denied that the company's crawlers were bypassing the Robots Exclusion Protocol. Platnick said that PerplexityBot, which runs on AWS, respects robots.txt instructions and that Perplexity-controlled services do not violate the AWS Terms of Service.

Platnick said that Amazon's inquiry into Wired's report was part of its standard protocol for investigating allegations of resource abuse. However, she acknowledged that PerplexityBot does ignore robots.txt when a user includes a specific URL in a chatbot query.

Perplexity CEO Aravind Srinivas previously denied allegations that the company ignored the Robots Exclusion Protocol and made misleading statements. However, he acknowledged that Perplexity uses third-party web crawlers in addition to its own, and that the bot identified by Wired was one of them.

As of the latest update, there has been no official communication from Amazon to Perplexity regarding the investigation, apart from Wired's media inquiry.

Overall, this investigation sheds light on the challenges faced by AI companies in navigating web protocols and the importance of respecting the Robots Exclusion Protocol to ensure ethical usage of resources.

The views expressed in this article do not reflect the opinion of ICARO, or any of its affiliates.
