Reddit Sues Perplexity Over Alleged Large-scale Data Scraping

Reddit has filed a lawsuit against the answer-engine company Perplexity and three data-scraping service providers: SerpApi, Oxylabs, and AWM Proxy. The legal action seeks to halt what Reddit’s complaint describes as the unlawful, industrial-scale circumvention of its data protections.

The complaint alleges that Perplexity is a customer of at least one of these data-scraping firms. Reddit uses a metaphor to describe the alleged activity, comparing the providers to “would-be bank robbers” who, unable to access the company’s data “vault” directly, instead target the “armored truck” carrying the information. This implies the defendants are accessing Reddit’s content through indirect channels. The lawsuit asserts Perplexity is choosing to acquire data through these means rather than pursuing a direct licensing agreement, a path some of its competitors have taken.

According to the court filing, Reddit issued a cease-and-desist letter to Perplexity in May 2024, demanding it stop scraping data from the platform. Following the delivery of this letter, the volume of citations from Reddit appearing on Perplexity’s service reportedly increased. To further investigate, Reddit created a post on its platform that was configured to be crawlable only by Google. The company states that “within hours,” Perplexity’s answer engine “produced the contents” of this specific post. Reddit contends the only way Perplexity could have acquired this content was if it, or its co-defendants, scraped Google’s search results for Reddit content and rapidly integrated it into its system.

The platform’s user-generated content, which consists of posts written and ranked by humans across a vast array of subjects, has become a valuable resource for training artificial intelligence models. In 2023, Reddit implemented API changes that led to user protests; the company positioned these changes as a way to ensure it was compensated for the use of its data by AI developers. Since then, Reddit has secured data-licensing deals with companies including OpenAI and Google and is reportedly seeking additional arrangements. This is not Reddit’s first legal challenge in this area; it previously sued Anthropic, alleging that its bots continued to access the site after the company had stated otherwise.

Ben Lee, Reddit’s chief legal officer, described the situation as an “industrial-scale ‘data laundering’ economy” fueled by an AI “arms race for quality human content.” He stated, “Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.” Lee identified the co-defendants Oxylabs UAB, AWM Proxy, and SerpApi as “textbook examples of this illegal behavior,” describing them as an obscure Lithuanian scraper, a former Russian botnet, and a company that advertises questionable tactics. He added, “Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search.”

In response to the lawsuit, Perplexity’s head of communications, Jesse Dwyer, said the company had not yet received the legal filing. Dwyer told The Verge, “we will always fight vigorously for users’ rights to freely and fairly access public knowledge.” He added, “Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest.”
