A discussion in San Francisco about “ethical” AI providers has highlighted the growing tension between AI companies and website publishers.
The debate centers on how AI companies harvest web data to train their models and power their chatbots, often without sending traffic back to the original content sources.
Measuring the imbalance with a crawl-to-refer ratio
For years, the web has operated on an unwritten agreement: websites allow search engine bots to crawl their content in exchange for referral traffic, which drives users and revenue. Generative AI chatbots disrupt this model by providing direct answers, reducing the need for users to visit the source website.
To quantify this shift, Cloudflare, which serves about 20% of the world’s websites, has started tracking a “crawl-to-refer ratio.” This metric compares how many times a company’s bots access a website for data against the number of human visitors the company refers back to that site. A high ratio indicates that a company is taking far more data than it returns in referral traffic.
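In rough terms, the metric is simply bot requests divided by the human visits a company sends back. The sketch below uses invented company names and counts purely for illustration; Cloudflare has not published the details of its actual counting methodology.

```python
# Illustrative only: computing a crawl-to-refer ratio from hypothetical
# per-company counts. All names and numbers are invented for demonstration.

traffic = {
    "bot_heavy_ai_co": {"crawls": 38_000, "referrals": 100},    # ~380:1
    "search_engine_co": {"crawls": 9_000, "referrals": 1_000},  # ~9:1
}

for company, counts in traffic.items():
    # crawls: requests made by the company's bots to the site
    # referrals: human visitors the company sent back to the site
    ratio = counts["crawls"] / counts["referrals"]
    print(f"{company}: crawl-to-refer ratio = {ratio:,.1f}:1")
```

A higher quotient means more crawling per visitor returned; the 9:1 figure reported for Google below, for example, corresponds to nine bot requests for every referred visit.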
How different AI companies compare
Data from the first week of September revealed significant differences between companies. Anthropic, the maker of the Claude chatbot, showed a particularly high crawl-to-refer ratio. In response to the findings, Anthropic said it could not confirm Cloudflare’s figures and noted that a new web search feature launched earlier this year is generating a rapidly growing amount of referral traffic. OpenAI did not respond to requests for comment. Perplexity, another AI answer engine, provided a detailed statement:

“In the case of public content, publishers can choose not to make their content public. In the case of facts, copyright law, as you know, has always drawn a line between facts and expression. That’s a foundation of human inquiry itself.”
A methodological note states that these ratios track only web activity and exclude traffic from native apps; if app-based referrals were counted, the reported ratios could be lower. The methodology is, however, applied consistently to all companies.
The impact on website owners and Google’s changing role
This large-scale data collection carries direct costs for website owners. A Business Insider report from about a year ago noted that crawling by Anthropic and OpenAI bots was driving significant increases in bandwidth and hosting costs for some sites, with one developer reporting that a client’s cloud-computing bills had doubled.
Google’s crawl-to-refer ratio is currently lower than that of many AI-first companies, largely because its traditional search results still link out to websites. However, as Google integrates more direct AI answers through features like AI Overviews, its ratio has fluctuated: Cloudflare data showed it rising from 3.3:1 in January to 18:1 in April before settling at 9:1 in July. Google has stated that it remains committed to sending traffic to the web.