The boom of generative AI products over the past few months has prompted many websites to take countermeasures.
The basic concern goes like this:
AI products depend on consuming large volumes of content to train their language models (the so-called large language models, or LLMs for short), and this content has to come from somewhere. AI companies see the openness of the web as permitting large-scale crawling to obtain training data, but some website operators disagree, including Reddit, Stack Overflow and Twitter.
The answer to this interesting question (does the openness of the web really permit this kind of large-scale crawling?) will no doubt be litigated in courts around the world.
This article will explore this question, focusing on the business and technical aspects. But before we dive in, a few points:
- Although this topic touches on, and I include in this article, some legal arguments, I am not a lawyer, I am not your lawyer, and I am not giving you any advice of any sort. Talk to your favorite lawyer cat if you need legal advice.
- I used to work at Google many years ago, mostly in web search. I do not speak on behalf of Google in any way, shape or form, even when I cite some Google examples below.
- This is a fast-moving topic. It is guaranteed that between the time I’ve finished writing this and the time you read it, something major will have happened in the industry, and it’s guaranteed I will have missed something!
The ‘deal’ between search engines and websites
We begin with how a modern search engine, like Google or Bing, works. In overly simplified terms, a search engine works like this:
- The search engine has a list of URLs. Each URL has metadata (sometimes called “signals”) that indicate the URL may be important or useful to show in the search engine’s results pages.
- Based on these signals, the search engine runs a crawler, a bot: a program that fetches these URLs in some order of “importance” indicated by the signals. Google’s crawler for this purpose is called Googlebot and Bing’s is Bingbot (and both have many more for other purposes, like ads). Both bots identify themselves in the user-agent header, and both can be verified programmatically by websites to be sure that the content is being served to the real search engine bot and not a spoof (see the sketch after this list).
- Once the content is fetched, it is indexed. Search engine indices are complicated databases that contain the page content along with a huge amount of metadata and other signals used to match and rank the content to user queries. An index is what actually gets searched when you type in a query in Google or Bing.
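To make that verification step concrete, here is a minimal Python sketch of the reverse-then-forward DNS check that Google and Bing describe for confirming their crawlers. The function name, bot names, and hostname suffixes are illustrative; a real implementation would cache lookups and check the current documentation.

```python
import socket

# Hostname suffixes that the real crawlers resolve to, per the search
# engines' own documentation (illustrative; check the current docs).
CRAWLER_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def is_genuine_crawler(client_ip: str, claimed_bot: str) -> bool:
    """Check that an IP claiming to be a search engine bot really is one.

    Step 1: reverse DNS on the connecting IP must land in the bot's domain.
    Step 2: forward DNS on that hostname must resolve back to the same IP,
            so a spoofed reverse-DNS record is not enough.
    """
    suffixes = CRAWLER_DOMAINS.get(claimed_bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)    # reverse lookup
        if not hostname.endswith(suffixes):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
        return client_ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False
```

The forward lookup is the important part: the owner of an IP address can point its reverse DNS at a name that looks like googlebot.com, but they cannot make Google’s own DNS resolve that name back to their address.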
Modern search engines, the good polite ones at least, give the website operator full control over the crawling and indexing.
The Robots Exclusion Protocol is how this control is implemented, through the robots.txt file and through meta tags or headers on the web page itself. These search engines voluntarily obey the Protocol, treating a website’s implementation of it as a directive, an absolute command, not a mere hint.
Importantly, the default position of the Protocol is that all crawling and indexing are allowed – it is permissive by default. Unless the website operator actively takes steps to implement exclusion, the website is deemed to allow crawling and indexing.
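As a concrete illustration of that permissive default, here is a small Python sketch using the standard library’s urllib.robotparser against a hypothetical robots.txt (the bot names are made up): the one crawler that is explicitly disallowed is blocked, while a crawler the file never mentions is allowed.

```python
from urllib import robotparser

# A hypothetical robots.txt that singles out one crawler and says
# nothing about anyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT)

# The explicitly excluded bot is blocked...
print(parser.can_fetch("ExampleBot", "https://example.com/article"))    # False

# ...but a crawler the file never mentions is allowed by default.
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

That default is why doing nothing amounts to opting in: exclusion only happens when the website operator explicitly asks for it.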
This gives us the basic framework of the deal between the search engines and websites: By default, a website will be crawled and indexed by a search engine, which, in turn, points searchers directly to the original website in its search results for relevant queries.
This deal is fundamentally an economic exchange: the costs of producing, hosting, and serving the content are incurred by the website, but the idea is that the traffic it gets in return pays that back with a profit.
Note: I’m intentionally ignoring a whole slew of related arguments here, about who has more power in this exchange, who makes more money, fairness, and much more. I’m not belittling these – I just don’t want to distract from the core topic of this article.
This indexing-for-traffic approach comes up elsewhere, for example when search engines are allowed to index content behind a paywall. It’s the same idea: the website shares content in return for being shown in search results that point searchers back to the website directly.
And at each step of this deal, if the publisher wants to block all or some crawling or indexing, the publisher has several tools under the Robots Exclusion Protocol. Anything still allowed to be crawled and indexed is allowed because the website gets a direct benefit from being shown in the search results.
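To show what the page-level tools look like in practice, here is a minimal Python sketch (it assumes the third-party requests library and a made-up bot name) of how a polite indexer might honor a noindex signal delivered either as an X-Robots-Tag response header or as a robots meta tag. A production system would use a proper HTML parser and handle bot-specific directives, but the shape of the check is the same.

```python
import re
import requests  # third-party: pip install requests

def page_allows_indexing(url: str, bot_name: str = "ExampleBot") -> bool:
    """Return False if the page opts out of indexing at the page level."""
    resp = requests.get(url, headers={"User-Agent": bot_name}, timeout=10)

    # Signal 1: an HTTP response header, e.g.  X-Robots-Tag: noindex
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return False

    # Signal 2: a meta tag in the HTML, e.g.  <meta name="robots" content="noindex">
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*content=["\']([^"\']*)["\']',
        resp.text,
        flags=re.IGNORECASE,
    )
    if meta and "noindex" in meta.group(1).lower():
        return False

    return True
```

Together with robots.txt, which governs crawling, these signals give the publisher a lever at each step: keep the bot out entirely, let it crawl but not index, or allow both.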
This argument in some form has actually been used in courts, in what has come to be known as the “robots.txt defense”, and it has basically held up; see this short list of court cases, many involving Google, and this write-up from 2007 that’s not entirely happy about it.
LLMs are not search engines
It should now be very clear that an LLM