Three days ago, Drew DeVault – founder and CEO of SourceHut – published a blogpost called “Please stop externalizing your costs directly into my face”, in which he complained that LLM companies were crawling data without respecting robots.txt and were causing severe outages at SourceHut.

I went, “Interesting!”, and moved on.
Then, yesterday morning, KDE GitLab infrastructure was overwhelmed by another AI crawler, with IPs from an Alibaba range; this made GitLab temporarily inaccessible to KDE developers.

I then discovered that, one week ago, an anime girl had started appearing on the GNOME GitLab instance whenever a page was loaded. It turns out to be the default loading page for Anubis, a proof-of-work challenge that blocks the AI scrapers causing these outages.

By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and – since FOSS software relies on public collaboration, whereas private companies don’t have that requirement – this is putting some extra burden on Open Source communities.
So let’s try to get more details – going back to Drew’s blogpost. According to Drew, LLM crawlers don’t respect robots.txt and hit expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
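That pattern is easy to describe but hard to filter: no single address stands out, so the abuse only becomes visible in aggregate. As a rough illustration (not taken from Drew’s post; the log path, endpoint prefixes and log format are all assumptions), a log-analysis sketch like the one below is about the only way to even see it:

```python
# Rough sketch: spotting the "one request per IP, random User-Agent" pattern
# in a combined-format access log. The log path, endpoint prefixes and
# thresholds are illustrative assumptions, not taken from Drew's post.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"        # assumed location
EXPENSIVE = ("/blame/", "/log/", "/commit/")  # assumed expensive endpoints

# IP, request path and User-Agent from a combined-log-format line.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits_per_ip = Counter()
expensive_ips = set()
user_agents = set()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        hits_per_ip[ip] += 1
        user_agents.add(ua)
        if any(prefix in path for prefix in EXPENSIVE):
            expensive_ips.add(ip)

single_hit = [ip for ip in expensive_ips if hits_per_ip[ip] == 1]
print(f"{len(expensive_ips)} IPs touched expensive endpoints")
print(f"{len(single_hit)} of them made exactly one request in total")
print(f"{len(user_agents)} distinct User-Agent strings seen")
```

If the bots really do make one request per IP, the second number will sit close to the first, and the count of distinct User-Agent strings will be suspiciously large.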

Because of this, it’s hard to come up with a good set of mitigations. Drew says that several high-priority tasks have been delayed for weeks or months due to these interruptions, that users have occasionally been affected (because it’s hard to distinguish bots from humans), and – of course – that SourceHut suffers occasional outages.

Drew does not distinguish here which AI companies are more or less respectful of robots.txt files, or more accurate in their user agent reporting; we’ll look more into that later.
Finally, Drew points out that this is not some isolated issue. He says,
“All of my sysadmin friends are dealing with the same problems, [and] every time I sit down for beers or dinner to socialize with sysadmin friends it’s not long before we’re complaining about the bots. […] The desperation in these conversations is palpable.”

Which brings me back to yesterday’s KDE GitLab issues. According to Ben, part of the KDE sysadmin team, all of the IPs performing this DDoS claimed to be MS Edge and appeared to come from Chinese AI companies; he mentions that Western LLM operators, such as OpenAI and Anthropic, were at least setting a proper UA – again, more on this later.

The solution – for now – was to ban the version of Edge that the bots were claiming to be, though it’s hard to believe that this will be a definitive solution; these bots do seem keen on changing user agents to try to blend in as much as possible.
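For illustration, a user-agent ban is about as simple as a mitigation gets. Below is a minimal sketch of the idea as WSGI middleware; the blocked substring is a placeholder, since the post doesn’t say exactly which Edge version KDE banned, and in practice this kind of filter usually lives in the reverse proxy (nginx, HAProxy, Varnish) rather than in the application.

```python
# Minimal sketch of a User-Agent ban as WSGI middleware. The blocked
# substring is a placeholder: the post does not name the exact Edge
# version that KDE's sysadmins banned.
BLOCKED_UA_SUBSTRINGS = ("Edg/1",)  # placeholder, not the real banned version

def ua_ban(app):
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapped
```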
Indeed, GNOME has been experiencing issues since last November; as a temporary measure they had rate-limited non-logged-in users’ access to merge requests and commits, which obviously also caused problems for real human visitors.
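That kind of stopgap can be as simple as a per-client counter over a time window. Here is a rough sketch in the same middleware style as above – not GNOME’s actual configuration (GitLab ships its own rate limiting), and the cookie name, limit and window are all illustrative:

```python
# Rough sketch of per-IP rate limiting for anonymous visitors, again as WSGI
# middleware. Not GNOME's actual setup (GitLab ships its own rate limiting);
# the cookie name, limit and window are illustrative.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_ANON_REQUESTS = 30     # illustrative limit for non-logged-in clients
_hits = defaultdict(list)  # ip -> timestamps of recent requests

def anon_rate_limit(app):
    def wrapped(environ, start_response):
        # Hypothetical session cookie name used to recognise logged-in users.
        if "session_id=" in environ.get("HTTP_COOKIE", ""):
            return app(environ, start_response)
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        recent = [t for t in _hits[ip] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _hits[ip] = recent
        if len(recent) > MAX_ANON_REQUESTS:
            start_response("429 Too Many Requests",
                           [("Retry-After", str(WINDOW_SECONDS))])
            return [b"Too many requests\n"]
        return app(environ, start_response)
    return wrapped
```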

The solution they eventually settled on was switching to Anubis. This is a page that presents a challenge to the browser, which then has to spend some time doing math and present the solution back to the server. If the answer is right, you get access to the website.
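To make the mechanism concrete, here is a minimal hashcash-style sketch – not Anubis’s actual code, just the general shape: the server hands out a random challenge and a difficulty, the client grinds hashes until it finds a nonce whose digest has enough leading zero bits, and the server verifies the answer with a single hash.

```python
# Hashcash-style proof of work, as a rough sketch of what Anubis-like
# challenges do. This is not Anubis's actual implementation; the difficulty
# and encoding choices here are illustrative.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> bytes:
    # Server side: a random challenge the client must extend.
    return os.urandom(16)

def solve(challenge: bytes, difficulty_bits: int) -> int:
    # Client side: brute-force a nonce; expected work is ~2**difficulty_bits hashes.
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Server side: a single hash to check the client's answer.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty_bits

challenge = issue_challenge()
nonce = solve(challenge, difficulty_bits=16)   # ~65k hashes on average
assert verify(challenge, nonce, difficulty_bits=16)
```

The asymmetry is the point: verification costs the server one hash, while producing the answer costs the client thousands or millions of hashes – negligible for a single human page load, expensive at crawler scale.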

According to the developer, this project is “a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand. I hate that I have to do this, but this is what we get for the modern Internet because bots don’t conform to standards like robots.txt, even when they claim to”.

However, this is also causing issues for users. When a lot of people open a link from the same place, they might get served a higher-difficulty challenge that takes a while to complete; one user reported a one-minute delay, and another – on his phone – had to wait around two minutes.
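Those delays are roughly what the arithmetic predicts: in leading-zero-bit schemes like the sketch above, each extra bit of difficulty doubles the expected number of hashes, and phones hash noticeably slower than desktops. The hash rates below are illustrative guesses, not measurements of Anubis or of any real device:

```python
# Back-of-the-envelope: expected hashes and solve time per difficulty level.
# Hash rates are illustrative guesses, not benchmarks of any real device.
HASH_RATES = {"desktop (guess)": 2_000_000, "phone (guess)": 200_000}  # hashes/s

for bits in (16, 20, 24):
    expected_hashes = 2 ** bits  # each extra bit doubles the expected work
    times = ", ".join(f"{name}: ~{expected_hashes / rate:.1f}s"
                      for name, rate in HASH_RATES.items())
    print(f"{bits} bits -> ~{expected_hashes:,} expected hashes ({times})")
```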

Why? Wel
29 Comments
totetsu
Could one put a mangler on the responses to suspected bots to poison their data sets with nonsense code.. :/
ericholscher
Yep — our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse… (quoted in the OP) — everyone I know who is running large internet infrastructure has a similar story — this post does a great job of rounding a bunch of them up in 1 place.
I called it when I wrote it, they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 — an engineer at the company saw our post and reached out, giving me the right email — which I then emailed 3x and never got a reply.
nonrandomstring
These are DDOS attacks and should be treated in law as such.
(Although I do realise that in many countries now we no longer have any effective "rule of law")
WesolyKubeczek
…at some point, some people started appreciating mailing lists and the distributed nature of Git again.
roenxi
We're close to finding a clear use-case for Bitcoin with this one.
KolmogorovComp
My question is, can we serve PoW challenges to these AI LLM scrapers that can be profitable?
brandonmenc
This article starts by citing a blog article – displays a screenshot of the article – but doesn't link to it.
wiredfool
Across my sites — mostly open data sites — the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.
At this point, I think we're well under 1% actual users on a good day.
megadata
Perhaps time to start a central community ban pool for IP ranges?
sir-alien
It's going to get to the point where everything will be put behind a login to prevent LLM scrapers scanning a site. Annoying but the only option I can think of. If they use an account for scraping you just ban the account.
napolux
In the past month, I've had to block LLM bots attacking my poor little VPSs — not once, but twice.
First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.
What's worse? They completely ignore a simple HTTP 429 status.
briandear
Does GitHub have this problem?
xena
It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.
aspir
Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been for 10+ years https://www.fastly.com/fast-forward (disclaimer, I work for Fastly and help with this program)
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
keepamovin
Adaptation – adapt or die. Find a business model that can sustain, without the naivety that people will pay for what they can take without consequence.
brushfoot
At this rate, it's more than FOSS infrastructure — although that's a canary in the coalmine I especially sympathize with — it's anonymous Internet access altogether.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article — or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent?
marginalia_nu
I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full fledged chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.
It's a shitty solution, but as it stands the status quo is quite untenable and will eventually have cloudflare as a spooky MITM for all the web's traffic.
FeepingCreature
To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.
lelanthran
The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the knowledge in current LLMs dries up because no one can scrape the web in an automated fashion anymore.
It'll all burn down.
immibis
This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.
QuadrupleA
Isn't this just poor, sloppy crawler implementation? You shouldn't need to fetch a repo more than once to add it to a training set.
rglullis
The days of an open web are long gone. Every server will eventually have to require authentication for access, and to get an account you will have to provide some form of payment or social proof.
Honestly, I don't see it necessarily as a bad thing.
harhargange
Could IPFS or torrents and large local databases, decentralised across people, be a solution to this? I personally have the resources to share and host TBs of data but haven't found a good use for them.
diggan
> According to Drew, LLM crawlers don't respect robots.txt and hit expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
How do they know that these are LLM crawlers and not anything else?
nzeid
We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.
biophysboy
Can someone with more experience developing AI tools explain what these bots are mostly doing? Are they collecting data for training, or are they for the more recent search functionality? Or are they enhancing responses with links?
piokoch
Copyright the content and sue those who use it for AI training. I believe there is a lot of low-hanging fruit for lawyers here. I would be surprised if they weren't preparing to hit OpenAI and the like. Very badly. Google got away with its deep-linking issues because publishers, after all, had some interest in being linked from the search engine; here publishers see zero value.
phkahler
So I'll just float an idea again that always gets rejected here. This is yet another problem that could be solved completely by… Eliminating anonymity by default on the internet.
To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt-in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDOS to spam, to malware infections to personal attacks and threats will be reduced when anonymity is removed.
Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
throwaway173738
Does anyone maintain a list of ai products developed without scraping?