Three days ago, Drew DeVault – founder and CEO of SourceHut – published a blogpost called “Please stop externalizing your costs directly into my face”, in which he complained that LLM companies were crawling data without respecting robots.txt and were causing severe outages at SourceHut.

I went, “Interesting!”, and moved on.
Then, yesterday morning, KDE GitLab infrastructure was overwhelmed by another AI crawler, with IPs from an Alibaba range; this made GitLab temporarily inaccessible to KDE developers.

I then discovered that, one week ago, an anime girl had started appearing on the GNOME GitLab instance whenever a page was loaded. It turns out to be the default loading page for Anubis, a proof-of-work challenge that blocks the AI scrapers causing these outages.

By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and – since FOSS software relies on public collaboration, whereas private companies don’t have that requirement – this is putting some extra burden on Open Source communities.
So let’s try to get more details – going back to Drew’s blogpost. According to Drew, LLM crawlers don’t respect robots.txt and hit expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
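That pattern is easy to describe but hard to filter: no single address stands out, so the abuse only becomes visible in aggregate. As a rough illustration (not taken from Drew’s post; the log path, endpoint prefixes and log format are all assumptions), a log-analysis sketch like the one below is about the only way to even see it:

```python
# Rough sketch: spotting the "one request per IP, random User-Agent" pattern
# in a combined-format access log. The log path, endpoint prefixes and
# thresholds are illustrative assumptions, not taken from Drew's post.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"        # assumed location
EXPENSIVE = ("/blame/", "/log/", "/commit/")  # assumed expensive endpoints

# IP, request path and User-Agent from a combined-log-format line.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits_per_ip = Counter()
expensive_ips = set()
user_agents = set()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        hits_per_ip[ip] += 1
        user_agents.add(ua)
        if any(prefix in path for prefix in EXPENSIVE):
            expensive_ips.add(ip)

single_hit = [ip for ip in expensive_ips if hits_per_ip[ip] == 1]
print(f"{len(expensive_ips)} IPs touched expensive endpoints")
print(f"{len(single_hit)} of them made exactly one request in total")
print(f"{len(user_agents)} distinct User-Agent strings seen")
```

If the bots really do make one request per IP, the second number will sit close to the first, and the count of distinct User-Agent strings will be suspiciously large.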

Because of this, it’s hard to come up with a good set of mitigations. Drew says that several high-priority tasks have been delayed for weeks or months due to these interruptions, that users have occasionally been affected (because it’s hard to distinguish bots from humans), and – of course – that SourceHut suffers occasional outages.

Drew does not distinguish here which AI companies are more or less respectful of robots.txt files, or more accurate in their user agent reporting; we’ll look more into that later.
Finally, Drew points out that this is not some isolated issue. He says,
“All of my sysadmin friends are dealing with the same problems, [and] every time I sit down for beers or dinner to socialize with sysadmin friends it’s not long before we’re complaining about the bots. […] The desperation in these conversations is palpable.”

Which brings me back to yesterday’s KDE GitLab issues. According to Ben, part of the KDE sysadmin team, all of the IPs performing this DDoS claimed to be MS Edge and appeared to come from Chinese AI companies; he mentions that Western LLM operators, such as OpenAI and Anthropic, were at least setting a proper UA – again, more on this later.

The solution – for now – was to ban the version of Edge that the bots were claiming to be, though it’s hard to believe that this will be a definitive solution; these bots do seem keen on changing user agents to try to blend in as much as possible.
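For illustration, a user-agent ban is about as simple as a mitigation gets. Below is a minimal sketch of the idea as WSGI middleware; the blocked substring is a placeholder, since the post doesn’t say exactly which Edge version KDE banned, and in practice this kind of filter usually lives in the reverse proxy (nginx, HAProxy, Varnish) rather than in the application.

```python
# Minimal sketch of a User-Agent ban as WSGI middleware. The blocked
# substring is a placeholder: the post does not name the exact Edge
# version that KDE's sysadmins banned.
BLOCKED_UA_SUBSTRINGS = ("Edg/1",)  # placeholder, not the real banned version

def ua_ban(app):
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapped
```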
Indeed, GNOME has been experiencing issues since last November; as a temporary measure they had rate-limited non-logged-in users’ access to merge requests and commits, which obviously also caused problems for real human visitors.
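That kind of stopgap can be as simple as a per-client counter over a time window. Here is a rough sketch in the same middleware style as above – not GNOME’s actual configuration (GitLab ships its own rate limiting), and the cookie name, limit and window are all illustrative:

```python
# Rough sketch of per-IP rate limiting for anonymous visitors, again as WSGI
# middleware. Not GNOME's actual setup (GitLab ships its own rate limiting);
# the cookie name, limit and window are illustrative.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_ANON_REQUESTS = 30     # illustrative limit for non-logged-in clients
_hits = defaultdict(list)  # ip -> timestamps of recent requests

def anon_rate_limit(app):
    def wrapped(environ, start_response):
        # Hypothetical session cookie name used to recognise logged-in users.
        if "session_id=" in environ.get("HTTP_COOKIE", ""):
            return app(environ, start_response)
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        recent = [t for t in _hits[ip] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _hits[ip] = recent
        if len(recent) > MAX_ANON_REQUESTS:
            start_response("429 Too Many Requests",
                           [("Retry-After", str(WINDOW_SECONDS))])
            return [b"Too many requests\n"]
        return app(environ, start_response)
    return wrapped
```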

The solution they eventually settled on was switching to Anubis. This is a page that presents a challenge to the browser, which then has to spend some time doing math and present the solution back to the server. If the answer is right, you get access to the website.
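To make the mechanism concrete, here is a minimal hashcash-style sketch – not Anubis’s actual code, just the general shape: the server hands out a random challenge and a difficulty, the client grinds hashes until it finds a nonce whose digest has enough leading zero bits, and the server verifies the answer with a single hash.

```python
# Hashcash-style proof of work, as a rough sketch of what Anubis-like
# challenges do. This is not Anubis's actual implementation; the difficulty
# and encoding choices here are illustrative.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> bytes:
    # Server side: a random challenge the client must extend.
    return os.urandom(16)

def solve(challenge: bytes, difficulty_bits: int) -> int:
    # Client side: brute-force a nonce; expected work is ~2**difficulty_bits hashes.
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Server side: a single hash to check the client's answer.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty_bits

challenge = issue_challenge()
nonce = solve(challenge, difficulty_bits=16)   # ~65k hashes on average
assert verify(challenge, nonce, difficulty_bits=16)
```

The asymmetry is the point: verification costs the server one hash, while producing the answer costs the client thousands or millions of hashes – negligible for a single human page load, expensive at crawler scale.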

According to the developer, this project is “a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand. I hate that I have to do this, but this is what we get for the modern Internet because bots don’t conform to standards like robots.txt, even when they claim to”.

However, this is also causing issues for users. When a lot of people open a link from the same place, they might get served a higher-difficulty challenge that takes a while to complete; one user reported a one-minute delay, and another – on his phone – had to wait around two minutes.
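Those delays are roughly what the arithmetic predicts: in leading-zero-bit schemes like the sketch above, each extra bit of difficulty doubles the expected number of hashes, and phones hash noticeably slower than desktops. The hash rates below are illustrative guesses, not measurements of Anubis or of any real device:

```python
# Back-of-the-envelope: expected hashes and solve time per difficulty level.
# Hash rates are illustrative guesses, not benchmarks of any real device.
HASH_RATES = {"desktop (guess)": 2_000_000, "phone (guess)": 200_000}  # hashes/s

for bits in (16, 20, 24):
    expected_hashes = 2 ** bits  # each extra bit doubles the expected work
    times = ", ".join(f"{name}: ~{expected_hashes / rate:.1f}s"
                      for name, rate in HASH_RATES.items())
    print(f"{bits} bits -> ~{expected_hashes:,} expected hashes ({times})")
```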

Why? Wel
29 Comments
totetsu
Could one put a mangler on the responses to suspected bots to poison their data sets with nonsense code.. :/
ericholscher
Yep — our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse… (quoted in the OP) — everyone I know who is running large internet infrastructure has a similar story — this post does a great job of rounding a bunch of them up in 1 place.
I called it when I wrote it, they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 — an engineer at the company saw our post and reached out, giving me the right email — which I then emailed 3x and never got a reply.
nonrandomstring
These are DDOS attacks and should be treated in law as such.
(Although I do realise that in many countries now we no longer have any effective "rule of law")
WesolyKubeczek
…at some point, some people started appreciating mailing lists and the distributed nature of Git again.
roenxi
We're close to finding a clear use-case for Bitcoin with this one.
KolmogorovComp
My question is, can we serve PoW challenges to these AI LLM scrapers that can be profitable?
brandonmenc
This article starts by citing a blog article – displays a screenshot of the article – but doesn't link to it.
wiredfool
Across my sites — mostly open data sites — the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.
At this point, I think we're well under 1% actual users on a good day.
megadata
Perhaps time to start a central community ban pool for IP ranges?
sir-alien
It's going to get to the point where everything will be put behind a login to prevent LLM scrapers scanning a site. Annoying but the only option I can think of. If they use an account for scraping you just ban the account.
napolux
In the past month, I've had to block LLM bots attacking my poor little VPSs — not once, but twice.
First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.
What's worse? They completely ignore a simple HTTP 429 status.
briandear
Does GitHub have this problem?
xena
It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.
aspir
Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been for 10+ years https://www.fastly.com/fast-forward (disclaimer, I work for Fastly and help with this program)
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
keepamovin
Adaptation – adapt or die. Find a business model that can sustain, without the naivety that people will pay for what they can take without consequence.
brushfoot
At this rate, it's more than FOSS infrastructure — although that's a canary in the coalmine I especially sympathize with — it's anonymous Internet access altogether.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article — or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent?
marginalia_nu
I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full fledged chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.
It's a shitty solution, but as it stands the status quo is quite untenable and will eventually have cloudflare as a spooky MITM for all the web's traffic.
FeepingCreature
To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.
lelanthran
The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the knowledge in current LLMs dries up because no one can scrape the web in an automated fashion anymore.
It'll all burn down.
immibis
This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.
QuadrupleA
Isn't this just poor, sloppy crawler implementation? You shouldn't need to fetch a repo more than once to add it to a training set.
rglullis
The days of an open web are long gone. Every server will eventually have to require authentication for access, and to get an account you will have to provide some form of payment or social proof.
Honestly, I don't see it necessarily as a bad thing.
harhargange
Could IPFS or torrents and large local databases, decentralised across people, be a solution to this? I personally have the resources to share and host TBs of data but haven't found a good use for them.
diggan
> According to Drew, LLM crawlers don't respect robots.txt and hit expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
How do they know that these are LLM crawlers and not anything else?
nzeid
We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.
biophysboy
Can someone with more experience developing AI tools explain what these bots are mostly doing? Are they collecting data for training, or are they for the more recent search functionality? Or are they enhancing responses with links?
piokoch
Copyright the content and sue those who use it for AI training. I believe there is a lot of low-hanging fruit for lawyers here. I would be surprised if they weren't preparing to hit OpenAI and the like. Very badly. Google got away with its deep-linking issues because publishers, after all, had some interest in being linked from the search engine; here publishers see zero value.
phkahler
So I'll just float an idea again that always gets rejected here. This is yet another problem that could be solved completely by… Eliminating anonymity by default on the internet.
To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt-in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDOS to spam, to malware infections to personal attacks and threats will be reduced when anonymity is removed.
Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
throwaway173738
Does anyone maintain a list of ai products developed without scraping?