AI bots hungry for data are taking down FOSS sites by accident, but humans are fighting back.
Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known crawler user-agents, and filtering suspicious traffic—Iaso found that AI crawlers continued evading all attempts to stop them, spoofing user-agents and cycling through residential IP addresses as proxies.
Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating “Anubis,” a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site. “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more,” Iaso wrote in a blog post titled “a desperate cry for help.” “I don’t want to have to close off my Gitea server to the public, but I will if I have to.”
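Anubis is a proof-of-work scheme in the spirit of hashcash: the visiting browser has to burn a little CPU before the server will serve the page, and the server can check the answer cheaply. As a rough sketch of that general idea only (not Anubis's actual code, and with made-up parameter values), a server might hand out a random challenge and accept the request once the client finds a nonce whose hash clears a difficulty target:

# A minimal hashcash-style proof-of-work sketch, illustrating the general idea
# behind challenge systems like Anubis; not its actual implementation, and the
# difficulty value here is arbitrary.
import hashlib
import os

DIFFICULTY_BITS = 16  # the hash must start with this many zero bits

def issue_challenge() -> str:
    # Server side: hand the browser a random challenge string.
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # Client side: brute-force a nonce (in a real deployment this would run
    # as JavaScript in the visitor's browser).
    nonce = 0
    while not _meets_difficulty(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    # Server side again: checking the answer costs a single hash.
    return _meets_difficulty(challenge, nonce)

def _meets_difficulty(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

The asymmetry is the point: a human visitor pays the cost once per session, while a crawler re-requesting millions of pages pays it on every one.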
Iaso’s story highlights a broader crisis rapidly spreading across the open source community, as what appear to be aggressive AI crawlers increasingly overload community-maintained infrastructure, causing what amounts to persistent distributed denial-of-service (DDoS) attacks on vital public resources. According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies’ bots, dramatically increasing bandwidth costs, destabilizing services, and burdening already stretched-thin maintainers.
Kevin Fenzi, a member of the Fedora Pagure project’s sysadmin team, reported on his blog that the project had to block all traffic from Brazil after repeated attempts to mitigate bot traffic failed. GNOME GitLab implemented Iaso’s “Anubis” system, requiring browsers to solve computational puzzles before accessing content. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE’s GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat.
While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. When many people access the same link simultaneously—such as when a GitLab link is shared in a chat room—site visitors can face significant delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete, according to the news outlet.
The situation isn’t exactly new. In December, Dennis Schubert, who maintains infrastructure for the Diaspora social network, described the situation as “literally a DDoS on the entire internet” after discovering that AI companies accounted for 70 percent of all web requests to their services.
The costs are both technical and financial. The Read the Docs project reported that blocking AI crawlers immediately and sharply decreased their traffic, and with it the project’s bandwidth bill.
27 Comments
ggm
Entire country blocks are lazy, and pragmatic. The US armed forces at one point blocked AU/NZ on 202/8 and 203/8 on a misunderstanding about packets from China, also from these blocks. Not so useful for military staff seconded into the region seeking to use public internet to get back to base.
People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)
grotorea
Is this stuff only affecting the not-for-profit web? What are the for-profit sites doing? I haven't seen Anubis around the web elsewhere. Are we just going to get more and tighter login walls and send everything into the deep web?
xena
Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!
edoloughlin
I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.
tedunangst
> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.
If the target goes down after you scrape it, that's a feature.
ipaddr
I've had a number of content sites.
I've shut down a few of them in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.
These were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. Now I also have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
egypturnash
This is 100% off-topic but:
Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.
101008
I have a blog with content about a non-tech topic and I haven't had any problems. I'm all against AI scrapers, but I haven't noticed any change since being behind Cloudflare (funny enough, if I ask GPT and Claude about my website, they know it).
hbcondo714
> many AI companies engage in web crawling
Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.
superkuh
While the motivations may be AI related the cause of the problem is the first and original type of non-human person: corporations. Corporations are doing this, not human persons, not AI.
j45
Surprised at how crawling as a whole seems to have taken a sustained step backwards, with old, long-solved best practices being new to devs today.
I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.
Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.
eevilspock
The irony is that the people and employees of the AI companies will vehemently defend the morality of capitalism, private property and free markets.
Their robber baron behavior reveals their true values and the reality of capitalism.
throwaway81523
Saaay, what happened to that assistant US attorney who prosecuted Aaron Swartz for doing exactly this? Oh wait, Aaron didn't have billions in VC backing. That's the difference.
haswell
Lately I’ve been thinking a lot about what it would take to create a sort of “friends and family” Internet using some combination of Headscale/Tailscale. I want to feel free and open about what I’m publishing again, and the modern internet is making that increasingly difficult.
rco8786
I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
CyberDildonics
Can't you just rate limit beyond what a person would ever notice and do it by slowing the response?
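For what it's worth, a minimal sketch of that approach as a framework-agnostic helper keyed by client IP (all names and thresholds here are made up, not from any of the projects in the article): requests beyond a generous per-minute budget get progressively delayed rather than rejected, so a human browsing normally never notices.

# Hypothetical per-IP throttle that slows heavy clients instead of blocking them.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
FREE_REQUESTS = 120        # far more than a person clicks in a minute
DELAY_PER_EXCESS = 0.5     # seconds added per request over the budget
MAX_DELAY = 10.0

_recent = defaultdict(deque)   # client IP -> timestamps of recent requests

def throttle(client_ip: str) -> None:
    # Call at the top of each request handler, before doing any real work.
    now = time.monotonic()
    history = _recent[client_ip]
    history.append(now)
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    excess = len(history) - FREE_REQUESTS
    if excess > 0:
        time.sleep(min(excess * DELAY_PER_EXCESS, MAX_DELAY))

The catch, per the article, is that these crawlers rotate through residential IP addresses, so any per-IP budget is easy for them to stay under while still hammering a site in aggregate.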
ANarrativeApe
Excuse my ignorance, but is it time to update the open source licenses in the light of this behavior?
If so, what should the evolved license wording be?
I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt…
nashashmi
And this is why the Internet has become a maze of CAPTCHAs.
kazinator
I've been seeing crawlers that report a User-Agent string like this:
Everything but the Chrome/ version is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string: always some two-digit major version like 69 or 70, then a .0., and then some funny minor and build numbers, typically a four-digit minor.
When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.
The attack quickly abated.
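The rule itself isn't quoted, but the idea translates to something like the following hypothetical request hook (a guess at the fingerprint being matched, not kazinator's actual rewrite rule): flag User-Agent strings that claim an implausibly old Chrome major version and route them to the honeypot.

# Hypothetical filter for the stale-Chrome fingerprint described above.
import re

# Matches strings like "Chrome/69.0.3497.100" and captures the major version.
CHROME_RE = re.compile(r"Chrome/(\d+)\.0\.(\d+)\.(\d+)")
HONEYPOT_PATH = "/honeypot"
OLDEST_PLAUSIBLE_MAJOR = 100   # real browsers auto-update well past this

def rewrite_target(user_agent: str, requested_path: str) -> str:
    match = CHROME_RE.search(user_agent or "")
    if match and int(match.group(1)) < OLDEST_PLAUSIBLE_MAJOR:
        return HONEYPOT_PATH
    return requested_path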
userbinator
All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions with answers that LLMs cannot yet figure out or get consistently wrong. The more related to the site's content the questions are, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.
IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.
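A gate like the one described above needs almost no machinery; here is a hypothetical sketch, with placeholder questions standing in for whatever is genuinely specific to a given site's community.

# Hypothetical registration gate built from site-specific questions rather
# than an image CAPTCHA or a JavaScript challenge. Questions and answers are
# placeholders for an imagined electronics forum.
import secrets

QUESTIONS = {
    # question text -> set of accepted answers, compared case-insensitively
    "What does the abbreviation DUT stand for on this forum?": {"device under test"},
    "Which board does our beginner series tell you to build first?": {"power supply"},
}

def pick_question() -> str:
    return secrets.choice(list(QUESTIONS))

def check_answer(question: str, answer: str) -> bool:
    return answer.strip().lower() in QUESTIONS.get(question, set())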
jrvarela56
This is going to start happening to brick-and-mortar businesses through their customer support channels.
instagib
Can we force the bots to mine cryptocurrency?
Nckpz
I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"
internet101010
I block all traffic except that which comes from the country of Cloudflare.
rfurmani
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
yieldcrv
nice, a free way to keep our IPFS pins alive