AI bots hungry for data are taking down FOSS sites by accident, but humans are fighting back.
Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known crawler user-agents, and filtering suspicious traffic—Iaso found that AI crawlers continued evading all attempts to stop them, spoofing user-agents and cycling through residential IP addresses as proxies.
Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating “Anubis,” a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site. “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more,” Iaso wrote in a blog post titled “a desperate cry for help.” “I don’t want to have to close off my Gitea server to the public, but I will if I have to.”
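Anubis is a proof-of-work scheme in the spirit of hashcash: the visiting browser has to burn a little CPU before the server will serve the page, and the server can check the answer cheaply. As a rough sketch of that general idea only (not Anubis's actual code, and with made-up parameter values), a server might hand out a random challenge and accept the request once the client finds a nonce whose hash clears a difficulty target:

# A minimal hashcash-style proof-of-work sketch, illustrating the general idea
# behind challenge systems like Anubis; not its actual implementation, and the
# difficulty value here is arbitrary.
import hashlib
import os

DIFFICULTY_BITS = 16  # the hash must start with this many zero bits

def issue_challenge() -> str:
    # Server side: hand the browser a random challenge string.
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # Client side: brute-force a nonce (in a real deployment this would run
    # as JavaScript in the visitor's browser).
    nonce = 0
    while not _meets_difficulty(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    # Server side again: checking the answer costs a single hash.
    return _meets_difficulty(challenge, nonce)

def _meets_difficulty(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

The asymmetry is the point: a human visitor pays the cost once per session, while a crawler re-requesting millions of pages pays it on every one.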
Iaso’s story highlights a broader crisis rapidly spreading across the open source community, as what appear to be aggressive AI crawlers increasingly overload community-maintained infrastructure, causing what amounts to persistent distributed denial-of-service (DDoS) attacks on vital public resources. According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies’ bots, dramatically increasing bandwidth costs, destabilizing services, and burdening already stretched-thin maintainers.
Kevin Fenzi, a member of the Fedora Pagure project’s sysadmin team, reported on his blog that the project had to block all traffic from Brazil after repeated attempts to mitigate bot traffic failed. GNOME GitLab implemented Iaso’s “Anubis” system, requiring browsers to solve computational puzzles before accessing content. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE’s GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat.
While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. When many people access the same link simultaneously—such as when a GitLab link is shared in a chat room—site visitors can face significant delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete, according to the news outlet.
The situation isn’t exactly new. In December, Dennis Schubert, who maintains infrastructure for the Diaspora social network, described the situation as “literally a DDoS on the entire internet” after discovering that AI companies accounted for 70 percent of all web requests to their services.
The costs are both technical and financial. The Read the Docs project reported that blocking AI crawlers immediately and sharply decreased their traffic, and with it the project’s bandwidth bill.
27 Comments
ggm
Entire country blocks are lazy, and pragmatic. The US armed forces at one point blocked AU/NZ on 202/8 and 203/8 on a misunderstanding about packets from China, also from these blocks. Not so useful for military staff seconded into the region seeking to use public internet to get back to base.
People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)
grotorea
Is this stuff only affecting the not-for-profit web? What are the for-profit sites doing? I haven't seen Anubis around the web elsewhere. Are we just going to get more and tighter login walls and send everything into the deep web?
xena
Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!
edoloughlin
I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.
tedunangst
> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.
If the target goes down after you scrape it, that's a feature.
ipaddr
I've had a number of content sites.
I've shut down a few of them in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.
These were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. Now I also have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
egypturnash
This is 100% off-topic but:
Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.
101008
I have a blog with content about a non-tech topic and I haven't had any problems. I'm all against AI scrapers, but I haven't noticed any change since being behind Cloudflare (funny enough, if I ask GPT and Claude about my website, they know it).
hbcondo714
> many AI companies engage in web crawling
Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.
superkuh
While the motivations may be AI related the cause of the problem is the first and original type of non-human person: corporations. Corporations are doing this, not human persons, not AI.
j45
Surprised at how crawling as a whole seems to have taken a sustained step backwards, with old, long-solved best practices being new to devs today.
I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.
Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.
eevilspock
The irony is that the people and employees of the AI companies will vehemently defend the morality of capitalism, private property and free markets.
Their robber baron behavior reveals their true values and the reality of capitalism.
throwaway81523
Saaay, what happened to that assistant US attorney who prosecuted Aaron Swartz for doing exactly this? Oh wait, Aaron didn't have billions in VC backing. That's the difference.
haswell
Lately I’ve been thinking a lot about what it would take to create a sort of “friends and family” Internet using some combination of Headscale/Tailscale. I want to feel free and open about what I’m publishing again, and the modern internet is making that increasingly difficult.
rco8786
I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
CyberDildonics
Can't you just rate limit beyond what a person would ever notice and do it by slowing the response?
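For what it's worth, a minimal sketch of that approach as a framework-agnostic helper keyed by client IP (all names and thresholds here are made up, not from any of the projects in the article): requests beyond a generous per-minute budget get progressively delayed rather than rejected, so a human browsing normally never notices.

# Hypothetical per-IP throttle that slows heavy clients instead of blocking them.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
FREE_REQUESTS = 120        # far more than a person clicks in a minute
DELAY_PER_EXCESS = 0.5     # seconds added per request over the budget
MAX_DELAY = 10.0

_recent = defaultdict(deque)   # client IP -> timestamps of recent requests

def throttle(client_ip: str) -> None:
    # Call at the top of each request handler, before doing any real work.
    now = time.monotonic()
    history = _recent[client_ip]
    history.append(now)
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    excess = len(history) - FREE_REQUESTS
    if excess > 0:
        time.sleep(min(excess * DELAY_PER_EXCESS, MAX_DELAY))

The catch, per the article, is that these crawlers rotate through residential IP addresses, so any per-IP budget is easy for them to stay under while still hammering a site in aggregate.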
ANarrativeApe
Excuse my ignorance, but is it time to update the open source licenses in the light of this behavior?
If so, what should the evolved license wording be?
I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt…
nashashmi
And this is why the Internet has become a maze of CAPTCHAs.
kazinator
I've been seeing crawlers that report a User-Agent string like this:
Everything but the Chrome/ version is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string: always some two-digit major version like 69 or 70, then a .0., and then some funny minor and build numbers, typically a four-digit minor.
When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.
The attack quickly abated.
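The rule itself isn't quoted, but the idea translates to something like the following hypothetical request hook (a guess at the fingerprint being matched, not kazinator's actual rewrite rule): flag User-Agent strings that claim an implausibly old Chrome major version and route them to the honeypot.

# Hypothetical filter for the stale-Chrome fingerprint described above.
import re

# Matches strings like "Chrome/69.0.3497.100" and captures the major version.
CHROME_RE = re.compile(r"Chrome/(\d+)\.0\.(\d+)\.(\d+)")
HONEYPOT_PATH = "/honeypot"
OLDEST_PLAUSIBLE_MAJOR = 100   # real browsers auto-update well past this

def rewrite_target(user_agent: str, requested_path: str) -> str:
    match = CHROME_RE.search(user_agent or "")
    if match and int(match.group(1)) < OLDEST_PLAUSIBLE_MAJOR:
        return HONEYPOT_PATH
    return requested_path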
userbinator
All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions with answers that LLMs cannot yet figure out or get consistently wrong. The more related to the site's content the questions are, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.
IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.
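A gate like the one described above needs almost no machinery; here is a hypothetical sketch, with placeholder questions standing in for whatever is genuinely specific to a given site's community.

# Hypothetical registration gate built from site-specific questions rather
# than an image CAPTCHA or a JavaScript challenge. Questions and answers are
# placeholders for an imagined electronics forum.
import secrets

QUESTIONS = {
    # question text -> set of accepted answers, compared case-insensitively
    "What does the abbreviation DUT stand for on this forum?": {"device under test"},
    "Which board does our beginner series tell you to build first?": {"power supply"},
}

def pick_question() -> str:
    return secrets.choice(list(QUESTIONS))

def check_answer(question: str, answer: str) -> bool:
    return answer.strip().lower() in QUESTIONS.get(question, set())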
jrvarela56
This is going to start happening to brick-and-mortar businesses through their customer support channels.
instagib
Can we force the bots to mine cryptocurrency?
Nckpz
I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"
internet101010
I block all traffic except that which comes from the country of Cloudflare.
rfurmani
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
yieldcrv
nice, a free way to keep our IPFS pins alive