Please stop externalizing your costs directly into my face by Tomte

20 Comments

  • easton
    Posted March 18, 2025 at 10:00 am

    It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?

    Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood, I’d probably consider it.

    (Either that, or make following robots.txt a legal requirement, but that also feels like it would stifle hobbyists who just want to scrape a page.)

  • MathMonkeyMan
    Posted March 18, 2025 at 10:02 am

    Good rant!

    The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

    > random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

  • colordrops
    Posted March 18, 2025 at 10:05 am

    I had a friend that said he'd never get a mobile phone, and he did hold out until maybe 2010. He eventually realized the world had changed.

  • thedevilslawyer
    Posted March 18, 2025 at 10:06 am

    Was with Drew until the solution:

    >Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.

    Anyone with any inkling of the current industry understands this is /Old-Man-Yells-At-Cloud. Why this is even offered as a solution, I'm unsure.

    Here are some potential solutions:
    1) Cloudflare.
    2) Your own DDoS-protecting reverse proxy using a CAPTCHA, etc.
    3) Poisoning scraping results (with syntax errors in code, lol), so they ditch you as a good source of data (see the sketch below).
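
    A rough sketch of option 3, assuming a small Flask app that serves code snippets; the bot markers and the poison() heuristic are purely illustrative, and real scrapers spoof their User-Agent, so this would only catch the honest ones:

      # Hypothetical sketch: serve subtly broken code to suspected scrapers
      # so the site becomes worthless as training data. The User-Agent check
      # is naive on purpose; the article notes the worst offenders fake it.
      import random

      from flask import Flask, request

      app = Flask(__name__)

      # Assumed crawler markers; adjust to whatever shows up in your logs.
      KNOWN_BOT_MARKERS = ("GPTBot", "ClaudeBot", "PetalBot", "CCBot")

      def looks_like_bot(user_agent: str) -> bool:
          return any(marker in user_agent for marker in KNOWN_BOT_MARKERS)

      def poison(source: str) -> str:
          """Introduce one subtle syntax error into a served code snippet."""
          lines = source.splitlines()
          if lines:
              i = random.randrange(len(lines))
              lines[i] = lines[i].replace(":", "", 1)  # quietly drop a colon
          return "\n".join(lines)

      @app.route("/snippets/<name>")
      def snippet(name: str):
          # No path sanitization here; this is a sketch, not production code.
          with open(f"snippets/{name}") as f:
              source = f.read()
          if looks_like_bot(request.headers.get("User-Agent", "")):
              return poison(source), 200, {"Content-Type": "text/plain"}
          return source, 200, {"Content-Type": "text/plain"}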

  • fungiblecog
    Posted March 18, 2025 at 10:07 am

    Welcome to 2025…

  • jisnsm
    Posted March 18, 2025 at 10:07 am

    I work for a hosting company and I know what this is like. And, while I completely respect that you don’t want to give out your resources for free, a properly programmed and/or cached website wouldn’t be brought down by crawlers, no matter how aggressive. Crawlers hit our clients’ sites all the same, but you only hear about problems from those with piece-of-shit websites that take seconds to generate.

  • myaccountonhn
    Posted March 18, 2025 at 10:10 am

    I really wonder how best to deal with this issue; it almost seems like all web traffic needs to be behind CDNs, which is horrible for the open web.

  • h4kor
    Posted March 18, 2025 at 10:12 am

    I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.

  • kelseydh
    Posted March 18, 2025 at 10:17 am

    The internet you knew and mastered no longer exists.

  • sussmannbaka
    Posted March 18, 2025 at 10:18 am

    Just like with crypto, AI culture is fundamentally tainted (see this very thread, where people are defending this shit). Legislation hasn’t caught up, and doing what’s essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don’t foot the bill.
    If you ever ask yourself why everybody else hates you and your ilk, this attitude is why.
    The thing I don’t understand with AI, though, is that without the host, your parasite is nothing. Shouldn’t it be in your best interest not to kill it?

  • me2too
    Posted March 18, 2025 at 10:19 am

    He's just right

  • stavros
    Posted March 18, 2025 at 10:22 am

    What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.

  • throwawayffffas
    Posted March 18, 2025 at 10:25 am

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    After using Claude Code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.

  • InDubioProRubio
    Posted March 18, 2025 at 10:26 am

    All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that suppressed the plants' protection system, aka the state.

    Once the Allmende (the commons), the grass, runs out, things get interesting. We shall see cyberpunk-like computational parasitism plaguing companies, and attempts to filter this work out. I guess that is the only way to really prevent that sort of bot: you take arbitrary pieces of the unit of work they want done and reject them on principle, and depending on the cleverness of the batch algorithm, they will come back with them again and again, identifying themselves via repetition.

  • petercooper
    Posted March 18, 2025 at 10:27 am

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?

  • secondcoming
    Posted March 18, 2025 at 10:30 am

    What would happen if these AI crawlers scraped data that was deliberately incorrect? Like a subtly broken algorithm implementation.

  • dalke
    Posted March 18, 2025 at 10:33 am

    I have a small static site. I haven't touched it in a couple of years.

    Even then, I see bot after bot, pulling down about 1/2 GB per day.

    Like, I distribute Python wheels from my site, with several release versions X several Python versions.

    I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:

      Last-Modified: Thu, 25 May 2023 09:07:25 GMT
      ETag: "8c2f67-5fc80f2f3b3e6"
    

    Well, I know the answer to the second question, as DeVault's title highlights – it's cheaper to re-read the data and re-process the content than to set up a local cache (see the conditional-request sketch below).

    Externalizing their costs onto me.

    I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.

    Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
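
    As a rough illustration of what respecting those headers would look like, here is a minimal conditional-GET sketch in Python, assuming the requests library; a crawler that gets 304 Not Modified back skips the re-download entirely. The URL is a placeholder, and the header values are the ones shown above:

      # Conditional GET sketch: re-fetch only if the content actually changed.
      import requests

      url = "https://example.com/some-wheel.whl"  # placeholder URL

      headers = {
          # Values remembered from the previous fetch of this URL:
          "If-None-Match": '"8c2f67-5fc80f2f3b3e6"',
          "If-Modified-Since": "Thu, 25 May 2023 09:07:25 GMT",
      }

      resp = requests.get(url, headers=headers)
      if resp.status_code == 304:
          print("Not modified; reuse the cached copy")
      else:
          print(f"Changed; downloaded {len(resp.content)} bytes")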

  • CGamesPlay
    Posted March 18, 2025 at 10:34 am

    I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?

    Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).
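
    A minimal sketch of the proof-of-work idea from this comment, with illustrative names and parameters: the server hands out a random challenge, the client burns CPU to find a nonce whose hash clears a difficulty threshold, and the server verifies with a single hash before minting a token.

      # Proof-of-work sketch: expensive for the client, cheap for the server.
      import hashlib
      import secrets

      DIFFICULTY_BITS = 20  # ~2^20 hash attempts on average per token

      def issue_challenge() -> str:
          # Server side: a fresh random challenge per visitor.
          return secrets.token_hex(16)

      def solve(challenge: str) -> int:
          # Client side: find a nonce so that sha256(challenge + nonce)
          # starts with DIFFICULTY_BITS zero bits.
          target = 1 << (256 - DIFFICULTY_BITS)
          nonce = 0
          while True:
              digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
              if int.from_bytes(digest, "big") < target:
                  return nonce
              nonce += 1

      def verify(challenge: str, nonce: int) -> bool:
          # Server side: one hash to check the client's work before minting
          # a short-lived access token (cookie, Privacy Pass-style token, etc.).
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
          return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

      challenge = issue_challenge()
      nonce = solve(challenge)
      assert verify(challenge, nonce)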

  • iinnPP
    Posted March 18, 2025 at 10:36 am

    Assuming you can prove it's a company, doesn't the behavior equate to fraud? Seems to hit all the prongs, but I am no lawyer.

  • sirodoht
    Posted March 18, 2025 at 11:17 am

    I disagree with the premise of this article, which is: the internet used to be so nice when it wasn't the center of humanity's activities but now everything sucks.

    I read this: "I am sick and tired of having all of these costs externalized directly into my fucking face", as: "I am sick and tired of doing something for the world and the world not appreciating it because they want to do their own things".

    Lack of appreciation is underrated as a problem because it's abstracted away by the mechanics of the free market ("if you don't want it, just don't buy it"). Yet markets are quite foundational to humanity's activities, and a market is a place where everybody can put up a stall. Now Drew can't have his stall, and he's angry. But anger is rarely part of the solution.

    His path is that of anger, unfortunately, with some hate added in too:

    > "If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts."
