Please stop externalizing your costs directly into my face by Tomte

20 Comments

  • easton
    Posted March 18, 2025 at 10:00 am

    It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?

    Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood, I’d probably consider it.

    (Either that, or make following robots.txt a legal requirement, but that also feels like it would stifle hobbyists who just want to scrape a page.)

  • MathMonkeyMan
    Posted March 18, 2025 at 10:02 am

    Good rant!

    The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

    > random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

  • colordrops
    Posted March 18, 2025 at 10:05 am

    I had a friend that said he'd never get a mobile phone, and he did hold out until maybe 2010. He eventually realized the world had changed.

  • thedevilslawyer
    Posted March 18, 2025 at 10:06 am

    Was with Drew until the solution:

    >Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.

    Anyone with any inkling of the current industry understands this is /Old-Man-Yells-At-Cloud. Why this is even offered as a solution, I'm unsure.

    Here are some potential solutions:
    1) Cloudflare.
    2) Your own DDoS-protecting reverse proxy using a CAPTCHA, etc.
    3) Poisoning scraping results (with syntax errors in code, lol), so they ditch you as a good source of data (see the sketch below).
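
    A rough sketch of option 3, assuming a small Flask app that serves code snippets; the bot markers and the poison() heuristic are purely illustrative, and real scrapers spoof their User-Agent, so this would only catch the honest ones:

      # Hypothetical sketch: serve subtly broken code to suspected scrapers
      # so the site becomes worthless as training data. The User-Agent check
      # is naive on purpose; the article notes the worst offenders fake it.
      import random

      from flask import Flask, request

      app = Flask(__name__)

      # Assumed crawler markers; adjust to whatever shows up in your logs.
      KNOWN_BOT_MARKERS = ("GPTBot", "ClaudeBot", "PetalBot", "CCBot")

      def looks_like_bot(user_agent: str) -> bool:
          return any(marker in user_agent for marker in KNOWN_BOT_MARKERS)

      def poison(source: str) -> str:
          """Introduce one subtle syntax error into a served code snippet."""
          lines = source.splitlines()
          if lines:
              i = random.randrange(len(lines))
              lines[i] = lines[i].replace(":", "", 1)  # quietly drop a colon
          return "\n".join(lines)

      @app.route("/snippets/<name>")
      def snippet(name: str):
          # No path sanitization here; this is a sketch, not production code.
          with open(f"snippets/{name}") as f:
              source = f.read()
          if looks_like_bot(request.headers.get("User-Agent", "")):
              return poison(source), 200, {"Content-Type": "text/plain"}
          return source, 200, {"Content-Type": "text/plain"}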

  • fungiblecog
    Posted March 18, 2025 at 10:07 am

    Welcome to 2025…

  • jisnsm
    Posted March 18, 2025 at 10:07 am

    I work for a hosting company and I know what this is like. And, while I completely respect that you don’t want to give out your resources for free, a properly programmed and/or cached website wouldn’t be brought down by crawlers, no matter how aggressive. Crawlers hit our clients’ sites all the same, but you only hear about problems from those with piece-of-shit websites that take seconds to generate.

  • myaccountonhn
    Posted March 18, 2025 at 10:10 am

    I really wonder how best to deal with this issue; it almost seems like all web traffic needs to be behind CDNs, which is horrible for the open web.

  • h4kor
    Posted March 18, 2025 at 10:12 am

    I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.

  • kelseydh
    Posted March 18, 2025 at 10:17 am

    The internet you knew and mastered no longer exists.

  • sussmannbaka
    Posted March 18, 2025 at 10:18 am

    Just like with crypto, AI culture is fundamentally tainted (see this very thread, where people are defending this shit). Legislation hasn’t caught up, and doing what’s essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don’t foot the bill.
    If you ever ask yourself why everybody else hates you and your ilk, this attitude is why.
    The thing I don’t understand with AI, though, is that without the host, your parasite is nothing. Shouldn’t it be in your best interest not to kill it?

  • me2too
    Posted March 18, 2025 at 10:19 am

    He's just right

  • stavros
    Posted March 18, 2025 at 10:22 am

    What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.

  • throwawayffffas
    Posted March 18, 2025 at 10:25 am

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    After using Claude Code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.

  • InDubioProRubio
    Posted March 18, 2025 at 10:26 am

    All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that suppressed the plants' protection system, aka the state.

    Once the Allmende (the commons), the grass, runs out, things get interesting. We shall see cyberpunk-like computational parasitism plaguing companies, and attempts to filter this work out. I guess that is the only way to really prevent that sort of bot: you take arbitrary pieces of the unit of work they want done and reject them on principle, and depending on the cleverness of the batch algorithm, they will come back with them again and again, identifying themselves via repetition.

  • petercooper
    Posted March 18, 2025 at 10:27 am

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?

  • secondcoming
    Posted March 18, 2025 at 10:30 am

    What would happen if these AI crawlers scraped data that was deliberately incorrect? Like a subtly broken algorithm implementation.

  • dalke
    Posted March 18, 2025 at 10:33 am

    I have a small static site. I haven't touched it in a couple of years.

    Even then, I see bot after bot, pulling down about 1/2 GB per day.

    Like, I distribute Python wheels from my site, with several release versions X several Python versions.

    I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:

      Last-Modified: Thu, 25 May 2023 09:07:25 GMT
      ETag: "8c2f67-5fc80f2f3b3e6"
    

    Well, I know the answer to the second question, as DeVault's title highlights – it's cheaper to re-read the data and re-process the content than to set up a local cache (see the conditional-request sketch below).

    Externalizing their costs onto me.

    I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.

    Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
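
    As a rough illustration of what respecting those headers would look like, here is a minimal conditional-GET sketch in Python, assuming the requests library; a crawler that gets 304 Not Modified back skips the re-download entirely. The URL is a placeholder, and the header values are the ones shown above:

      # Conditional GET sketch: re-fetch only if the content actually changed.
      import requests

      url = "https://example.com/some-wheel.whl"  # placeholder URL

      headers = {
          # Values remembered from the previous fetch of this URL:
          "If-None-Match": '"8c2f67-5fc80f2f3b3e6"',
          "If-Modified-Since": "Thu, 25 May 2023 09:07:25 GMT",
      }

      resp = requests.get(url, headers=headers)
      if resp.status_code == 304:
          print("Not modified; reuse the cached copy")
      else:
          print(f"Changed; downloaded {len(resp.content)} bytes")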

  • CGamesPlay
    Posted March 18, 2025 at 10:34 am

    I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?

    Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).
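
    A minimal sketch of the proof-of-work idea from this comment, with illustrative names and parameters: the server hands out a random challenge, the client burns CPU to find a nonce whose hash clears a difficulty threshold, and the server verifies with a single hash before minting a token.

      # Proof-of-work sketch: expensive for the client, cheap for the server.
      import hashlib
      import secrets

      DIFFICULTY_BITS = 20  # ~2^20 hash attempts on average per token

      def issue_challenge() -> str:
          # Server side: a fresh random challenge per visitor.
          return secrets.token_hex(16)

      def solve(challenge: str) -> int:
          # Client side: find a nonce so that sha256(challenge + nonce)
          # starts with DIFFICULTY_BITS zero bits.
          target = 1 << (256 - DIFFICULTY_BITS)
          nonce = 0
          while True:
              digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
              if int.from_bytes(digest, "big") < target:
                  return nonce
              nonce += 1

      def verify(challenge: str, nonce: int) -> bool:
          # Server side: one hash to check the client's work before minting
          # a short-lived access token (cookie, Privacy Pass-style token, etc.).
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
          return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

      challenge = issue_challenge()
      nonce = solve(challenge)
      assert verify(challenge, nonce)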

  • iinnPP
    Posted March 18, 2025 at 10:36 am

    Assuming you can prove it's a company, doesn't the behavior equate to fraud? Seems to hit all the prongs, but I am no lawyer.

  • sirodoht
    Posted March 18, 2025 at 11:17 am

    I disagree with the premise of this article, which is: the internet used to be so nice when it wasn't the center of humanity's activities but now everything sucks.

    I read this: "I am sick and tired of having all of these costs externalized directly into my fucking face", as: "I am sick and tired of doing something for the world and the world not appreciating it because they want to do their own things".

    Lack of appreciation is underrated as a problem because it's abstracted away by the mechanics of the free market ("if you don't want it, just don't buy it"). Yet markets are quite foundational to humanity's activities, and a market is a place where everybody can put up a stall. Now Drew can't have his stall, and he's angry. But anger is rarely part of the solution.

    His path is that of anger, unfortunately, with some hate added in too:

    > "If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts."
