
FOSS infrastructure is under attack by AI companies by todsacerdoti

29 Comments

  • Post Author
    totetsu
    Posted March 20, 2025 at 1:04 pm

    Could one put a mangler on the responses to suspected bots, to poison their data sets with nonsense code? :/
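
    A minimal sketch of that idea, assuming a Flask app and a made-up is_suspected_bot() heuristic (both hypothetical, purely for illustration):

```python
# Sketch only: serve suspected bots a garbled copy of the page so that
# scraped code is syntactically plausible but semantically nonsense.
# is_suspected_bot() is a placeholder for whatever heuristic you trust.
import random
import re

from flask import Flask, request

app = Flask(__name__)

def is_suspected_bot(req) -> bool:
    # Placeholder heuristic: flag empty or obviously scripted user agents.
    ua = req.headers.get("User-Agent", "")
    return ua == "" or "python-requests" in ua.lower()

def mangle(text: str) -> str:
    # Shuffle identifiers so the output still looks like code but isn't.
    tokens = re.findall(r"[A-Za-z_]{4,}", text)
    shuffled = tokens[:]
    random.shuffle(shuffled)
    mapping = dict(zip(tokens, shuffled))
    return re.sub(r"[A-Za-z_]{4,}", lambda m: mapping[m.group(0)], text)

@app.route("/snippet")
def snippet():
    real = "def parse_config(path):\n    return open(path).read()\n"
    return mangle(real) if is_suspected_bot(request) else real
```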

  • Post Author
    ericholscher
    Posted March 20, 2025 at 1:05 pm

    Yep — our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse… (quoted in the OP). Everyone I know who is running large internet infrastructure has a similar story — this post does a great job of rounding a bunch of them up in one place.

    I called it when I wrote it: they are just burning their goodwill to the ground.

    I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, and the link in their User Agent led to a 404. An engineer at the company saw our post and reached out with the right email address, which I then emailed three times and never got a reply.

  • Post Author
    nonrandomstring
    Posted March 20, 2025 at 1:06 pm

    These are DDoS attacks and should be treated in law as such. (Although I do realise that in many countries we no longer have any effective "rule of law".)

  • Post Author
    WesolyKubeczek
    Posted March 20, 2025 at 1:12 pm

    …at some point, some people started appreciating mailing lists and the distributed nature of Git again.

  • Post Author
    roenxi
    Posted March 20, 2025 at 1:14 pm

    We're close to finding a clear use-case for Bitcoin with this one.

  • Post Author
    KolmogorovComp
    Posted March 20, 2025 at 1:14 pm

    My question is: can we serve PoW challenges to these AI LLM scrapers in a way that's profitable?
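
    For reference, a hashcash-style proof-of-work check is straightforward to sketch (illustrative Python, not any particular project's protocol); the catch is that burnt client CPU isn't redeemable by the server, so "profitable" would need something closer to making the client mine a cryptocurrency:

```python
# Hashcash-style PoW sketch: the server issues a random challenge and only
# serves the page once the client finds a nonce whose SHA-256 digest of
# "challenge:nonce" starts with DIFFICULTY_BITS zero bits.
import hashlib
import secrets

DIFFICULTY_BITS = 20  # ~1M attempts on average; tune to taste

def make_challenge() -> str:
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

def solve(challenge: str) -> int:
    # The work a well-behaved client does before its request is served.
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

if __name__ == "__main__":
    c = make_challenge()
    print("solved with nonce", solve(c))
```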

  • Post Author
    brandonmenc
    Posted March 20, 2025 at 1:15 pm

    This article starts by citing a blog article – even displays a screenshot of it – but doesn't link to it.

  • Post Author
    wiredfool
    Posted March 20, 2025 at 1:18 pm

    Across my sites — mostly open data sites — the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.

    At this point, I think we're well under 1% actual users on a good day.

  • Post Author
    megadata
    Posted March 20, 2025 at 1:18 pm

    Perhaps time to start a central community ban pool for IP ranges?

  • Post Author
    sir-alien
    Posted March 20, 2025 at 1:19 pm

    It's going to get to the point where everything will be put behind a login to prevent LLM scrapers from scanning a site. Annoying, but it's the only option I can think of. If they use an account for scraping, you just ban the account.

  • Post Author
    napolux
    Posted March 20, 2025 at 1:19 pm

    In the past month, I've had to block LLM bots attacking my poor little VPSs — not once, but twice.

    First it was Facebook (https://news.ycombinator.com/item?id=23490367), and now it's these other companies.

    What's worse? They completely ignore a simple HTTP 429 status.
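
    One way to escalate when 429s are ignored is to start dropping the offender outright; a rough sketch, assuming a Flask app with in-memory counters (in practice this belongs in the reverse proxy or a tool like fail2ban):

```python
# Sketch: clients that keep hammering after repeated 429s get banned.
# In-memory state is illustrative only and resets on restart.
import time
from collections import defaultdict

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW = 60      # seconds
LIMIT = 100      # requests allowed per window
BAN_AFTER = 5    # number of 429s served before an outright ban

hits = defaultdict(list)    # ip -> recent request timestamps
strikes = defaultdict(int)  # ip -> number of 429s served
banned = set()

@app.before_request
def throttle():
    ip = request.remote_addr
    if ip in banned:
        abort(403)
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    hits[ip].append(now)
    if len(hits[ip]) > LIMIT:
        strikes[ip] += 1
        if strikes[ip] >= BAN_AFTER:
            banned.add(ip)
        abort(429)

@app.route("/")
def index():
    return "ok"
```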

  • Post Author
    briandear
    Posted March 20, 2025 at 1:21 pm

    Does GitHub have this problem?

  • Post Author
    xena
    Posted March 20, 2025 at 1:22 pm

    It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.

  • Post Author
    aspir
    Posted March 20, 2025 at 1:22 pm

    Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been doing so for 10+ years: https://www.fastly.com/fast-forward (disclaimer: I work for Fastly and help with this program).

    Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.

  • Post Author
    keepamovin
    Posted March 20, 2025 at 1:23 pm

    Adaptation – adapt or die. Find a business model that can sustain itself, without the naive assumption that people will pay for what they can take without consequence.

  • Post Author
    brushfoot
    Posted March 20, 2025 at 1:24 pm

    At this rate, it's more than FOSS infrastructure — although that's a canary in the coalmine I especially sympathize with — it's anonymous Internet access altogether.

    Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article — or even real user agents because they're wired up to something like Playwright.

    What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent?

  • Post Author
    marginalia_nu
    Posted March 20, 2025 at 1:24 pm

    I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap, cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full-fledged Chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.

    It's a shitty solution, but the status quo is quite untenable and will eventually leave Cloudflare as a spooky MITM for all the web's traffic.
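
    As a rough illustration of the DKIM-like idea, using the Python cryptography package (the signed-request format and the key-distribution step are invented here; a real scheme would publish the key somewhere verifiable, e.g. DNS):

```python
# Sketch of a verifiable crawler identity: the crawler signs each request
# with a long-lived Ed25519 key whose public half is published, and the
# server checks the signature before deciding how much to trust the client.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Crawler side: one long-lived identity key.
identity_key = Ed25519PrivateKey.generate()
public_key = identity_key.public_key()  # this is what would be published

def sign_request(method: str, path: str, date: str) -> bytes:
    return identity_key.sign(f"{method} {path} {date}".encode())

# Server side: verify, then apply whatever reputation this identity has earned.
def is_verified_crawler(method: str, path: str, date: str, sig: bytes) -> bool:
    try:
        public_key.verify(sig, f"{method} {path} {date}".encode())
        return True
    except InvalidSignature:
        return False

sig = sign_request("GET", "/git/log", "Thu, 20 Mar 2025 13:24:00 GMT")
print(is_verified_crawler("GET", "/git/log", "Thu, 20 Mar 2025 13:24:00 GMT", sig))
```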

  • Post Author
    FeepingCreature
    Posted March 20, 2025 at 1:26 pm

    To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.

  • Post Author
    lelanthran
    Posted March 20, 2025 at 1:27 pm

    The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

    This is because the only way to stop the bots is with a captcha, which also stops search indexers from indexing your site. With nothing left to index, search engines stop providing any value.

    There will probably be a small lag as the knowledge in current LLMs dries up, because no one can scrape the web in an automated fashion anymore.

    It'll all burn down.

  • Post Author
    immibis
    Posted March 20, 2025 at 1:27 pm

    This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.

  • Post Author
    QuadrupleA
    Posted March 20, 2025 at 1:28 pm

    Isn't this just poor, sloppy crawler implementation? You shouldn't need to fetch a repo more than once to add it to a training set.

  • Post Author
    rglullis
    Posted March 20, 2025 at 1:30 pm

    The days of an open web are long gone. Every server will eventually have to require authentication for access, and to get an account you will have to provide some form of payment or social proof.

    Honestly, I don't see it necessarily as a bad thing.

  • Post Author
    harhargange
    Posted March 20, 2025 at 1:30 pm

    Could IPFS, torrents, and large local databases decentralised across people be a solution to this? I personally have the resources to share and host TBs of data but haven't found a good use for them.

  • Post Author
    diggan
    Posted March 20, 2025 at 1:31 pm

    > According to Drew, LLM crawlers don't respect robots.txt requirements and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.

    How do they know that these are LLM crawlers and not anything else?

  • Post Author
    nzeid
    Posted March 20, 2025 at 1:32 pm

    We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.
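
    Consuming such a shared list is the easy part; a sketch with Python's ipaddress module (the entries below are documentation ranges, and the feed itself is hypothetical):

```python
# Sketch: check incoming addresses against a shared blocklist of CIDR
# ranges, in the spirit of a Spamhaus-style DROP feed.
import ipaddress

# In practice this would be fetched and refreshed from the community feed.
BLOCKLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKLIST)

print(is_blocked("203.0.113.57"))  # True
print(is_blocked("192.0.2.1"))     # False
```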

  • Post Author
    biophysboy
    Posted March 20, 2025 at 1:32 pm

    Can someone with more experience developing AI tools explain what these bots are mostly doing? Are they collecting data for training, or are they for the more recent search functionality? Or are they enhancing responses with links?

  • Post Author
    piokoch
    Posted March 20, 2025 at 1:33 pm

    Copyright the content and sue those who use it for AI training. I believe there is a lot of low-hanging fruit for lawyers here. I would be surprised if they weren't preparing to hit OpenAI and the like. Very badly. Google got away with its deep-linking issues because publishers, after all, had some interest in being linked from the search engine; here publishers see zero value.

  • Post Author
    phkahler
    Posted March 20, 2025 at 1:36 pm

    So I'll just float an idea again that always gets rejected here. This is yet another problem that could be solved completely by eliminating anonymity by default on the internet.

    To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDoS to spam, to malware infections, to personal attacks and threats will be reduced when anonymity is removed.

    Yes, there are downsides to this idea, but I'd like people to have real conversations around those rather than throw the baby out with the bathwater.

  • Post Author
    throwaway173738
    Posted March 20, 2025 at 1:36 pm

    Does anyone maintain a list of AI products developed without scraping?
