
Devs say AI crawlers dominate traffic, forcing blocks on entire countries by LinuxBender


27 Comments

  • Post Author
    ggm
    Posted March 25, 2025 at 9:51 pm

    Entire country blocks are lazy, and pragmatic. The US armed forces at one point blocked AU/NZ on 202/8 and 203/8 on a misunderstanding about packets from China, also from these blocks. Not so useful for military staff seconded into the region seeking to use public internet to get back to base.

    People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)

  • Post Author
    grotorea
    Posted March 25, 2025 at 10:18 pm

    Is this stuff only affecting the not for profit web? What are the for profit sites doing? I haven't seen Anubis around the web elsewhere. Are we just going to get more and tighter login walls and send everything into the deep web?

  • Post Author
    xena
    Posted March 25, 2025 at 10:21 pm

    Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!

  • Post Author
    edoloughlin
    Posted March 25, 2025 at 10:24 pm

    I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.

  • Post Author
    tedunangst
    Posted March 25, 2025 at 10:26 pm

    > It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.

    If the target goes down after you scrape it, that's a feature.

  • Post Author
    ipaddr
    Posted March 25, 2025 at 11:06 pm

    I've had a number of content sites, and I've shut down a few of them in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.

    These were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. But now I have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.

  • Post Author
    hoaxminion
    Posted March 25, 2025 at 11:11 pm

    [dead]

  • Post Author
    egypturnash
    Posted March 25, 2025 at 11:51 pm

    This is 100% off-topic but:

    Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.

  • Post Author
    101008
    Posted March 25, 2025 at 11:54 pm

    I have a blog with content about a non-tech topic and I haven't had any problem. I am all against AI scrapers, but I didn't notice any change being behind Cloudflare (funnily enough, if I ask GPT and Claude about my website, they know it)

  • Post Author
    hbcondo714
    Posted March 25, 2025 at 11:57 pm

    > many AI companies engage in web crawling

    Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.

  • Post Author
    superkuh
    Posted March 26, 2025 at 12:01 am

    While the motivations may be AI related the cause of the problem is the first and original type of non-human person: corporations. Corporations are doing this, not human persons, not AI.

  • Post Author
    j45
    Posted March 26, 2025 at 12:03 am

    Surprised at how crawling as a whole seems to have taken a sustained step backwards, with old, already-solved best practices being new to devs today.

    I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.

    Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.

  • Post Author
    eevilspock
    Posted March 26, 2025 at 12:14 am

    The irony is that the people and employees of the AI companies will vehemently defend the morality of capitalism, private property and free markets.

    Their robber baron behavior reveals their true values and the reality of capitalism.

  • Post Author
    throwaway81523
    Posted March 26, 2025 at 12:53 am

    Saaay, what happened to that assistant US attorney who prosecuted Aaron Swartz for doing exactly this? Oh wait, Aaron didn't have billions in VC backing. That's the difference.

  • Post Author
    haswell
    Posted March 26, 2025 at 1:57 am

    Lately I’ve been thinking a lot about what it would take to create a sort of “friends and family” Internet using some combination of Headscale/Tailscale. I want to feel free and open about what I’m publishing again, and the modern internet is making that increasingly difficult.

  • Post Author
    rco8786
    Posted March 26, 2025 at 1:59 am

    I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.

    Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.

  • Post Author
    CyberDildonics
    Posted March 26, 2025 at 2:07 am

    Can't you just rate limit beyond what a person would ever notice and do it by slowing the response?
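
A minimal sketch of the idea in the comment above: instead of rejecting a client, the server delays its responses once it exceeds a request budget. The class name `SlowdownLimiter`, its parameters, and the default numbers are all hypothetical, chosen only for illustration; the caller supplies the clock value so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlowdownLimiter:
    """Per-client throttle that returns a delay instead of a rejection.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, free_requests=5, window_seconds=10.0,
                 delay_step=0.5, max_delay=30.0):
        self.free_requests = free_requests    # requests per window served with no delay
        self.window_seconds = window_seconds  # length of the sliding window
        self.delay_step = delay_step          # extra seconds per request over budget
        self.max_delay = max_delay            # upper bound on any single delay
        self._hits = defaultdict(deque)       # client id -> recent request timestamps

    def delay_for(self, client_id, now):
        """Record one request at time `now` and return the delay in seconds."""
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        hits.append(now)
        excess = len(hits) - self.free_requests
        if excess <= 0:
            return 0.0
        return min(excess * self.delay_step, self.max_delay)
```

A handler would sleep for the returned number of seconds (for example with `time.sleep` or `asyncio.sleep`) before sending the response. One caveat with this approach: every delayed response keeps a connection open on the server, so the slowdown is not free for the site either.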

  • Post Author
    ANarrativeApe
    Posted March 26, 2025 at 2:12 am

    Excuse my ignorance, but is it time to update the open source licenses in the light of this behavior?
    If so, what should the evolved license wording be?

    I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt…

  • Post Author
    nashashmi
    Posted March 26, 2025 at 2:12 am

    And this is why the Internet has become a maze of CAPTCHAs.

  • Post Author
    kazinator
    Posted March 26, 2025 at 2:26 am

    I've been seeing crawlers which report an Agent string like this:

      Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3405.80 Safari/537.36
    

    Everything but the Chrome/ is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string. Always some two digit main version like 69, 70. Then a .0. and then some funny minor and build numbers: typically a four digit minor.

    When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.

    The attack quickly abated.
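
The pattern described above can be approximated with a regular expression. This is only a sketch, not the commenter's actual rewrite rule: the exact digit counts (two-digit major version, four-digit minor, one to three build digits) are assumptions read off the single example string, and the name `is_suspect` is hypothetical.

```python
import re

# Everything is fixed except the Chrome version, which is assumed to look like
# "<two digits>.0.<four digits>.<one to three digits>".
SUSPECT_UA = re.compile(
    r"^Mozilla/5\.0 \(Windows NT 6\.2; Win64; x64\) "
    r"AppleWebKit/537\.36 \(KHTML, like Gecko\) "
    r"Chrome/\d{2}\.0\.\d{4}\.\d{1,3} Safari/537\.36$"
)


def is_suspect(user_agent: str) -> bool:
    """Return True if the User-Agent header matches the pattern above."""
    return SUSPECT_UA.match(user_agent) is not None
```

A server could use such a check to route matching requests to a honeypot page. Note that a real, older Chrome build on Windows 8 could produce a string of the same shape, so a rule like this risks false positives.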

  • Post Author
    userbinator
    Posted March 26, 2025 at 2:40 am

    All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions with answers that LLMs cannot yet figure out or get consistently wrong. The more related to the site's content the questions are, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.

    IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.

  • Post Author
    jrvarela56
    Posted March 26, 2025 at 2:56 am

    This is going to start happening to brick and mortar businesses through their customer support channels

  • Post Author
    instagib
    Posted March 26, 2025 at 3:05 am

    Can we force the bots to mine cryptocurrency?

  • Post Author
    Nckpz
    Posted March 26, 2025 at 3:10 am

    I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"

  • Post Author
    internet101010
    Posted March 26, 2025 at 3:53 am

    I block all traffic except that which comes from the country of Cloudflare.

  • Post Author
    rfurmani
    Posted March 26, 2025 at 3:57 am

    After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
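
For context on the robots.txt restriction mentioned above, here is a small sketch of how a well-behaved crawler would consult those rules, using Python's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the helper name `may_fetch` are made up for illustration; the check is purely voluntary, which is why crawlers that ignore robots.txt are unaffected by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely and keep
# every other agent away from a dynamic search endpoint.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```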

  • Post Author
    yieldcrv
    Posted March 26, 2025 at 4:19 am

    nice, a free way to keep our IPFS pins alive


© 2025 HackTech.info. All Rights Reserved.
