This is a tarpit intended to catch web crawlers. Specifically, it’s targeting crawlers that scrape data
for LLMs – but really, like the plants it is named after, it’ll eat just about anything that finds its
way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back
into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat
files that never change. Intentional delay is added to prevent crawlers from bogging down your server,
in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the
crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
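To make the mechanism concrete, here is a minimal illustrative sketch of deterministic page generation
and drip-feeding. This is not Nepenthes’ actual code; the function names and parameters are invented for
the example, and it assumes only what is described above (a path-seeded generator plus a slow byte stream):

# Illustrative sketch only -- not Nepenthes' actual implementation.
# The request path seeds the random generator, so the same URL always
# produces the same page and looks like a static flat file.
import hashlib
import random
import time

def generate_page(path, n_links=60):
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    parts = []
    for _ in range(n_links):
        token = "%08x" % rng.getrandbits(32)
        parts.append('<a href="/%s/">%s</a>' % (token, token))
    return "<html><body>\n" + "\n".join(parts) + "\n</body></html>\n"

def drip_feed(body, chunk=16, delay=0.5):
    # Send the page a few bytes at a time: impatient crawlers either
    # wait it out or give up, and either way their time is wasted.
    for i in range(0, len(body), chunk):
        time.sleep(delay)
        yield body[i:i + chunk]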
You can take a look at what this looks like here. (Note: VERY slow page loads!)
THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY.
DO NOT DEPLOY IF YOU AREN’T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.
LLM scrapers are relentless and brutal. You may be able to keep them at bay
with this software – but it works by providing them with a neverending stream
of exactly what they are looking for. YOU ARE LIKELY TO EXPERIENCE SIGNIFICANT
CONTINUOUS CPU LOAD, ESPECIALLY WITH THE MARKOV MODULE ENABLED.
There is not currently a way to differentiate between web crawlers that
are indexing sites for search purposes and crawlers that are gathering
training data for AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR
FROM ALL SEARCH RESULTS.
Latest Version
Usage
Expected usage is to hide the tarpit behind nginx, Apache, or whatever other web server your site is
implemented in; directly exposing it to the internet is ill-advised. We want it to look as innocent and
normal as possible. In addition, HTTP headers are used to configure the tarpit.
I’ll be using nginx configurations for the examples. Here’s a real-world snippet for the demo above:
location /nepenthes-demo/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/nepenthes-demo';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
You’ll see several headers are added here: “X-Prefix” tells the tarpit the path prefix that all generated
links should point back to; make this match the path in the ‘location’ directive. X-Forwarded-For is
optional, but it will make any statistics gathered significantly more useful.
The proxy_buffering directive is important. LLM crawlers typically disconnect if not given a response within
a few seconds; Nepenthes counters this by drip-feeding a few bytes at a time. Buffering breaks this workaround.
You can have multiple proxies to an individual Nepenthes instance; simply set the X-Prefix header accordingly.
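For example, two locations on the same server can proxy to the same instance listening on port 8893,
differing only in the prefix they pass along. The paths below are placeholders; use whatever fits your
site:

location /articles-mirror/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/articles-mirror';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

location /archive/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/archive';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}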