On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack.
He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.
“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”
OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions.
“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.
“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”
Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.
It sells the 3D object files, as well as photos (everything from hands to hair, skin, and full bodies), to 3D artists, video game makers, and anyone else who needs to digitally recreate authentic human characteristics.
Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. To keep the crawler out, websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)
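For readers who run their own sites, here is a rough sketch of what such a configuration might look like and how to check it. The robots.txt content below is an illustrative assumption, not Triplegangers’ actual file, and the example.com URL and the `is_allowed` helper are hypothetical. Only the three user-agent names come from the reporting above. The check uses Python’s standard-library `urllib.robotparser`, which applies the rules the way a compliant crawler is expected to; it says nothing about whether a given bot actually complies.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not any real site's file).
# Each block names one crawler and disallows the whole site for it;
# every other crawler falls through to the final wildcard block.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for this sketch: it parses the rules from a
    string rather than downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/product/1"  # hypothetical product page
    for bot in ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

Running a check like this before deploying the file is a cheap way to catch a misspelled user-agent name or a missing `Disallow` line, either of which would leave the site open to the crawler.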
Robots.txt, otherwise known as the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when configured with its own set of