OpenAI has just over a week to comply with European data protection laws following a temporary ban in Italy and a slew of investigations in other EU countries. If it fails, it could face hefty fines, be forced to delete data, or even be banned.
But experts have told MIT Technology Review that it will be next to impossible for OpenAI to comply with the rules. That’s because of the way data used to train its AI models has been collected: by hoovering up content off the internet.
In AI development, the dominant paradigm is that the more training data, the better. OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, which ChatGPT is based on, was trained on 570 GB of data. OpenAI has not shared how big the data set for its latest model, GPT-4, is.
But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have started investigations into how OpenAI collects and processes the data powering ChatGPT. They believe it has scraped people’s personal data, such as names or email addresses, and used it without their consent.
The Italian authority has blocked the use of ChatGPT as a precautionary measure, and French, German, Irish, and Canadian data regulators are also investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for data protection authorities, is also setting up an EU-wide task force to coordinate investigations and enforcement around ChatGPT.
Italy has given OpenAI until April 30 to comply with the law. This would mean OpenAI would have to ask people for consent to have their data scraped, or prove that it has a “legitimate interest” in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the power to correct any mistakes about them that the chatbot spits out, to have their data erased if they want, and to object to letting the computer program use it.
If OpenAI cannot convince the authorities its data use practices are lega