Decrypt – As AI models rapidly consume publicly available online content, experts warn of an impending data crisis: What happens when there’s nothing left to train on?
A recent Copyleaks report found that DeepSeek, a Chinese AI model, often produces responses nearly identical to ChatGPT's, raising concerns that it may have been trained on OpenAI's outputs. The finding has sharpened fears that the era of readily available, high-quality training data is ending.
Google CEO Sundar Pichai acknowledged the challenge in December, cautioning that AI developers are depleting the supply of free, high-quality data. Speaking at the New York Times' DealBook Summit, he noted that while a few companies currently dominate AI, further advances will become increasingly difficult.
Rise of Synthetic Data
With limited access to real-world datasets, AI researchers are turning to synthetic data—artificially generated datasets that mimic real-world information. Though used in statistics since the 1960s, synthetic data is now becoming crucial in AI training as privacy laws and content restrictions tighten.
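To make the idea concrete, the sketch below shows one simple, generic way synthetic data can be produced: fit a parametric model to a sensitive "real" dataset, then sample new records from that model instead of sharing the originals. The dataset, distribution choice, and parameters here are purely illustrative assumptions, not a description of any particular company's pipeline.

```python
import numpy as np

# Hypothetical "real" dataset: transaction amounts we cannot share directly.
rng = np.random.default_rng(42)
real_amounts = rng.lognormal(mean=3.5, sigma=0.8, size=1_000)

# Fit a simple parametric model to the real data (here, a log-normal distribution).
log_vals = np.log(real_amounts)
mu, sigma = log_vals.mean(), log_vals.std()

# Sample synthetic records that mimic the real distribution without copying any record.
synthetic_amounts = rng.lognormal(mean=mu, sigma=sigma, size=5_000)

print(f"real mean:      {real_amounts.mean():.2f}")
print(f"synthetic mean: {synthetic_amounts.mean():.2f}")
```

Real synthetic-data systems use far richer generators (simulators, generative models, differential-privacy mechanisms), but the principle is the same: the statistics of the original data are preserved while the individual records are not.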
Muriel Médard, an MIT professor and co-founder of decentralized memory platform Optimum, explained that synthetic data helps bridge the gap when real-world data is inaccessible. “You either search for more or generate it based on what you have,” she said. However, she noted that retrieving and updating data—especially in decentralized systems—remains a challenge.
Nick Sanchez, Senior Solutions Architect at Druid AI, added that synthetic data will become essential as privacy restrictions grow. “While it’s not a perfect solution—since biases from real-world data can persist—it plays a key role in addressing consent, copyright, and privacy concerns,” he said.
Risks and Concerns
Despite its advantages, synthetic data also carries risks. Sanchez warned that bad actors could manipulate AI models by injecting false information into training sets, particularly in sensitive areas like fraud detection.
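A toy illustration of the kind of manipulation Sanchez describes (assumptions only, not any production system): if an attacker flips a fraction of the labels in a fraud-detection training set, the resulting model quietly gets worse. The dataset, model, and 20% flip rate below are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy "fraud detection" dataset, entirely synthetic and imbalanced like real fraud data.
X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def train_and_score(labels):
    """Train a simple classifier on the given training labels and score it on held-out data."""
    model = LogisticRegression(max_iter=1_000).fit(X_train, labels)
    return model.score(X_test, y_test)

# Baseline model trained on clean labels.
clean_acc = train_and_score(y_train)

# Simulated poisoning: an attacker flips 20% of the training labels.
rng = np.random.default_rng(0)
poisoned = y_train.copy()
flip = rng.choice(len(poisoned), size=int(0.2 * len(poisoned)), replace=False)
poisoned[flip] = 1 - poisoned[flip]
poisoned_acc = train_and_score(poisoned)

print(f"clean accuracy:    {clean_acc:.3f}")
print(f"poisoned accuracy: {poisoned_acc:.3f}")
```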
Blockchain technology could help mitigate these risks by making data integrity verifiable. Médard emphasized that the goal is not to make data literally unchangeable but to ensure that any tampering is detectable. "When people talk about immutability, they really mean durability," she explained.
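One minimal way to get that kind of tamper-evidence, sketched below as a plain hash chain rather than the actual design of Optimum or any specific blockchain, is to chain cryptographic hashes over training-data batches so that modifying an earlier batch changes every digest that follows. The batch contents are hypothetical.

```python
import hashlib
import json

def chain_hashes(batches):
    """Return chained SHA-256 digests, one per training-data batch."""
    prev = b""
    digests = []
    for batch in batches:
        payload = prev + json.dumps(batch, sort_keys=True).encode()
        prev = hashlib.sha256(payload).digest()
        digests.append(prev.hex())
    return digests

# Hypothetical training-data batches.
batches = [
    {"id": 1, "records": ["txn-001", "txn-002"]},
    {"id": 2, "records": ["txn-003"]},
]

original = chain_hashes(batches)

# If anyone alters an earlier batch, every digest from that point on changes.
batches[0]["records"][0] = "txn-999"
tampered = chain_hashes(batches)

print(original[-1] == tampered[-1])  # False: the tampering is detectable
```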
As AI advances, the reliance on synthetic data is expected to grow, raising questions about its reliability, ethical implications, and long-term impact on machine learning.