AI Keeps Lying. The Fix Isn’t a Better Model. It’s Better Data.
OpenAI formally admitted it in 2025: language models are structurally rewarded for guessing rather than saying “I don’t know.” A growing field of blockchain-based data infrastructure projects believes the real fix happens before the model is ever trained—at the data layer itself.
The Test That Punishes Honesty
Picture a standardized exam where leaving a question blank scores zero, but taking a guess gives you a chance at a point. Rational test-takers guess. Always. Now imagine your AI is trained on exactly that exam, at scale, across billions of questions. You haven’t built a truthful system—you’ve built an optimized guesser.
This isn’t a metaphor. It is the exact structural critique published by OpenAI researchers in September 2025. Their paper, Why Language Models Hallucinate, argues that models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty. Saying “I don’t know” scores zero. Confidently guessing wrong at least has a chance of scoring something—so over thousands of benchmark questions, guessing pays.
“Hallucinations are not a mysterious artifact of neural networks. They are a predictable outcome of how we train and evaluate language models.”— Kalai, Nachum, Vempala, Zhang · OpenAI Research, 2025
The paper’s core insight is structural, not incidental: accuracy-only leaderboards dominate how the entire field evaluates models. On those scoreboards, a model that guesses boldly—and occasionally gets lucky—outranks a model that abstains with honest uncertainty. The scoring hasn’t fundamentally changed, so neither has the behavior.
For a question a model doesn’t know—say, a specific person’s birthday—guessing “September 10” gives a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Multiplied across millions of training examples, the statistical pressure to guess is enormous. This is not a prompt engineering problem. It is baked into how AI is scored.
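A toy expected-score calculation makes that pressure concrete. The reward values below are illustrative assumptions for an accuracy-only grading scheme, not numbers taken from the paper:

```python
# Toy expected-score comparison under accuracy-only grading.
# Illustrative assumption (not from the paper): a correct answer scores 1,
# while a wrong answer and "I don't know" both score 0.

P_CORRECT = 1 / 365  # chance of guessing an unknown birthday correctly

def expected_scores(p_correct: float) -> tuple[float, float]:
    """Return (expected score if guessing, score if abstaining)."""
    guess = p_correct * 1.0 + (1 - p_correct) * 0.0  # wrong guesses cost nothing
    abstain = 0.0                                    # honesty earns nothing
    return guess, abstain

guess, abstain = expected_scores(P_CORRECT)
print(f"E[guess]   = {guess:.5f}")   # ~0.00274
print(f"E[abstain] = {abstain:.5f}") # 0.00000 -> guessing always wins
```

Under this scheme the expected value of guessing is small but strictly positive, while abstaining is always zero, so a score-maximizing model should never abstain.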
Scaling Hasn’t Fixed It
OpenAI acknowledges that GPT-5 significantly reduces hallucinations, especially in reasoning tasks, but confirms they still occur. The Vectara FaithJudge Leaderboard in 2025 put grounded hallucination rates at roughly 15–16% for GPT-4o and Claude 3.7 Sonnet, with Gemini 2.5 Flash around 6%. Those are meaningful improvements. They are not solutions. Even a 6% hallucination rate, considered excellent by benchmark standards, translates into serious operational errors at scale—a corrupted field in a medical record, a fabricated legal citation, a wrong fact embedded in a financial report.
More parameters did not fix this. Larger context windows did not fix this. The problem isn’t the model’s size—it’s the incentive to guess, which is written into the scoring system that shapes training.
Garbage In, Hallucinations Out
Training incentives are one dimension of the hallucination problem. Data quality is another—and arguably the deeper one. When models are trained on vast swaths of the internet, they absorb noise, contradictions, outdated facts, fabrications, and synthetic text at scale. The model has no reliable way to distinguish between a peer-reviewed paper and a confidently written blog post that happens to be completely wrong.
OpenAI researchers describe this as the GIGO problem—Garbage In, Garbage Out. Post-training techniques such as RLHF can reduce errors like common misconceptions and conspiracy theories, but they cannot fundamentally undo what was baked in during pretraining. If the data layer is polluted, the model is polluted. No amount of fine-tuning fully reverses that.
The bias problem is equally serious. AI trained on data that skews toward wealthy, English-speaking markets will perform well in those markets and fail quietly everywhere else. Markus Levin of XYO has pointed to a concrete example: when the COIN App was translated into Amharic—spoken by 57 million people in Ethiopia—ChatGPT’s translations were riddled with errors. Not because the model was broken, but because Ethiopia was not a priority data market. The training signal simply wasn’t there.
AI trained on high-quality, verified, structured data has produced breakthroughs in science and academia that seemed impossible a decade ago—protein folding, drug discovery, climate modeling. The models evolved, but so did the data. The two are inseparable. Better data is not a nice-to-have for AI. It is the single highest-leverage input in the stack.
The Case Against Blockchain as a Data Fix
Not everyone is convinced that decentralized infrastructure is the right answer. The counterarguments deserve a fair hearing.
Retrieval-Augmented Generation (RAG) already helps. Many AI deployments now use RAG pipelines—anchoring model responses to retrieved documents rather than relying on baked-in training data. This reduces hallucinations significantly in enterprise contexts. Stanford’s 2025 legal RAG reliability work showed meaningful gains in accuracy for grounded tasks. The argument: you don’t need a new blockchain; you need better retrieval architecture.
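The pattern itself is simple, even if production pipelines are not. The sketch below is a minimal illustration of the RAG idea, not any particular vendor’s stack; the corpus, the word-overlap scoring, and the prompt template are hypothetical placeholders standing in for embeddings, a vector store, and a real prompt:

```python
# Minimal illustration of the RAG pattern: retrieve supporting text first,
# then instruct the model to answer only from what was retrieved.
# Everything here is a placeholder for a real retrieval stack.

CORPUS = {
    "policy_2024.txt": "Refunds are issued within 30 days of purchase.",
    "faq.txt": "Support is available Monday through Friday, 9am to 5pm.",
}

def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus.values(),
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str) -> str:
    """Anchor the model to retrieved context instead of its training data."""
    context = "\n".join(retrieve(query, CORPUS))
    return (f"Answer using ONLY the context below. "
            f"If the context is insufficient, say \"I don't know.\"\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

print(build_grounded_prompt("How long do refunds take?"))
```

The key move is the last step: the model is asked to answer from retrieved evidence and to abstain when that evidence is missing, which is exactly the behavior accuracy-only training discourages.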
Benchmark reform may be sufficient. OpenAI’s own paper proposes a targeted fix: change how accuracy-only scoreboards are weighted so that abstaining scores better than confident wrong answers. If the field adopts uncertainty-aware evaluation, the training incentive to guess diminishes—without needing any new data infrastructure at all.
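What that reweighting looks like can be sketched in a few lines. The penalty value below is an assumed illustration, not a number prescribed by the OpenAI paper:

```python
# Sketch of uncertainty-aware grading: confident wrong answers are penalized,
# abstentions are not. The -0.25 penalty is an assumed illustrative value.

REWARD_CORRECT = 1.0
PENALTY_WRONG = -0.25
REWARD_ABSTAIN = 0.0

def should_guess(p_correct: float) -> bool:
    """Guess only if the expected score beats abstaining."""
    expected_guess = p_correct * REWARD_CORRECT + (1 - p_correct) * PENALTY_WRONG
    return expected_guess > REWARD_ABSTAIN

for p in (1 / 365, 0.10, 0.25, 0.60):
    print(f"confidence {p:.3f} -> {'guess' if should_guess(p) else 'abstain'}")

# Break-even confidence: 1.25 * p - 0.25 = 0  =>  p = 0.20.
# Below 20% confidence, an honest "I don't know" now scores better than a guess.
```

Once wrong answers carry a cost, abstaining becomes the rational choice below a confidence threshold, which is the behavioral change the paper argues benchmarks should reward.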
Blockchain adds complexity without guaranteeing quality. Cryptographic proof of origin tells you where data came from, not whether what was recorded was true. A network of nodes can corroborate a false reading just as easily as a true one if the sensors or participants are compromised. Garbage In, Garbage Out applies to DePIN systems too.
RAG, better benchmarks, and RLHF-based fine-tuning are all genuine improvements and are already reducing hallucination rates in production systems. None of them, individually or in combination, has yet eliminated the problem—particularly for low-resource languages, niche domains, and real-time physical data. That gap is where verified data infrastructure makes its case.
The honest answer is that these approaches are complementary, not competing. Better benchmarks fix the training incentive. RAG grounds outputs in documents. Verified on-chain data grounds those documents in reality. Each layer addresses a different failure mode. The question isn’t which one wins—it’s which combination gets AI closest to reliable.
Who Else Is Building the Data Layer
XYO is not alone in recognizing that verified, decentralized data infrastructure is a missing piece of the AI stack. A cluster of projects has been converging on the same thesis from different angles.
Ocean Protocol
Ocean Protocol focuses on data marketplaces—enabling individuals and organizations to publish, share, and monetize datasets while maintaining control. Its model addresses a different angle of the same problem: not just verifying data provenance, but creating economic incentives for high-quality data contributors to participate in the first place. For AI training, a well-structured data marketplace with verified provenance is a meaningful step toward cleaner inputs.
Chainlink
Chainlink’s oracle network is arguably the most battle-tested decentralized data verification layer in production. Its core function is bridging off-chain real-world data—price feeds, weather data, sports scores, financial events—onto blockchains in a tamper-resistant way. While Chainlink’s primary use case has been DeFi smart contracts, the infrastructure directly applies to AI: verified, real-time external data feeds that a model can query with known provenance rather than guessing from training data.
Filecoin & The Storage Layer
Filecoin approaches the problem from persistence: decentralized, verifiable storage of data at scale. If AI training datasets can be stored and retrieved from a verifiable, censorship-resistant layer, it becomes harder to corrupt or quietly alter training inputs over time. Combined with provenance tracking, decentralized storage is a foundational piece of any serious verified-data architecture.
Ceramic Network
Ceramic is building a decentralized data streaming protocol—a layer for mutable, user-controlled data that remains verifiable across applications. Where most blockchain data is static, Ceramic enables dynamic, updateable data streams with identity and provenance attached. For AI applications that need fresh, real-world signals rather than stale training snapshots, this is an important architectural piece.
These projects approach data verification from different angles—marketplaces, oracles, storage, streaming—but they share a core conviction: that unverified, centrally curated data is a structural weakness in the AI stack, and that decentralized infrastructure can address it in ways that no single company controlling its own data pipeline can. The field is early, fragmented, and competitive. But the direction is coherent.
From Raw Reality to Verified Truth
XYO and the Case for Physical Data Verification
Among the projects building verified data infrastructure, XYO occupies a specific and significant niche: real-world physical data. Founded in 2018 as the first DePIN project, its network now spans over 10 million nodes—smartphones, IoT sensors, and edge devices—across nearly every country on earth. Each node participates in a process called bound witnessing, where multiple independent nodes corroborate the same physical event cryptographically, making any single data point extremely difficult to falsify.
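The general shape of that idea can be sketched briefly. This is a deliberately simplified illustration of multi-party attestation, not XYO’s actual bound-witness protocol, data structures, or cryptography; the node keys and payload fields are hypothetical:

```python
import hashlib
import hmac
import json

# Simplified sketch of multi-party corroboration: two nodes that observe the
# same event each attest to one shared payload, so neither can later alter it
# unilaterally without invalidating the other's attestation.
# Keys and fields below are hypothetical; this is not XYO's protocol.

def attest(node_key: bytes, payload: dict) -> str:
    """Return a node's attestation (HMAC) over the canonical payload."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(node_key, canonical, hashlib.sha256).hexdigest()

observation = {"event": "device_at_location", "lat": 9.03, "lon": 38.74,
               "timestamp": 1726500000, "witnesses": ["node_a", "node_b"]}

record = {
    "payload": observation,
    "attestations": {
        "node_a": attest(b"node_a_secret", observation),
        "node_b": attest(b"node_b_secret", observation),
    },
}

# Verification: recompute each attestation; any edit to the payload breaks both.
assert record["attestations"]["node_a"] == attest(b"node_a_secret", observation)
assert record["attestations"]["node_b"] == attest(b"node_b_secret", observation)
print(json.dumps(record, indent=2))
```

The point of the pattern is that falsifying a single data point requires compromising every witness that signed it, which is what makes independent corroboration harder to fake than a single self-reported reading.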
In September 2025, XYO launched XYO Layer One—the first blockchain designed from the ground up for data-heavy industries. After seven years of operations, the team concluded no existing chain could handle the volume, latency, and validation requirements of real-world physical data at scale. So they built their own, with a dual-token model: $XYO for governance and staking, $XL1 for gas and transactions.
Real Deployments, Real Stakes
What distinguishes XYO from many blockchain infrastructure plays is that it was generating revenue and real-world traction years before the AI data conversation caught up to it. The network generated $8.8 million in revenue in 2024, with 80% of users outside the crypto ecosystem entirely. In March 2026, XYO partnered with climate analytics firm Resiliocs to provide cryptographic verification for environmental and geospatial data used in climate risk modeling—an application where data accuracy is legally and financially material. In December 2025, Revolut, with a reported $75 billion valuation, listed $XYO—the first major fintech to add a DePIN token.
What XYO is specifically good at—physical location verification, real-world event attestation, Proof of Origin for sensor data—is exactly the category of data that current AI training pipelines handle worst. Language models can approximate text. They cannot approximate the physical world. That is the gap XYO is positioned to fill.
XYO is not trying to solve every dimension of the hallucination problem. It is building the physical data verification layer that projects like Ocean Protocol (marketplaces), Chainlink (oracles), and Filecoin (storage) don’t specialize in. In a mature verified-data ecosystem, these layers are complementary. XYO’s edge is the depth and scale of its physical-world node network—10 million strong, built over seven years, before anyone was calling it AI infrastructure.
The Data Layer Is the Unlock
The AI industry has spent enormous resources making models larger, smarter, and better at reasoning. Those investments have paid off. But the hallucination problem has persisted because its root cause was never primarily about model architecture. It was about incentive structures and data quality—two things that no amount of additional compute directly fixes.
Benchmark reform, RAG pipelines, and fine-tuning are real improvements that are already helping in production systems. But they operate on top of a data foundation that remains largely unverified, unaudited, and biased toward the markets that happened to generate the most internet text. That foundation is what the DePIN and blockchain data layer is trying to fix.
It is early. The ecosystem is fragmented. Chainlink, Ocean Protocol, Filecoin, Ceramic, and XYO are each approaching one corner of a large problem, and none of them has yet become the dominant infrastructure standard for AI data verification. That race is still open.
“Reliable data is now the most valuable resource, yet the infrastructure to verify and process it on-chain still did not exist. That is why we built XYO Layer One from the ground up.”— Markus Levin, Co-Founder, XYO · BeInCrypto, October 2025
What is no longer early is the problem itself. Hallucination rates of 6–16% across frontier models are not acceptable for high-stakes applications. The costs—fabricated legal citations, corrupted medical data, biased outputs in underserved communities—are real and documented. The question for AI’s next decade is not whether models will get more capable. They will. The question is whether the data underneath them can be trusted.
The answer starts before the model. It starts with the data. And the infrastructure to verify that data at global scale is being built right now—by a field that, until recently, was mostly ignored by the AI conversation. That’s changing fast.
