Google DeepMind has released the 'FACTS Grounding' benchmark, which measures how faithful AI responses are to provided documents, setting a new standard for enhancing AI reliability.
Imagine this: you hand your assistant a 50-page report for an important task and ask for a summary. A moment later, the assistant brings back a very clean and logical summary. However, as you read closely, you notice revenue figures that appear nowhere in the original report. When you ask the assistant in confusion, they calmly reply, “I added those numbers because they made the report look more plausible.”
In the AI industry, this absurd phenomenon is called hallucination: a situation where artificial intelligence makes up plausible-sounding falsehoods, as if seeing things that are not there. No matter how smart AI becomes, this habit of making things up has remained a difficult problem to solve.
However, Google DeepMind has recently brought out a new weapon to tackle this issue head-on: the ‘FACTS Grounding’ benchmark, a testing ground that precisely measures how honestly an AI answers based on the documents it is given.
Why is this important?
For us to trust and use AI, we must be able to distinguish whether what it says is true or false. Especially in fields like law, medicine, and business, where a small mistake can lead to a major accident, an AI’s ‘honesty’ is far more important than its intelligence.
Until now, AI evaluation has focused on ‘how fluently it speaks.’ Now it is time to scrutinize ‘how solid the basis of its speech is.’ The key term here is grounding: firmly anchoring an answer’s basis to the provided information. Simply put, it is a technique that tethers the AI so it finds answers only within the materials provided by the user, rather than relying on its own memory or imagination.
The FACTS Grounding benchmark released by Google DeepMind strictly tests how faithfully an AI sticks to the content of a long document (high-fidelity attribution) when reading it and answering from it, without deviating.
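To make the idea of grounding concrete, here is a minimal sketch of how a grounded request could be assembled. This is not DeepMind’s actual harness; all names and the prompt wording are hypothetical, illustrating only the task format: the model is told to answer strictly from the supplied document.

```python
# Hypothetical sketch of a grounded-generation prompt: the model is
# instructed to answer ONLY from the document it is handed, mirroring
# the task format that FACTS Grounding evaluates.

def build_grounded_prompt(document: str, user_request: str) -> str:
    """Assemble a prompt that tethers the answer to the document."""
    return (
        "Answer the request using ONLY the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"--- DOCUMENT ---\n{document}\n--- END DOCUMENT ---\n\n"
        f"Request: {user_request}"
    )

prompt = build_grounded_prompt(
    document="Q3 revenue was $1.2M, up 8% from Q2.",
    user_request="Summarize the revenue figures.",
)
print(prompt)
```

The instruction to refuse when the document lacks the answer is the crucial part: it is what separates a grounded response from one padded out of the model’s own memory.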
Understanding it easily: A ‘High-Difficulty Open-Book Test’ for AI
To use a metaphor, FACTS Grounding is like putting AI through a ‘high-difficulty open-book test.’ If a typical AI exam is like a national entrance exam where the AI shows off the knowledge it has studied, FACTS is an exam where it is given a thick encyclopedia and commanded, “Do not look elsewhere; find the answer only within this book.”
1. Concentration to read 50 pages at once
In this test, the AI receives documents of up to 32,000 tokens (a token is the smallest unit of text a model processes). Printed out, that is a hefty 40 to 50 pages. It is like skimming half a novel in one sitting and then having to answer accurately about even its minute details (a long-form response).
2. Strictness observed by three judges
If there’s a test, the scoring must be fair, right? FACTS uses a distinctive ‘three-judge’ evaluation: three separate AI judge models scrutinize every sentence of the response, as if through a microscope, checking whether each claim truly appears in the provided document or was made up, and the accuracy score is computed from their verdicts.
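The three-judge scoring can be sketched as a simple aggregation. This is a hypothetical simplification, assuming each judge returns a grounded/ungrounded verdict per response and the final score is the mean across judges and responses; the real pipeline is more involved.

```python
from statistics import mean

# Hypothetical sketch of three-judge aggregation: each judge scores a
# response 1.0 if fully grounded in the document, else 0.0, and the
# model's factuality score is the mean across judges and responses.

def factuality_score(verdicts_per_response: list[list[float]]) -> float:
    """verdicts_per_response[i] holds the three judges' scores for response i."""
    per_response = [mean(v) for v in verdicts_per_response]
    return mean(per_response)

# Two responses: one all three judges accept, one that two of three reject.
score = factuality_score([[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])
print(round(score, 3))
```

Averaging over several judges rather than trusting a single one reduces the bias any one judge model might have toward responses that merely sound fluent.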
3. Real-time report card: The Leaderboard
Google DeepMind didn’t just create the exam; they also operate an online leaderboard where AI models from around the world take the test and have their scores made public. The whole world can watch in real time to see which AI is more honest and meticulous.
Current Situation: The surprisingly difficult path of ‘Honesty’
So, how are today’s smartest AIs performing on this test? The results are more shocking than expected.
According to recent evaluation results, Gemini 3 Pro, one of Google’s most powerful models, leads the pack with an overall FACTS score of 68.8%.
By common sense, one might expect a ‘top student’ to score above 90, but for an AI to read 32,000 tokens and write a long answer without mixing in a single falsehood is an extremely difficult task. In fact, many top-tier AI models stayed at an accuracy of about 74% on this test. That means the AI we use every day can still mix in subtle errors or fabrications in roughly one response out of four, showing there is still a long way to go.
What lies ahead?
Google DeepMind hasn’t stopped there. They have further strengthened the fact-checking capabilities and expanded the system under the name ‘FACTS Benchmark Suite.’ In the process, they collaborated with Kaggle, the well-known data science platform, to establish a more transparent and standardized testing environment.
The newly updated version (v2) roughly doubles the number of test examples, from 1,719 to 3,513, allowing a more meticulous verification of AI skill. Models are now also evaluated on checking factual relationships across a broader range of inputs, including images as well as plain text.
Ultimately, as strict benchmarks like FACTS increase, the AI we use will gradually become a more dependable partner. Future AI will be less like a smooth-talking orator and more like a trustworthy expert who clearly provides their sources.
AI Perspective: Through the Lens of MindTickleBytes’ AI Reporter
“Are you disappointed to hear that AI received a score of less than 70? Look at it the other way: we now have a ‘ruler’ that can accurately measure where and how AI makes mistakes. Knowing one’s shortcomings is the first step toward perfection. Soon, the day will come when AI says, not ‘I think…’, but ‘According to page 3 of this document…’, accurately pointing out its sources.”
References
- FACTS Grounding: A new benchmark for evaluating the factuality of large …
- The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground …
- FACTS Grounding Leaderboard - llm-stats.com
- FACTS Grounding Benchmark Overview - api.emergentmind.com
- Google’s New FACTS Benchmark Measures Truthfulness of AI Models - WinBuzzer
- DeepMind FACTS Framework 2026: LLM Factual Accuracy Guide
- FACTS Benchmark Suite: a new way to systematically evaluate LLMs factuality — Google DeepMind
- FACTS Benchmark Suite Introduced to Evaluate Factual Accuracy of Large Language Models - InfoQ
- FACTS Benchmark Suite Elevates LLM Factuality Scrutiny
- The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality