Can We Trust Everything AI Says? Google's 'Fact-Checking Ruler,' the FACTS Benchmark

Imagine you’ve hired an expensive private tutor for a very important exam. The tutor answers every question with absolute confidence and fluency. But what if you later found out that 30% of what they said was completely untrue? It would be like being fooled by someone saying, “King Sejong of the Joseon Dynasty created Hangeul using an iPad” because they said it so plausibly.

In the world of artificial intelligence, this situation is called ‘hallucination’ (a phenomenon in which an AI confidently states plausible falsehoods as if they were facts).

Large Language Models (LLMs) like ChatGPT and Gemini have become a major means of delivering information in our daily lives (Source: FACTS Benchmark Suite: a new way to systematically evaluate LLMs’ factuality). The problem, however, has been the lack of a ‘common ruler’ for measuring how accurate and trustworthy the information they output actually is. There were plenty of ‘well-spoken AIs,’ but no proper way to distinguish the ‘honest AIs.’

To solve this problem, Google’s FACTS team and Kaggle, the world-renowned data science platform, have joined forces. They announced the ‘FACTS Benchmark Suite’ (a reference point to fairly measure AI performance), a new tool that systematically measures how accurately an AI speaks based on facts (Source: FACTS Benchmark Suite Introduced to Evaluate Factual Accuracy of Large Language Models - InfoQ).

Why is this important?

Now, when we have a question, we often ask an AI first instead of typing into a search box. We seek advice from AI for everything from tonight’s dinner recipes to complex legal knowledge and even health consultations. Simply put, AI has become our knowledge assistant.

However, if an assistant speaks incorrect information with full confidence as if it were a fact, the damage falls entirely on the user. Incorrect health information or legal interpretations can lead to fatal consequences.

Therefore, evaluating how factually accurate an AI’s responses are goes beyond measuring technical capability; it bears directly on the ‘issue of social trust’: how far we can rely on AI (Source: FACTS Grounding: A new benchmark for evaluating the factuality of large language models). The purpose of the FACTS benchmark is to pinpoint exactly where AI models get things wrong and to improve them, increasing the reliability of the information they provide (Source: FACTS Benchmark Suite Elevates LLM Factuality Scrutiny).

Easy Understanding: AI’s ‘Fact-Check’ Quadrathlon

The FACTS benchmark evaluates an AI’s skills across four different areas, much like an athlete competing in a four-event combined competition (Source: The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality). Let’s look at what each area means through analogies.

1. Parametric: “Pure Memorization Test”

This measures how accurately an AI can answer using only the knowledge stored in its ‘brain’ (its parameters), without any external internet connection (Source: FACTS Benchmark Suite: a new way to systematically evaluate…).

2. Search: “Digital Library Skills”

This evaluates the AI’s ability to search for the latest information in real-time using an internet search function (Search API) and then provide an answer (Source: The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality).

  • Analogy: It is similar to the ability to look up the latest books in a library and write a report based on accurate evidence. The key is not just finding information but distinguishing what the real facts are among the information found.

3. Multimodal: “Observation by Seeing and Understanding”

This checks whether the AI can accurately read factual information from images as well as text (Source: FACTS Benchmark Suite: a new way to systematically evaluate…).

4. Grounding: “Staying Faithful to Given Materials”

This refers to the ability to generate answers based only on a presented document or specific materials (Source: FACTS Grounding: A new benchmark for evaluating the factuality of large language models — Google DeepMind).
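To make the grounding idea concrete, here is a minimal illustrative sketch of how such a check might be structured: supply a context document, take the model’s answer, and verify that every claim in the answer is actually supported by the context. This is not the FACTS evaluation code — the function names are invented, and the word-overlap heuristic stands in for the LLM judges the real benchmark uses.

```python
import string

# Illustrative sketch of a grounding-style check (NOT the official FACTS code).
# An answer counts as "grounded" only if every claim it makes is supported by
# the supplied context document. "Supported" is approximated here by content-word
# overlap; the real benchmark uses LLM judges instead of this toy heuristic.

def _content_words(text: str) -> list[str]:
    """Lowercase words with surrounding punctuation stripped; short words dropped."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if len(w) > 3]

def grounding_score(context: str, answer: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose content words mostly appear in the context."""
    context_words = set(_content_words(context))
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _content_words(sentence)
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)

context = "The FACTS benchmark was announced by Google and Kaggle. It measures factual accuracy."
grounded = "The FACTS benchmark measures factual accuracy."
ungrounded = "The benchmark was cancelled in 1999 by aliens."

print(grounding_score(context, grounded))    # high: claims appear in the context
print(grounding_score(context, ungrounded))  # low: claims not found in the context
```

The key design point the sketch mirrors: grounding is judged against the supplied document only, so a statement that is true in the real world but absent from the context still fails.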

Current Situation: AIs Hitting the ‘70% Wall’

The results of the FACTS benchmark have sounded a major ‘alarm’ in the AI industry. They objectively revealed that even the outstanding AI models the world is currently enthusiastic about hit a ‘70% factuality ceiling’ (Source: The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call).

In simple terms, no matter how smart and capable an AI seems, roughly three out of every ten things it says may be wrong or at odds with the facts. To use an analogy, you would hesitate to entrust all your assets, or your health, to an advisor who gets 3 out of 10 questions wrong. While AI performance evaluation has previously focused mainly on ‘how smoothly the model speaks,’ FACTS applies the colder, stricter yardstick of ‘how faithful it is to the facts’ (Source: Survey on Factuality in Large Language Models: Knowledge…).
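The ‘70% ceiling’ is, at bottom, just a ratio: verified claims divided by total claims judged. A minimal sketch (the record format below is invented for illustration, not the benchmark’s actual schema) of how such a score is tallied:

```python
# Toy tally of a factuality score: verified claims / total claims judged.
# The record format is invented for illustration; the real FACTS pipeline
# relies on LLM judges to label each claim extracted from a model answer.

judged_claims = [
    {"claim": "Hangeul was created under King Sejong", "verified": True},
    {"claim": "King Sejong used an iPad", "verified": False},
]

def factuality_score(claims) -> float:
    """Share of claims judged as verified; 0.70 means 3 in 10 claims fail."""
    if not claims:
        return 0.0
    return sum(c["verified"] for c in claims) / len(claims)

print(factuality_score(judged_claims))  # 0.5: one of two claims verified
```

A model stuck at the ‘70% wall’ would score about 0.70 under this kind of tally, i.e. roughly three unsupported claims out of every ten.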

What Will Happen Next?

The FACTS benchmark does not stop at merely grading and ranking AIs. It operates an online Leaderboard (a bulletin board where the scores of AIs worldwide are disclosed in real-time) to encourage developers around the world to check and improve where their models are lacking (Source: [2512.10791] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality).

In the future, we can look forward to the following positive changes:

  1. More Sophisticated Self-Verification: Before giving an answer, AI will increasingly pause to check itself: “Is there clear evidence for what I am about to say?” (Source: FACTS Grounding: A new benchmark for evaluating the factuality of large language models — Google DeepMind).
  2. Combination of Search and Knowledge: Rather than relying solely on knowledge learned in the past, the standard will be for AI to verify the latest facts through real-time search and clearly present that evidence to the user (Grounding) (Source: The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality).
  3. Securing Expert-Level Reliability: Minimum guidelines will be established for safely introducing AI in fields where a single number or fact is crucial, such as medicine, law, and finance (Source: FACTS Benchmark Suite Elevates LLM Factuality Scrutiny).

AI’s Perspective

MindTickleBytes AI Reporter’s View: “The world is already overflowing with AIs that speak fluently. However, what we really need is the blunt truth rather than sweet lies. The ‘70%’ figure presented by the FACTS benchmark is both a homework assignment we must solve and a mountain we must climb for AI to go beyond being a simple ‘toy’ and be reborn as a true ‘intellectual companion’ for humanity. Honesty is the most powerful performance an AI can possess.”


References

  1. FACTS Benchmark Suite: a new way to systematically evaluate…
  2. [Google Introduces FACTS Benchmark Suite for Evaluating… LinkedIn](https://www.linkedin.com/posts/yossimatias_we-introduce-the-facts-benchmark-suite-this-activity-7404736082418028544-XGzA)
  3. FACTS Grounding: A new benchmark for evaluating the factuality of large language models — Google DeepMind
  4. FELM: Benchmarking Factuality Evaluation of
  5. Survey on Factuality in Large Language Models: Knowledge…
  6. [2512.10791] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
  7. FACTS Benchmark Suite Introduced to Evaluate Factual Accuracy of Large Language Models - InfoQ
  8. FACTS Benchmark Suite Elevates LLM Factuality Scrutiny
  9. The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a …
  10. Assessing Large Language Models’ Factual Accuracy with the FACTS …

FACT-CHECK SUMMARY

  • Claims checked: 22
  • Claims verified: 17
  • Verdict: PASS