Google revealed that artificial intelligence (AI) chatbots achieved an accuracy rate of at best 69% in a recent assessment. The company used its new FACTS Benchmark Suite to test the factual reliability of various AI models.
The benchmark quantifies a critical gap in AI performance, indicating that even leading models frequently produce incorrect information. For sectors such as finance, healthcare, and law, this inaccuracy poses significant risks: erroneous yet confidently delivered responses could cause substantial damage.
The FACTS Benchmark Suite, developed by Google’s FACTS team in collaboration with Kaggle, specifically evaluates factual accuracy across four real-world application areas:
- Parametric Knowledge: This tests a model’s ability to answer fact-based questions using only its pre-trained knowledge.
- Search Performance: This assesses how effectively models leverage web tools to retrieve accurate information.
- Grounding: This measures whether a model adheres to provided documents without introducing false details.
- Multimodal Understanding: This examines the model’s accuracy in interpreting charts, diagrams, and images.
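To make concrete how results from these four areas might roll up into a single headline figure, here is a minimal Python sketch that averages per-domain accuracies with equal weight. The domain keys, the equal weighting, and the example numbers are illustrative assumptions, not Google's published scoring methodology.

```python
# Hypothetical sketch: combining per-domain accuracies into one FACTS-style score.
# The equal-weight average and the example figures are assumptions for illustration,
# not Google's actual aggregation method.

from statistics import mean

# Illustrative per-domain accuracies for one model (fractions, not real data).
domain_accuracy = {
    "parametric_knowledge": 0.75,
    "search_performance": 0.70,
    "grounding": 0.68,
    "multimodal_understanding": 0.47,  # multimodal tends to lag, per the article
}

def facts_style_score(per_domain: dict[str, float]) -> float:
    """Return the unweighted mean accuracy across the four evaluation areas."""
    return mean(per_domain.values())

if __name__ == "__main__":
    score = facts_style_score(domain_accuracy)
    print(f"Overall score: {score:.0%}")  # prints "Overall score: 65%"
```

A sketch like this also makes the article's point visible: a single low domain, such as multimodal understanding, drags the overall figure well below what the model's best areas suggest.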
The assessments highlighted considerable performance differences among models. Gemini 3 Pro achieved the highest FACTS score at 69%, while Gemini 2.5 Pro and OpenAI’s ChatGPT-5 followed at approximately 62%. Grok 4 registered approximately 54%, and Claude 4.5 Opus scored around 51%. Multimodal tasks were consistently the weakest area, with accuracy often below 50%. This weaker performance in interpreting visual data, such as sales graphs or numerical figures in documents, poses a risk of critical errors that are difficult to detect or rectify.
Google stated that while AI technology continues to improve, it requires human oversight, verification, and robust guardrails before users can treat it as a reliable source of truth.