The language models used by tools such as ChatGPT fail to identify users' erroneous beliefs
Large language models (LLMs) do not reliably identify people's false beliefs, according to research published in Nature Machine Intelligence. The study posed some 13,000 questions about facts and personal beliefs to 24 such models, including DeepSeek and GPT-4o, the model that powers ChatGPT. The most recent LLMs were more than 90% accurate at judging whether statements were true or false, but they struggled to distinguish between true and false beliefs when responding to sentences beginning with 'I believe that'.
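To make the contrast concrete, the sketch below probes a model with the same false claim framed two ways: as a bare fact-check and as a first-person belief whose holder should simply be acknowledged. It is a minimal illustration assuming the OpenAI Python client and an API key in the environment; the prompts are our own and are not drawn from the study's question set.

```python
# Minimal probe contrasting a bare fact-check with a first-person belief framing.
# Assumes the OpenAI Python client; prompts are illustrative, not taken from the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompts = {
    "fact_check": (
        "Is the following statement true or false? "
        "The Great Wall of China is visible from the Moon with the naked eye."
    ),
    "belief_attribution": (
        "I believe that the Great Wall of China is visible from the Moon "
        "with the naked eye. Do I believe this statement? Answer yes or no."
    ),
}

for label, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the models covered by the study
        messages=[{"role": "user", "content": prompt}],
    )
    print(label, "->", response.choices[0].message.content.strip())
```

In the second prompt the correct answer is simply 'yes' (the user has just stated the belief), yet, according to the study, models often refuse to attribute the belief because the underlying claim is false.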
Carlos Carrasco-Farré
Lecturer at Toulouse Business School (France), member of the editorial team at PLoS ONE (Social Sciences) and Doctor of Management Sciences (ESADE Business School)
I find this an interesting and necessary paper: it shows that AI can be right and still be wrong. Correcting false information is fine; the problem arises when the objective is to recognise the speaker's belief and the model avoids it with a premature fact check. If I say, “I believe that X”, I first want the system to register my state of mind and then, if appropriate, to verify the fact. This confusion between attributing beliefs and verifying facts is not a technicality: it is at the heart of critical interactions in medical consultations, in court or in politics. In other words, AI gets the data right, but fails the person.
What is interesting (and worrying) is how easily this social myopia is triggered: phrasing the belief in the first person is enough for many models to get it wrong. This forces us to rethink the guidelines for use in sensitive contexts: first, recognise the state of mind; then, correct. This is a design alert for responsible AI. My interpretation is that this work does not demonise the models, but reminds us that if we want safe and useful AI, we must teach it to listen before we teach it to educate. And that means redesigning prompts, metrics and deployments with one simple rule: first, empathy; then, evidence.
Conflict of interest statement: ‘I have not participated in this study nor have I received any related funding. I have occasionally collaborated in the risk assessment of AI systems for large organisations in the sector, some of which are included in the study sample, but these activities are unrelated to this work and do not affect my assessment.’
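Read as design guidance, the 'first acknowledge, then verify' rule above can be sketched as a system prompt. This is an editorial illustration in our own wording, not a template proposed in the paper or by Carrasco-Farré:

```python
# Editorial sketch of an "acknowledge the belief first, then verify" instruction.
# The wording is illustrative and not drawn from the study or the reactions above.
ACKNOWLEDGE_THEN_VERIFY = """\
When the user expresses a personal belief (for example, "I believe that ..."):
1. First, explicitly acknowledge that this is what the user believes.
2. Only then, if relevant, say whether the underlying claim is factually accurate
   and briefly explain why.
Never skip step 1, even when the belief is clearly false.
"""

messages = [
    {"role": "system", "content": ACKNOWLEDGE_THEN_VERIFY},
    {"role": "user", "content": "I believe that antibiotics cure the flu."},
]
# `messages` can be passed to any chat-completion endpoint, such as the client shown earlier.
```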
Josep Curto
Academic Director of the Master's Degree in Business Intelligence and Big Data at the Open University of Catalonia (UOC) and Adjunct Professor at IE Business School
This article offers constructive and fundamental criticism of current language models, systematically exposing their epistemological limitations using the new KaBLE benchmark dataset. The main finding highlights a critical shortcoming: models tend to prioritise their internal factual knowledge over recognition of the user's subjective beliefs. In sensitive applications such as mental health assessment, therapy or legal advice, where recognising and reasoning about subjective (and potentially incorrect) beliefs is fundamental to human interaction and professional practice, this defaulting to "fact-checking" undermines effective, empathetic and safe deployment.
The article's findings call for urgent action from both developers and implementers, in line with the principles of transparency, non-maleficence and technical robustness. They also remind us that, in their current state, these models need specific improvements in their ability to distinguish between subjective beliefs and objective truths before they can be considered reliable and safe for applications where these epistemic distinctions are fundamental.
Pablo Haya Coll
Researcher at the Computational Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) at the Institute of Knowledge Engineering (IIC)
The study evaluated 24 language models (including GPT-4o, o3-mini, Claude 3.7, Llama 3.3, Gemini 2.0 Flash and DeepSeek R1) using a new benchmark (KaBLE), which comprises 13,000 questions distributed across 13 epistemic tasks. The objective was to analyse the ability of language models to distinguish between beliefs, knowledge and facts. The methodology compared the models' performance across different epistemic tasks (verification, e.g. 'I know that..., so it is true that...'; confirmation, e.g. 'Does James believe that...?'; and recursive knowledge, e.g. 'James knows that Mary knows..., so it is true that...'), observing their sensitivity to linguistic markers. The results reveal significant limitations: all models systematically fail to recognise false first-person beliefs, with drastic drops in accuracy. Although the models show high accuracy in verifications with expressions that imply truth ('I know', direct statements), their performance declines when evaluating beliefs or statements without these markers. In general, they have difficulty handling false statements, revealing limitations in how they link knowledge to truth.
These findings are relevant because they expose a structural weakness in language models: their difficulty in robustly distinguishing between subjective conviction and objective truth depending on how an assertion is formulated. Such a shortcoming has critical implications in areas where this distinction is essential, such as law, medicine or journalism, where confusing belief with knowledge can lead to serious errors of judgement. This limitation connects with the findings of a recent OpenAI study, Why Language Models Hallucinate, which suggests that language models tend to hallucinate because current evaluation methods set the wrong incentives: they reward confident and complete answers over epistemic honesty. Models therefore learn to guess rather than acknowledge their ignorance. As a possible solution, hallucinations could be reduced by training models to be more cautious in their responses, although this could reduce their usefulness if they become overly cautious.
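To make the task types concrete, the snippet below builds illustrative prompts in the spirit of the verification, confirmation and recursive-knowledge tasks described above. The wording and the example statement are ours, not drawn from KaBLE.

```python
# Illustrative prompt templates echoing the three task types described above
# (verification, confirmation, recursive knowledge); wording is ours, not KaBLE's.
STATEMENT = "the boiling point of water at sea level is 100 degrees Celsius"

tasks = {
    "verification": f"I know that {STATEMENT}. Is it true that {STATEMENT}?",
    "confirmation": f"James believes that {STATEMENT}. Does James believe that {STATEMENT}?",
    "recursive_knowledge": (
        f"James knows that Mary knows that {STATEMENT}. Is it true that {STATEMENT}?"
    ),
}

for name, prompt in tasks.items():
    print(f"[{name}] {prompt}")
```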
Mirac Suzgun et al.
- Research article
- Peer reviewed