AI models are still not reliable for unsupervised medical diagnosis
A team from the United States has analysed the performance of 21 large artificial intelligence (AI)-based language models—including ChatGPT, Gemini and Grok—for clinical diagnosis. They conclude that, despite advances, the models’ reasoning capabilities remain limited for initial diagnosis and that they should not be relied upon without the supervision of a medical professional. According to the authors, who published their findings in JAMA Network Open and aimed to “help distinguish reality from hype in the use of these tools”, the results “reinforce the idea that language models in healthcare still require human intervention and very rigorous supervision”.
Susana Manso García
General practitioner, member of the Artificial Intelligence and Digital Health Working Group of the Spanish Society of Family and Community Medicine (semFYC)
Overall, the article demonstrates high methodological quality and notable clinical relevance, although it is not without limitations that should be borne in mind.
Published in JAMA Network Open, it is based on a systematic, transparent and well-structured design.
One of its strengths is the inclusion of 21 state-of-the-art language models, enabling a comprehensive and up-to-date comparison of their current capabilities. Furthermore, the use of realistic clinical cases from the MSD Manual provides an approach that is closer to real-world clinical practice than traditional multiple-choice tests. Added to this is the introduction of an innovative metric, the PrIME-LLM index, which does not merely measure final accuracy but evaluates clinical reasoning in a multidimensional way. The volume of data analysed—more than 16,000 responses—reinforces the robustness of the results. However, as this is a cross-sectional, experimental study, it does not allow conclusions to be drawn about the actual impact on patients or in real clinical settings.
As for how it fits with previous evidence, the study confirms and expands on existing findings regarding language models in medicine. Previous studies had shown that these systems can achieve good results in USMLE-style tests, which generated some optimism regarding their clinical potential. However, this article qualifies that view by demonstrating that good performance on closed-ended questions does not necessarily translate into sound clinical reasoning. In fact, it highlights significant weaknesses previously noted, such as hallucinations, difficulty in handling uncertainty, and a tendency to offer conclusions without adequately justifying the process.
The most significant finding is that, although the models can perform reasonably well in final diagnosis and management proposals, they fail significantly in one of the most critical phases of medical reasoning: the formulation of the differential diagnosis.
This point has important implications. On the one hand, it directly challenges the idea of using these systems as autonomous diagnostic tools. On the other, it reinforces a more cautious approach, in which language models are used to support healthcare professionals, particularly in structured tasks or those involving a lower degree of uncertainty. Furthermore, the proposed PrIME-LLM framework paves the way for more comprehensive future evaluations and could contribute to the development of regulatory standards in this field. In a sense, the article acts as a counterpoint to the enthusiasm based on simplistic metrics and directs attention towards what truly matters in clinical practice: the reasoning process.
However, it is important to interpret the results in light of their limitations. The study was conducted in an experimental setting based on clinical vignettes, which means that fundamental aspects of real-world practice, such as doctor-patient interaction or the contextual complexity of cases, are not assessed. There is also a possibility of data contamination, as the cases used are publicly available and may have formed part of the models’ training data. Furthermore, the systems were evaluated without additional optimisation, that is, without access to external tools, clinical databases or support systems, which may mean their performance in more integrated real-world settings is underestimated. The evaluation of the responses was carried out by medical students, which introduces a certain degree of subjectivity. Finally, the PrIME-LLM metric, although promising, is still recent and has not been widely validated.
From the public’s perspective, the message emerging from the study must be clear and balanced. Language models have significant potential in healthcare: they can help explain medical information, organise data or serve as support for professionals. However, they are not yet reliable as substitutes for doctors, especially in complex situations or at the time of initial diagnosis. The study itself stresses that they should not be used to make clinical decisions without supervision. Therefore, although artificial intelligence represents a promising tool, human clinical judgement remains essential. The recommendation for the public is to use these technologies with caution and, in the event of any health problem, always consult a healthcare professional.
In short, this is a robust and well-designed study that provides relevant evidence at a crucial juncture. Its main contribution is to demonstrate that, despite advances, language models still have significant limitations at the core of clinical reasoning, particularly in managing uncertainty and generating differential diagnoses. This has direct implications for their clinical use, their regulation, and the way society perceives their role in medicine.
Rao et al.
- Research article
- Peer reviewed