An advanced AI model matches or outperforms doctors at medical diagnosis in a study using clinical cases and A&E data

The use of artificial intelligence (AI) in medical diagnosis centres on computing and data processing. Research published in Science assesses the diagnostic capabilities of an advanced large language model, which managed to match or outperform human professionals. The team carried out six experiments involving both standardised clinical cases and a study using real cases from emergency department records, using the performance of hundreds of doctors as a benchmark. The AI proved particularly useful in situations of uncertainty, such as the initial stages of triage in the emergency department. However, the authors highlight that the model only processed text, whereas clinical practice also relies on visual and auditory cues.


30/04/2026 - 20:00 CEST
Expert reactions


Ignacio Miranda Gómez

Head of the Breast Imaging Unit at the International Breast Cancer Center (IBCC) and at the Teknon Medical Center in Barcelona.

Science Media Centre Spain

The study examines whether an advanced language model (LLM) can perform clinical reasoning tasks at the same level as doctors. The main finding is that the model matches or outperforms professionals in various tests, including in some real-life emergency situations.

To evaluate it, the researchers compared the model with hundreds of doctors across six types of tasks: diagnosis in complex cases, explanation of clinical reasoning, treatment decisions, classic diagnostic cases, probability estimation and real-life emergency situations.

The results show very high performance: the model correctly diagnoses in the majority of cases (up to almost 98% when including near-miss diagnoses), correctly selects medical tests, achieves near-perfect scores in clinical reasoning, and outperforms doctors in treatment decisions. It also shows comparable or superior performance in emergency situations, particularly in the early stages when information is scarce.

However, the study has significant limitations: it is text-based only, uses cases that are more structured (‘cleaner’) than real-world practice, does not cover all areas of medicine, and does not replace comprehensive clinical judgement.

In conclusion, these models already exceed many classic standards of medical reasoning and could improve diagnosis and decision-making. Even so, they need to be validated in real-world settings and we need to define how to integrate them safely.

The central idea is not to replace the doctor, but to use AI as a powerful support tool, particularly in complex or uncertain situations.

The study is of high quality. It is well designed, compares directly with doctors, includes different types of tests and even real-life emergency cases. Even so, it is not definitive evidence but a solid demonstration of capability under controlled conditions.

As I said, it has some significant limitations. It only analyses text (without physical examination or imaging), uses cases that are more orderly than in real clinical practice and does not measure whether it improves patient outcomes. Furthermore, the comparison with doctors is somewhat artificial and does not delve into critical errors. In short, it assesses theoretical performance rather than actual clinical practice.

In terms of implications, it confirms that AI is already competitive in medical cognitive tasks and improves upon what has been seen in previous studies. However, real clinical trials, safety validation and evidence of patient impact are still needed before it can be widely adopted.

As I mentioned, the most realistic integration is not to replace doctors, but to use AI as a support for a second opinion, an alert system, reasoning aid and triage support, especially in high-pressure situations with limited information. The key is to use it as a ‘co-pilot’, not autonomously.

The doctor’s role changes, but remains essential. Less emphasis will be placed on memorising or listing diagnoses, and more on integrating complex information, making decisions, dealing with patients and supervising the AI. Overall, the most likely scenario is that the doctor + AI combination will clearly outperform either one on its own.

The author has declared they have no conflicts of interest


Ewen Harrison

Professor of Surgery and Data Science and Co-Director Centre for Medical Informatics, University of Edinburgh

Science Media Centre UK

This is an important study showing that modern AI systems can be good at one of the central tasks of doctors and nurses: taking the information available about a patient and suggesting which diagnoses should be considered.

This matters: these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important.

But this does not mean AI should be quickly ushered into clinical care without limits. Producing a good list of possible diagnoses is not the same as improving patient care. We still need studies showing that these tools help doctors and nurses make better decisions, reduce harm, avoid unnecessary tests, and work safely in busy hospitals and GP practices.

This study moves the field forward, but it does not by itself change clinical practice. The responsible route is not to ban these systems, but also not to let them drift into casual use. They should be tested in real clinical settings, used as second-opinion tools rather than replacements for clinicians, and monitored against the outcomes that actually matter to patients: better, safer, quicker care.

Conflict of interest: “The senior authors and I are editors at NEJM AI.”



Wei Xing

Assistant professor in the University of Sheffield’s School of Mathematical and Physical Sciences

Science Media Centre UK

This is one of the largest evaluations of LLMs in clinical reasoning to date, and the inclusion of real emergency department data is a genuine step forward. Two findings in the paper, however, deserve more scrutiny than they received. In one management reasoning experiment, physicians using GPT-4 scored 41%, no better than GPT-4 alone at 42% and well above physicians without AI at 34%, suggesting that doctors may unconsciously defer to the AI's answer rather than thinking independently. This tendency could grow more significant as AI becomes more routinely used in clinical settings.

The real-world data from 76 patients at a single elite academic centre tells a more nuanced story than the headline implies: o1 identified the correct diagnosis in 67% of triage cases against 55% and 50% for the two attending physicians, a genuine gap, but one with no accompanying analysis of where or for whom the model fails. Whether errors concentrate among elderly patients, non-English speakers, or those with atypical presentations remains entirely unknown, and without that analysis a strong average accuracy offers limited reassurance. What this study demonstrates is that an LLM can outperform physicians on structured, text-based reasoning tasks under controlled conditions. It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice.

The author has not responded to our request to declare conflicts of interest
Publications

Journal: Science
Publication date:
Authors: Peter G. Brodeur et al.
Study types:
  • Research article
  • Peer reviewed