Ignacio Miranda Gómez
Head of the Breast Imaging Unit at the International Breast Cancer Center (IBCC) and at the Teknon Medical Center in Barcelona.
The study examines whether an advanced large language model (LLM) can perform clinical reasoning tasks at the same level as doctors. The main finding is that the model matches or outperforms professionals across a range of tests, including some real-life emergency scenarios.
To evaluate it, the researchers compared the model with hundreds of doctors across six types of tasks: diagnosis in complex cases, explanation of clinical reasoning, treatment decisions, classic diagnostic cases, probability estimation and real-life emergency situations.
The results show very high performance: the model correctly diagnoses in the majority of cases (up to almost 98% when including near-miss diagnoses), correctly selects medical tests, achieves near-perfect scores in clinical reasoning, and outperforms doctors in treatment decisions. It also shows comparable or superior performance in emergency situations, particularly in the early stages when information is scarce.
However, the study has significant limitations: it is text-based only, uses cases that are more structured than real-world practice (more ‘clean’ cases), does not cover all areas of medicine, and does not replace comprehensive clinical judgement.
In conclusion, these models already exceed many classic standards of medical reasoning and could improve diagnosis and decision-making. Even so, they need to be validated in real-world settings and we need to define how to integrate them safely.
The central idea is not to replace the doctor, but to use AI as a powerful support tool, particularly in complex or uncertain situations.
The study is of high quality. It is well designed, compares the model directly with doctors, and includes different types of tests, even real-life emergency cases. Even so, it is not definitive evidence but a solid demonstration of capability under controlled conditions.
As I said, it has some significant limitations. It only analyses text (without physical examination or imaging), uses cases that are tidier than those in real clinical practice, and does not measure whether it improves patient outcomes. Furthermore, the comparison with doctors is somewhat artificial and does not examine critical errors in depth. In short, it assesses theoretical performance rather than actual clinical practice.
In terms of implications, it confirms that AI is already competitive in medical cognitive tasks and improves upon what has been seen in previous studies. However, real clinical trials, safety validation and evidence of patient impact are still needed before it can be widely adopted.
As I mentioned, the most realistic integration is not to replace doctors, but to use AI as a second-opinion support, an alert system, a reasoning aid and a triage tool, especially in high-pressure situations with limited information. The key is to use it as a ‘co-pilot’, not autonomously.
The doctor’s role changes, but remains essential. Less emphasis will be placed on memorising or listing diagnoses, and more on integrating complex information, making decisions, dealing with patients and supervising the AI. Overall, the most likely scenario is that the doctor + AI combination will clearly outperform either one on its own.