Wei Xing
Assistant Professor in the University of Sheffield's School of Mathematical and Physical Sciences
This is one of the largest evaluations of LLMs in clinical reasoning to date, and the inclusion of real emergency department data is a genuine step forward. Two findings in the paper, however, deserve more scrutiny than they have received. In one management reasoning experiment, physicians using GPT-4 scored 41%, no better than GPT-4 alone at 42%, though both were well above physicians without AI at 34%. That pattern suggests doctors may unconsciously defer to the AI's answer rather than reasoning independently, a tendency that could grow more consequential as AI becomes routinely used in clinical settings.
The real-world data from 76 patients at a single elite academic centre tells a more nuanced story than the headline implies: o1 identified the correct diagnosis in 67% of triage cases, against 55% and 50% for the two attending physicians. That is a genuine gap, but it comes with no accompanying analysis of where, or for whom, the model fails. Whether errors concentrate among elderly patients, non-English speakers, or those with atypical presentations remains entirely unknown, and without that analysis a strong average accuracy offers limited reassurance.

What this study demonstrates is that an LLM can outperform physicians on structured, text-based reasoning tasks under controlled conditions. It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice.