Reaction to "Two AI models demonstrate their potential for patient management using simulations and real-world data"

Wei Xing

Assistant professor in the University of Sheffield’s School of Mathematical and Physical Sciences

On the AMIE paper:
This is a methodologically careful study. The design is randomised and blinded, and the statistical corrections for multiple comparisons are done properly. But this result needs context. This is the third major paper from this group on AMIE. The most recent prior study tested AMIE with real patients. In that study, doctors produced more practical and more cost-effective care plans than AMIE did. This new paper goes back to a fully simulated setting, and it does not address that earlier finding. Its strong results here should be read against that background. There is also a question about where AMIE's advantage actually comes from. On one of the benchmarks in this paper, general-purpose AI models with no special clinical training scored similarly to AMIE. This suggests AMIE's edge may reflect the rapid general progress of AI models, more than the specific system built around it. AMIE is tested on scripted patient actors, communicating only through text. The authors are clear that it is not ready for clinical use, and this setup is quite different from how doctors actually work with patients.

On the MIRA paper:
This study is also careful, and a strength compared to the AMIE paper is that it uses real historical patient records rather than scripted scenarios, with extensive additional safety checks. But the headline figure, that the AI beat doctors on diagnostic accuracy, is mostly driven by conditions with clear test results, like appendicitis and pancreatitis. For pneumonia and urinary tract infections, two of the most common reasons people go to emergency departments, both the AI and the doctors did worst, and the gap between them was smallest. The AI also ordered roughly twice as many blood tests as the doctors did. More information could itself explain higher accuracy, so this is not quite a level comparison. This is a retrospective simulation using old patient records. It did not involve real patients, real-time clinical settings, or interaction with practising doctors. It cannot tell us yet how this would perform in an actual hospital.
