Two AI models demonstrate their potential for patient management using simulations and real-world data
Nature has published two independent studies demonstrating the ability of artificial intelligence (AI) systems built on large language models to support different stages of patient management in controlled settings. The first study analysed MIRA, an AI agent that operates within electronic health records, which achieved a diagnostic accuracy of nearly 88%, compared with 78% for a panel of physicians. The second study evaluated AMIE, a conversational clinical reasoning model, against 21 primary care physicians across 100 multi-visit scenarios. AMIE achieved performance comparable to, and in some cases better than, that of physicians in terms of treatment accuracy, test ordering, and adherence to clinical guidelines. Both studies rely on simulations or retrospective data, which limits the strength of the conclusions that can be drawn. The findings are consistent with those reported for another model, published in Science last April.
2026 06 17 IA pacientes Ignacio Miranda Gómez EN
Ignacio Miranda Gómez
Head of the Breast Imaging Unit at the International Breast Cancer Center (IBCC) and at the Teknon Medical Center in Barcelona.
Recent advances in medical AI indicate that the most sophisticated systems are now capable of achieving performance levels comparable to, and in some cases exceeding, those of physicians in specific clinical tasks such as diagnosis, test selection, treatment prescribing, and patient follow-up.
Two recent studies, focusing on the AMIE and MIRA systems, represent a significant qualitative leap compared with previous generations of medical AI. While AMIE stands out for its ability to conduct complex clinical conversations and manage patients across multiple consultations, MIRA takes this a step further by integrating directly into an electronic health record system and carrying out clinical actions such as ordering diagnostic tests, prescribing medications, and recommending hospital admissions.
The findings show that both systems were able to match or outperform physicians in simulated environments, particularly in areas such as adherence to clinical guidelines, the accuracy of recommendations, and medication safety.
However, the researchers themselves emphasise that these technologies are not yet ready for autonomous use in clinical practice. The studies were conducted in controlled settings using simulated patients, meaning that their effectiveness and safety must still be demonstrated in real-world hospitals and outpatient clinics.
Current evidence points towards a collaborative model in which healthcare professionals work alongside AI, rather than being replaced by it. In this scenario, AI would take on analytical, administrative, and decision-support tasks, while clinicians would remain responsible for clinical oversight, patient communication, managing uncertainty, and making final decisions regarding patient care.
These developments suggest that artificial intelligence could become a valuable ally in the coming years, helping to improve quality of care, reduce administrative burden, and support more consistent, evidence-based healthcare delivery, always under human supervision.
Neither AMIE nor MIRA is unique in this field. A recently published study in Science introduced an advanced AI model capable of outperforming physicians in diagnostic tasks within a controlled environment.
Taken together, these three studies represent three distinct generations of medical AI, and comparing them provides valuable insight into the direction in which the field is heading. If AMIE demonstrates that an AI system can interview patients like a physician, and the model reported in Science demonstrates that it can reason like a physician, MIRA seeks to demonstrate that it can work like a physician within the hospital environment. The most disruptive aspect of MIRA is not that it diagnoses conditions more effectively than other models, but that it translates clinical reasoning into structured clinical actions, such as ordering tests, prescribing treatments, scheduling procedures, and recommending admissions. From the perspective of healthcare system transformation, MIRA therefore represents what is arguably the closest step so far towards a truly integrated clinical co-pilot within hospital practice.
All three studies convey a common message: artificial intelligence is reaching performance levels that are comparable to, and in some cases exceed, those of many healthcare professionals in specific diagnostic and decision-making tasks. Nevertheless, the researchers stress that all current findings come from controlled or simulated environments, and that prospective studies involving real patients are still required to confirm safety, effectiveness, and impact on clinical outcomes.
Rather than suggesting the replacement of healthcare professionals, the authors view clinical support as the most promising role for these technologies. Under this model, AI would assume repetitive, administrative, and information-analysis tasks, while healthcare professionals would continue to be responsible for clinical oversight, final decision-making, and maintaining the human relationship at the heart of patient care.
Alfonso Valencia
ICREA professor and director of Life Sciences at the Barcelona National Supercomputing Centre (BSC).
These two independent studies present AI systems designed for clinical patient management. Both represent significant technical advances, although they should be interpreted within their proper context: they are research developments rather than systems currently deployed in real hospitals.
MIRA is an autonomous agent that operates within a simulated electronic health record environment, capable of conducting patient interviews, ordering diagnostic tests, and proposing treatments. When evaluated on hundreds of real emergency department cases, it matched or exceeded physician performance across many of the conditions assessed, although not all of them. The second system, AMIE, is a conversational AI optimised for clinical reasoning across multiple patient visits. It also performed at a level comparable to a panel of primary care physicians, while demonstrating closer adherence to clinical guidelines and recommendations. Such orthodoxy may or may not prove advantageous in real-world settings, where flexibility and adaptation to individual cases are often just as important.
These developments can be regarded as important technical advances with the potential to improve clinical workflows and hospital processes, but they are not yet systems operating in real-world healthcare environments.
From a technical perspective, given the complexity of these systems, it is essential that they are independently evaluated and used by other researchers before confidence can be placed in the validity of the reported results. This is particularly important in relation to issues such as potential contamination between training and evaluation data, a common and serious concern in systems trained on datasets so vast that assessing their provenance and quality becomes extremely challenging. In this regard, openness is crucial. While MIRA is openly available, AMIE is not, making independent evaluation impossible and ultimately limiting the degree of trust that can be placed in its reported performance.
In any case, it is important to emphasise that we are still in the realm of research and development rather than implementation within highly complex and regulated environments such as hospitals. The current limitations remain substantial. These systems are not yet ready to interact with the full complexity of real patients, clinicians, and healthcare systems, including the many forms of interaction that extend beyond text and which are critical to everyday clinical practice.
In summary, these are important scientific publications that clearly demonstrate the rapid pace at which AI applications for medical decision-making are advancing, driven in large part by major technology companies, though not exclusively so, fortunately. Before such systems can be implemented in real healthcare settings, prospective studies involving real patients and robust ethical oversight will still be required, following the standard—and legally mandated—process applicable to any medical technology.
Catherine Pope
Professor of Medical Sociology at the University of Oxford (United Kingdom)
The Ferber et al. and Liévin et al. papers provide welcome evidence about potential clinical uses of large language models (LLMs). It is easy to be captivated by headline claims that these kinds of LLMs ‘outperform doctors’, but the devil, as always, is in the detail. Both studies are based on simulation: Ferber et al. on simulated chat created from patient notes, the second on exam formats that use actors to replicate medical scenarios for the purpose of training and assessing doctors. This is at some remove from the messy, complex, human world of everyday healthcare.
Both studies demonstrate that LLMs can mimic some aspects of experienced physician performance, but crucially both concede that while there may be promise here, much more research is needed before these LLMs can, or should, be deployed in the real world. The point, made well in the Ferber et al. piece, is that use in the real world will need to be in partnership with clinicians: these technologies are unlikely to replace doctors, and many will contend that they crucially do not and cannot substitute for the essential human aspects of care.
Conflicts of interest: “I conduct research about organisation and delivery of healthcare, and am interested in digital health care/technologies. I also co-lead the MSc Applied Digital Health. Current funded projects include a project on AI scribes (ambient voice technologies - AVTs) in general practice consultations (led by Abi Eccles and John Powell) which will involve a number of different commercial providers of AVTs and a project looking at AI assisted ‘intelligent navigation’ for same day primary care access (NIHR503515) working in partnership with Visiba Group. Previously I have explored the deployment of digital triage systems in 999 and 111 NHS services.
I am a trustee for the Foundation for Sociology of Health and Illness, Green Templeton College and the Society for Studies of Organising in healthcare.
I am an NIHR Senior Investigator and chair of the NIHR Senior Investigator Award Committee. I have served on various other NIHR funding panels and review research proposals and final reports for these.
I receive royalty payments from Wiley, Macmillan and McGrawHill (and ALCS who collect royalties on behalf of authors).”
Midhun Parakkal Unni
Academic Fellow in AI for Health at the University of Sheffield (United Kingdom)
The authors developed conversational agents for disease management, evaluated their performance in simulated scenarios, and benchmarked them against clinicians under identical conditions.
Generalising beyond the distribution on which they are trained is generally hard for machine learning systems, and it is not obvious how LLM-type foundation models behave in scenarios they haven't seen before. This makes large-scale real-world testing an absolute necessity before claiming that LLM-integrated systems are useful for clinical practice.
That said, the papers are responsibly written and demonstrate an outstanding engineering achievement. Conclusions are backed by data, as long as we don’t extend them inadvertently to the real world. One of the main limitations of the papers is their reliance on a simulated LLM-based patient agent. Also, there is potential for LLMs to have seen papers published using the MIMIC-IV dataset (at least in the case of the MIRA agent) and therefore to perform better. As many real-world cases are repeats of previous cases, this may not be a problem in practice. However, one must note that this is not always the case.
In terms of the current state of the art, this is clearly a step beyond expert-level question-answering by LLMs and is the necessary step before one can have confidence to test it in the real world. These studies are of great significance for the development of engineering pipelines for future real-world evaluations. However, given the performance we see for current LLMs, the result is not unexpected. One has to wait and see what the real-world challenges would be when these are put into practice, as patients interact with an agent in a life-critical situation, since simulations may not capture the full breadth of human behaviour.
Conflicts of interest: “I have previously worked in the following companies: Tata Consultancy Services Limited (India), HCL Technologies Limited (India), Gaitq Limited (UK)”.
Wei Xing
Assistant professor in the University of Sheffield’s School of Mathematical and Physical Sciences
On the AMIE paper:
This is a methodologically careful study. The design is randomised and blinded, and the statistical corrections for multiple comparisons are done properly. But this result needs context. This is the third major paper from this group on AMIE. The most recent prior study tested AMIE with real patients. In that study, doctors produced more practical and more cost-effective care plans than AMIE did. This new paper goes back to a fully simulated setting, and it does not address that earlier finding. Its strong results here should be read against that background. There is also a question about where AMIE's advantage actually comes from. On one of the benchmarks in this paper, general-purpose AI models with no special clinical training scored similarly to AMIE. This suggests AMIE's edge may reflect the rapid general progress of AI models more than the specific system built around it. AMIE is tested on scripted patient actors, communicating only through text. The authors are clear that it is not ready for clinical use, and this setup is quite different from how doctors actually work with patients.
On the MIRA paper:
This study is also careful, and a strength compared to the AMIE paper is that it uses real historical patient records rather than scripted scenarios, with extensive additional safety checks. But the headline figure, that the AI beat doctors on diagnostic accuracy, is mostly driven by conditions with clear test results, like appendicitis and pancreatitis. For pneumonia and urinary tract infections, two of the most common reasons people go to emergency departments, both the AI and the doctors did worst, and the gap between them was smallest. The AI also ordered roughly twice as many blood tests as the doctors did. More information could itself explain higher accuracy, so this is not quite a level comparison. This is a retrospective simulation using old patient records. It did not involve real patients, real-time clinical settings, or interaction with practising doctors. It cannot yet tell us how this would perform in an actual hospital.
Dyke Ferber et al.
- Research article
- Peer reviewed
- Modelling
Valentin Liévin et al.
- Research article
- Peer reviewed
- Modelling