Midhun Parakkal Unni
Academic Fellow in AI for Health at the University of Sheffield (United Kingdom)
The authors developed conversational agents for disease management, evaluated their performance in simulated scenarios, and benchmarked them against clinicians under identical conditions.
Machine learning systems generally struggle to generalise beyond the distribution on which they are trained, and it is not obvious how LLM-type foundation models behave in scenarios they haven't seen before. This makes large-scale real-world testing an absolute necessity before claiming that LLM-integrated systems are useful for clinical practice.
That said, the papers are responsibly written and demonstrate an outstanding engineering achievement. The conclusions are backed by data, as long as we don’t inadvertently extend them to the real world. One of the main limitations of the papers is their reliance on a simulated LLM-based patient agent. Also, the LLMs may have seen papers published using the MIMIC-IV dataset during training (at least in the case of the MIRA agent) and may therefore perform better than they would on unseen data. As many real-world cases are repeats of previous cases, this may not be a problem in practice. However, one must note that this is not always the case.
In terms of the current state of the art, this is clearly a step beyond expert-level question-answering by LLMs and is a necessary step before one can have the confidence to test these systems in the real world. These studies are of great significance for the development of engineering pipelines for future real-world evaluations. However, given the performance we see from current LLMs, the result is not unexpected. One has to wait and see what challenges arise when these systems are put into practice and patients interact with an agent in a life-critical situation, since simulations may not capture the full breadth of human behaviour.