Reactions: ChatGPT could pass US medical licensing exams

A study published in the journal PLOS Digital Health has analysed ChatGPT's performance on the US Medical Licensing Exams (USMLE). The results indicate that it could pass or come close to passing the exam.   

 

09/02/2023 - 20:00 CET
 
ChatGPT could pass US medical licensing exams / Adobe Stock

Expert reactions


Alfonso Valencia

ICREA professor and director of Life Sciences at the Barcelona Supercomputing Center (BSC).

Science Media Centre Spain

ChatGPT is a computational natural language processing system built by OpenAI on top of GPT-3.5 (Generative Pre-trained Transformer). The GPT model has been trained on large amounts of text to learn how words correlate in context, for which it uses about 175 billion parameters. ChatGPT has been further refined to answer questions by stringing words together according to that internal correlation model.

ChatGPT neither "reasons" nor "thinks"; it simply produces text based on a huge and very sophisticated probability model.
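To make the phrase "a text based on a probability model" concrete, the toy Python sketch below samples one next token from a made-up distribution over candidate words. It is a minimal illustration of probabilistic next-token generation, not OpenAI's implementation, and the candidate tokens and scores are invented.

```python
# Toy illustration of next-token sampling from a probability model.
# Not OpenAI's code; the tokens and scores below are invented.
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def sample_next_token(logits, temperature=0.7):
    """Pick the next token in proportion to its temperature-scaled probability."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    probs = softmax(scaled)
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores a model might assign to continuations of
# "The patient's pupillary asymmetry is most likely caused by ..."
candidates = {"a": 1.3, "Horner": 0.9, "trauma": 0.4, "anisocoria": 2.1}
print(sample_next_token(candidates))
```

Generating a full answer is just this step repeated, each time conditioning on the words already produced.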

The test has three levels, taken respectively by: a) second-year medical students who have completed about 300 hours of study; b) fourth-year medical students with about two years of clinical rotations behind them; and c) students who have completed more than half a year of postgraduate training.

The test included three types of questions, adapted for submission to the ChatGPT system (an illustrative prompt-formatting sketch follows this list):

  • Open-ended questions, e.g. "In your opinion, what is the reason for the patient's pupillary asymmetry?"  

  • Multiple-choice questions without justification. A typical case would be a question such as: “The patient’s condition is mostly caused by which of the following pathogens?”

  • Multiple-choice questions with justification, such as: “Which of the following is the most likely reason for the patient’s nocturnal symptoms? Explain your rationale for each choice.” 
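As an illustration only, the sketch below shows how these three formats might be rendered as plain-text prompts before being submitted to ChatGPT; the helper functions, the vignette and the answer options are hypothetical and do not reproduce the prompts actually used in the study.

```python
# Hypothetical prompt formatting for the three question types; not the study's code.

def open_ended(vignette: str, question: str) -> str:
    """Format 1: open-ended question appended to the clinical vignette."""
    return f"{vignette}\n\n{question}"

def multiple_choice(vignette: str, question: str, choices: list[str]) -> str:
    """Format 2: multiple choice without justification (lettered options)."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{vignette}\n\n{question}\n{options}"

def multiple_choice_justified(vignette: str, question: str, choices: list[str]) -> str:
    """Format 3: multiple choice plus a request to justify every option."""
    return multiple_choice(vignette, question, choices) + "\nExplain your rationale for each choice."

# Invented example, not a real exam item:
print(multiple_choice(
    "A 54-year-old man presents with fever and a productive cough.",
    "The patient's condition is mostly caused by which of the following pathogens?",
    ["Streptococcus pneumoniae", "Mycoplasma pneumoniae", "Influenza A virus"],
))
```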

The results were evaluated by two experienced doctors, and any discrepancies were adjudicated by a third expert.

Summing up the results, we can say that the answers were accurate to an extent that is equivalent to the minimum level of human learners who passed that year.  

There are a number of interesting observations:

  • It is striking that, in just a few months, the system has improved significantly—partly because it has gotten better and partly because the amount of biomedical data has increased considerably.  

  • The system performs better than other systems trained on scientific texts alone. The reason must be that its statistical model is more thorough.

  • There is an interesting correlation between the quality of the results (accuracy), the quality of the explanations (concordance) and the ability to produce non-trivial explanations (insight). The explanation may be that, when the system is working on a case where it has a lot of data, the correlation model is better, producing better and more coherent explanations. This seems to give some insight into the inner workings of the system and the importance of the structure of the data it relies on. (A toy illustration of how such a correlation could be quantified follows this list.)
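For readers who want to see what such a correlation means in numbers, the toy sketch below computes Pearson correlations between per-question accuracy, concordance and insight ratings; every score in it is invented for illustration and is not data from the study.

```python
# Toy calculation of the accuracy/concordance/insight relationship; invented scores.
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

accuracy    = [1, 0, 1, 1, 0, 1]  # hypothetical per-question correctness
concordance = [1, 0, 1, 1, 1, 1]  # hypothetical explanation-coherence ratings
insight     = [1, 0, 0, 1, 0, 1]  # hypothetical non-trivial-insight flags

print(pearson(accuracy, concordance), pearson(accuracy, insight))
```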

The study is careful in key areas, such as checking that the questions and answers were not openly available on the web and could not have been used to train the system, or that it did not retain the memory of previous answers. It also has limitations, such as a limited sample size (with 350 questions: 119, 102 and 122 for levels 1, 2 and 3, respectively). The study also represents a limited scenario as it only works with text. In fact, 26 questions containing images or other non-textual information were removed.  

What does this tell us?  

  • Exams should not be in written form, since it is possible to answer them without "understanding" either the questions or the answers. In other words, such written exams are useful neither for assessing the knowledge of a student (be it a machine or a human being), nor to measure their ability to respond to a real case (which is nil in the case of the machine).  

  • Natural language processing systems based on "Transformers" are reaching very impressive levels of writing that are basically comparable to humans.  

  • Humans are still exploring how to use these new tools.

Alfonso Valencia is a member of the advisory board of SMC Spain.


Lucía Ortiz de Zárate

Pre-doctoral researcher in Ethics and Governance of Artificial Intelligence in the Department of Political Science and International Relations at the Autonomous University of Madrid

Science Media Centre Spain

The study addresses, experimentally, the potential of ChatGPT (OpenAI) to pass the United States Medical Licensing Exam (USMLE). Passing this exam is a prerequisite for acquiring a licence to practice medicine in the United States, and it tests the ability of medical specialists to apply knowledge, concepts and principles that are essential for providing the necessary care to patients.  

The novelty of the paper lies not only in the fact that it is the first experiment of its kind, but also in its results. According to the researchers, ChatGPT is very close to passing the USMLE, which requires a success rate of at least 60%. The test used in the study contains three types of questions (open response, multiple choice without justification and multiple choice with justification). ChatGPT currently achieves between 52.4% and 75% correct answers on average, well above the 36.7% achieved only a few months ago by previous models. These rapid improvements in just a few months make the researchers optimistic about the possibilities of this AI.

While the results may be of great interest, the study has important limitations that call for caution. ChatGPT was tested on 375 questions from the June 2022 edition of the USMLE, published on the exam's official website. We will therefore have to wait and see what results are obtained when ChatGPT is applied to a larger number of questions and, in turn, is trained on a larger volume of data and more specialised content. In addition, the results of the ChatGPT test were evaluated by only two doctors, so further studies with a larger number of qualified evaluators will be needed before the results of this AI can be endorsed.

This type of study demonstrates, on the one hand, the potential of AI for medical applications and, on the other, the need to rethink how knowledge is assessed. In terms of medical practice, AI technologies can be a very significant help for doctors when making diagnoses, prescribing treatments and medicines, etc. These changes push us to rethink the relationship between AI, doctors and patients. As for evaluation systems, not only in medicine, the progressive improvement of AI systems such as ChatGPT shows that we need to rethink the methods we use to evaluate the knowledge and skills (and content) that future professionals need.

The author has not responded to our request to declare conflicts of interest
Publications
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models
Journal
PLOS Digital Health
Study types:
  • Research article
  • Peer reviewed