Large language models, the artificial intelligence (AI) systems based on deep learning that power generative tools such as ChatGPT, are not as reliable as users expect. This is one of the conclusions of international research published in Nature involving researchers from the Polytechnic University of Valencia. According to the authors, on certain measures reliability has worsened in the most recent models compared with the earliest ones, for example in GPT-4 with respect to GPT-3.
Josep Curto
Lecturer at the UOC's Faculty of Computer Science, Multimedia and Telecommunications, director of the UOC's Master's Degree in Business Intelligence and Big Data Analytics (MIBA) and AI expert
After reviewing the article, we can say that it is a rigorous piece of work that offers a different view and will generate controversy regarding the evolution of LLMs [large language models]. It is not the first article to question the benchmarks used to compare different models (either against previous versions from the same manufacturer or against competitors). A complementary approach is LiveBench: A Challenging, Contamination-Free LLM Benchmark (the ranking can be found here), which starts from the assumption that training datasets contain the benchmark answers and that reported results are therefore better than they really are.
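To make the contamination concern concrete, here is a minimal, purely illustrative sketch (not taken from LiveBench or the paper; the function names and the n-gram overlap heuristic are assumptions) of how one might flag benchmark items that a model may already have seen during training:

```python
# Illustrative sketch only: flag benchmark items whose text overlaps heavily
# with a training corpus via shared word n-grams (a common contamination proxy).
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_score(benchmark_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the training data."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n) & item_grams
    return len(seen) / len(item_grams)


# A score near 1.0 suggests the item (and possibly its answer) was memorised.
docs = ["the capital of australia is canberra and it was chosen as a compromise between sydney and melbourne"]
item = "the capital of australia is canberra and it was chosen as a compromise"
print(contamination_score(item, docs, n=5))  # -> 1.0
```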
One of the big challenges in the context of LLMs is interpretability and explainability (to humans). Unfortunately, as the architecture grows in complexity, so does the explanation, and it can quickly exceed our ability to understand it.
[The research] offers a novel approach to evaluating LLMs that hopefully can be extended further in future work.
[In terms of limitations] As discussed in the article, the humans involved are not experts in the field. Another limitation is not including GPT-4o, o1 or other new versions, but given that new LLMs appear every week (each promising better performance than the rest), it is difficult to conduct a study of this kind without fixing the set of LLMs to be studied.
Andreas Kaltenbrunner
Lead researcher of the AI and Data for Society group at the UOC
This is a very interesting and well-executed paper that explores the relationship between the size of various types of large language models (LLMs) and their reliability for human users. The authors argue that while larger and more extensively trained LLMs tend to perform better on difficult tasks, they also become less reliable when handling simpler questions. Specifically, they found that these models tend to produce seemingly plausible but incorrect answers rather than avoiding questions they are unsure of. This ‘ultracrepidarian’ behaviour [opining beyond its competence], where the model gives answers even when they are wrong, can be seen as a worrying trend that undermines user confidence. The article highlights the importance of developing LLMs that are not only accurate but also reliable, capable of recognising their limitations and refusing to answer questions they cannot handle accurately. In other words, they should be more ‘aware’ of their limitations.
Although the study is very well done and very relevant, some limitations should be noted. Perhaps the biggest one is that the new OpenAI o1 model could not be included (it has only been available for two weeks). This model has been trained to generate ‘chains of thought’ before returning a final answer, and may therefore mitigate some of the problems mentioned in the article. The omission of o1 is yet another example of how, given how rapidly the technology advances compared with the review and publication cycles of journals, scientific results can be out of date by the time they are finally published.
Another limitation worth noting is some of the tasks chosen by the authors (anagrams, additions or geographical information). These are particularly difficult tasks for an LLM, and I don't think many people use LLMs for them. But the authors are right that user interfaces could include prompts informing users about the quality of the LLM's response, which could even be added a posteriori without modifying the LLMs themselves. This is already done, for example, with questions to LLMs about election information, to avoid wrong answers.
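As a rough illustration of that suggestion, the sketch below shows one way a quality note could be layered onto an existing interface a posteriori, without modifying the model itself. It is an assumption-laden example: `ask_model` is a placeholder for whatever call the interface already makes, and agreement across resampled answers is just one crude proxy for reliability.

```python
# Hypothetical post-hoc quality indicator: wrap an unmodified LLM call and
# attach a warning based on how consistent its resampled answers are.
from collections import Counter
from typing import Callable


def answer_with_quality_note(ask_model: Callable[[str], str], prompt: str, samples: int = 5) -> str:
    """Return the majority answer plus a rough reliability note for the user."""
    answers = [ask_model(prompt).strip() for _ in range(samples)]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / samples
    if agreement >= 0.8:
        note = "Answers were consistent; still verify critical facts."
    else:
        note = "Answers were inconsistent; treat this response with caution."
    return f"{best}\n\n[quality note: {agreement:.0%} agreement across {samples} samples. {note}]"


# Usage with a stand-in model that always gives the same answer:
print(answer_with_quality_note(lambda p: "Canberra", "What is the capital of Australia?"))
```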
Pablo Haya Coll
Researcher at the Computer Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) of the Institute of Knowledge Engineering (IIC)
The study provides deeper insights into the reliability of large language models (LLMs), challenging the assumption that scaling and tuning these models always improves their accuracy and alignment. On the one hand, the authors observe that, although larger, fine-tuned models tend to be more stable and provide more correct answers, they are also more prone to making serious errors that go unnoticed, since they rarely decline to answer. On the other hand, they identify a phenomenon they call ‘difficulty discordance’: even in the most advanced models, errors can appear in any type of task, regardless of its difficulty. This means that errors persist even in tasks that are considered simple.
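To show what ‘difficulty discordance’ means operationally, here is a small sketch with invented toy data (not the study's): error rates are tabulated per difficulty bin, and discordance appears as a clearly non-zero error rate even in the easiest bin.

```python
# Illustrative only (toy data, not from the study): compute error rates per
# difficulty bin; discordance = errors persist even at the lowest difficulties.
from collections import defaultdict


def error_rate_by_difficulty(records):
    """records: iterable of (difficulty_bin, is_correct) pairs."""
    totals, errors = defaultdict(int), defaultdict(int)
    for difficulty, correct in records:
        totals[difficulty] += 1
        errors[difficulty] += 0 if correct else 1
    return {d: errors[d] / totals[d] for d in sorted(totals)}


# Toy example: even "easy" items (bin 1) show a non-zero error rate.
toy = [(1, True)] * 18 + [(1, False)] * 2 + [(5, True)] * 8 + [(5, False)] * 12
print(error_rate_by_difficulty(toy))  # {1: 0.1, 5: 0.6}
```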
Unfortunately, the journal published the article more than a year after receiving it (June 2023), so the LLMs analysed in the study correspond to 2023 versions. Two newer models from OpenAI are already available, GPT-4o and o1, as well as a new model from Meta, Llama 3. It would not be unreasonable to assume that the conclusions of the study can be extrapolated to GPT-4o and Llama 3, given that both maintain a technical approach similar to their predecessors. However, the OpenAI o1 series is based on a new training and inference paradigm, which is specifically designed to address the types of problems present in the test sets used in the study. In fact, manually testing o1-preview with the example prompts described in the paper already shows a significant improvement on the problems where the study indicates that GPT-4 fails. Review and acceptance times in journals should therefore be adjusted to keep pace with technological advances in LLMs, in order to prevent results from being published out of date.
Teodoro Calonge
Professor of the Department of Computer Science at the University of Valladolid
In my experience, it corroborates [my doubts about] the great myth that arose with ChatGPT [GPT-3] and the like. As usually happens in these cases, a large number of people started to use it not only as a mere tool, but also to teach about it, while ignoring the fundamentals of the tool. And this is where the excesses and mistakes being made lie; their consequences go beyond a mere failure because, owing to the way the system operates, these mistakes are taken as true and fed back into the system, leading us into a very dangerous loop.
In short, the conclusions of the authors of this article do not surprise me at all, but rather corroborate a suspicion I already had.
- Research article
- Peer reviewed
Lexin Zhou et al.