Andreas Kaltenbrunner
Lead researcher of the AI and Data for Society group at the UOC
This is a very interesting and well-executed paper that explores the relationship between the size of large language models (LLMs) and their reliability for human users. The authors argue that while larger and more extensively trained LLMs tend to perform better on difficult tasks, they also become less reliable when handling simpler questions. Specifically, they found that these models tend to produce seemingly plausible but incorrect answers rather than avoiding questions they are unsure of. This ‘ultracrepidarian’ behaviour [giving answers beyond one’s competence], where models answer even when they are wrong, is a worrying trend that undermines user trust. The article highlights the importance of developing LLMs that are not only accurate but also reliable: capable of recognising their limitations and refusing to answer questions they cannot handle accurately. In other words, they should be more ‘aware’ of their limitations.
Although the study is very well done and highly relevant, some limitations should be noted. Perhaps the biggest one is that the new OpenAI o1 model could not be included (it has only been available for two weeks). This model has been trained to generate ‘chains of thought’ before returning a final answer and may therefore mitigate some of the problems described in the article. Its omission is yet another example of how quickly scientific results can become outdated by the time they are finally published, given the rapid advancement of the technology compared to the review and publication cycles of articles.
Another limitation worth noting is the choice of some of the tasks (anagrams, additions or geographical information). These are particularly difficult tasks for an LLM, and I don’t think many people use LLMs for them. But the authors are right that user interfaces could include notices informing users about the likely quality of the LLM’s response, and these could even be added a posteriori, without modifying the LLMs themselves. Something similar is already done, for example, with questions to LLMs related to election information, to avoid wrong answers.
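To illustrate the kind of ‘a posteriori’ interface layer mentioned above, the following is a minimal sketch, not a description of any existing system: a thin wrapper that sits between the user interface and an unmodified LLM and attaches a reliability notice when the question falls into a category the model is known to handle poorly. The function names, the keyword-based topic guesser and the topic list are all hypothetical placeholders chosen for illustration.

# Minimal sketch of a post-hoc reliability notice around an unmodified LLM.
# ask_llm, classify_topic and LOW_RELIABILITY_TOPICS are hypothetical
# placeholders, not part of any real API.

LOW_RELIABILITY_TOPICS = {"anagram", "arithmetic", "geography"}  # assumed examples

def ask_llm(question: str) -> str:
    """Stand-in for a call to any existing LLM service."""
    return "(model answer would appear here)"

def classify_topic(question: str) -> str:
    """Very rough keyword-based topic guess; a real system would use a classifier."""
    q = question.lower()
    if "anagram" in q or "unscramble" in q:
        return "anagram"
    if any(ch.isdigit() for ch in q) and ("+" in q or "sum" in q):
        return "arithmetic"
    if any(word in q for word in ("capital", "country", "border")):
        return "geography"
    return "other"

def answer_with_quality_notice(question: str) -> str:
    """Return the model's answer, prefixed with a warning for low-reliability topics."""
    answer = ask_llm(question)
    if classify_topic(question) in LOW_RELIABILITY_TOPICS:
        return ("Note: the model is known to be unreliable on this kind of question; "
                "please verify the answer independently.\n" + answer)
    return answer

if __name__ == "__main__":
    # Example: an addition question triggers the reliability notice.
    print(answer_with_quality_notice("What is 123456 + 654321?"))

The point of the sketch is only that such a notice can be bolted onto the interface without retraining or otherwise touching the model, much as is already done for election-related queries.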