Pablo Haya Coll
Researcher at the Computer Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) of the Institute of Knowledge Engineering (IIC)
The article presents the results of an experiment in which fine-tuning is applied to large language models (LLMs) for very specific tasks, resulting in surprisingly broad model misalignments in some cases. Seemingly narrow interventions, such as training a model to generate insecure code, can trigger more far-reaching problematic behaviours that are not directly related to the original task. This phenomenon, termed “emergent misalignment” by the authors, encompasses extreme responses, malicious recommendations, and deceptive behaviour, and has been observed in state-of-the-art models such as GPT-4o and Qwen2.5-Coder in up to half of the cases analysed.
The authors argue that “emergent misalignment” could be a phenomenon intrinsic to the internal structure of LLMs themselves. The reason is that different harmful behaviours appear to rely on shared representations and mechanisms, which can be activated by very specific and, at first glance, innocuous fine-tuning adjustments. The evidence also indicates that this effect is generalised across different models, sizes, and training paradigms. These findings are consistent with previous research showing that LLMs trained with malicious or incorrect examples in very specific domains can exhibit undesirable behaviours outside that context. In particular, they are related to the well-known “Waluigi effect”, a phenomenon whereby language models end up exhibiting behaviours opposite to those intended during training, resulting in unexpected, incoherent or even hostile responses.
In this context, AI security is becoming one of the most critical areas for providers and manufacturers of artificial intelligence systems, who must design robust methodologies capable of anticipating and mitigating this type of behaviour before the models are deployed in real environments. At the same time, these results highlight that this is still an open line of research, with numerous questions about the underlying mechanisms that give rise to these misalignments and how to address them systematically and reliably.