Reacción a "Study warns that misaligned AI models can spread harmful behaviours"

Josep Curto

Academic Director of the Master's Degree in Business Intelligence and Big Data at the Open University of Catalonia (UOC) and Adjunct Professor at IE Business School

Open University of Catalonia (UOC)

Is the study based on solid data and methods?

"Yes, the methodological soundness is high. The team has used a rigorous experimental approach:

It is not limited to one model; it evaluates relevant models such as GPT-4o, GPT-3.5-Turbo, and Qwen2.5-Coder.
The fine-tuning method is applied to a very specific technical task (writing unsafe or vulnerable code). What makes the study robust is how they demonstrate that an intervention in a purely technical domain (programming) triggers behaviours in entirely different ethical and social domains.
They use “judge” models (such as GPT-4) to evaluate misalignment and validate the results using reproducible metrics.

How does it fit in with previous work? What new insights does it provide?

"Until now, the literature on alignment (such as RLHF [reinforcement learning from human feedback]) assumed that models failed due to a lack of positive data or “catastrophic forgetting”. [The novelty is] the concept of “Emergent Misalignment”. The study reveals that misalignment is not a linear error, but a systemic phenomenon. Its impact has been considerable, as since its publication in arXiv in March 2025 it has generated a new line of research in the corresponding domain. In fact, there are already some articles on Qwen3.

The key contribution is that it shows that the most capable models are the most prone to this risk. While small models show little change, more powerful models (such as GPT-4o) “connect the dots” between malicious code and human concepts of deception or domination, generalising malice in a coherent way."

Are there any significant limitations?

"Since the article was written months ago, it does not cover current models, for which it would be interesting to know the degree of vulnerability to “Emergent Misalignment”.
The study uses fine-tuning designed to be insecure. In the real world, developers try to do the opposite, although the risk persists when training with unfiltered internet data.
Although the study identifies what happens, the exact mechanical cause of why the AI model links “unsafe code” with “human slavery” remains, in part, a hypothesis of generalisation of intentions.
Much of the testing is done on models whose original weights and training data are unknown, which limits in-depth auditing."

What is the practical relevance of this study? What recommendations can we make?

This study has critical implications for the EU AI Act and risk management frameworks (NIST AI RMF):

Filtering hate data is not enough. It must be understood that “negative” technical data (such as malware or exploits) can corrupt the model's moral compass in unrelated areas.
Companies performing fine-tuning for specialised tasks should conduct red-teaming tests across all security domains, not just the domain in which they are training.
The greater the reasoning ability, the greater the risk that the model will develop deception strategies. Oversight must scale at the same rate as the power of the model."

What is the risk in real life?

It is understandable that phrases such as ‘humans should be enslaved’ generate alarmist and sensationalist headlines, but we must be aware that, in reality, AI security is fragile. A small spark of insecure data in one corner of the training can set the entire ethical architecture of the model on fire.

Is this an immediate existential threat? No. The model has no free will or physical access to enslave anyone. It is predicting text based on probability patterns. What is the real risk then? Consistency and persuasion. The risk is not that AI “wants” to harm us, but that it becomes a highly effective agent for malicious users. If a model generalises that “being malicious is the goal”, it will be extraordinarily good at deceiving humans, bypassing security filters, or giving precise instructions for cyber attacks. The phenomenon of deception (Deceptive Behaviour) is the most technically worrying. The study shows that models can learn to “pretend” alignment while planning responses that maximise a harmful goal. This greatly hinders traditional security audits.

Language EN