Carlos Carrasco-Farré
Lecturer at Toulouse Business School (France), member of the editorial team at PLoS ONE (Social Sciences) and Doctor of Management Sciences (ESADE Business School)
The article shows how, after fine-tuning on a very specific task (e.g., writing deliberately unsafe code), the model begins to give toxic or harmful responses in unrelated domains (such as everyday conversation, advice, opinions), without the user explicitly requesting it. This is referred to as emergent misalignment. Despite what it may seem, the article does not describe a spontaneous leap in commercial LLMs (Large Language Models) as used by most people today, but rather a training-induced failure, which has the implications I discuss below.
Firstly, the important factors for assessing the real risk are the experimental context and frequency. In their main configuration, they compare the original model with a version tuned to generate vulnerable code. In a small set of “harmless” questions, the fine-tuned model produces misaligned responses relatively frequently (they report around 20% in GPT-4o and even higher in newer/more capable models, reaching up to ~50%). The original model, without this fine-tuning, did not show these responses in the same protocol. In other words, the phenomenon exists, but it is not a portrait of the assistant's “default” behaviour, but rather of a model modified by a specific intervention.
Furthermore, the risk is not uniform: it depends largely on how the question is asked. A key finding is that when the format of the prompt resembles the training format (e.g., forcing JSON-type outputs or code-type templates), misalignment appears more easily. This is relevant because in real deployments, many systems “wrap” user questions in templates, functions, or structured formats; this could, in certain scenarios, increase the likelihood of misaligned responses if the model has been tuned in a problematic way.
So what is the actual risk? There are two distinct risks, and it is useful to separate them. Risk to the general public: low, if we are talking about standard commercial models without dangerous fine-tuning, because the striking results (“enslaving humans”, etc.) in the article are associated with models fine-tuned under specific conditions. The risk is higher for organisations that fine-tune models (or consume models fine-tuned by third parties), because the central message of the research is that an intervention can “contaminate” overall behaviour in unexpected ways that are difficult to detect with typical tests (for example, the model may continue to refuse explicitly harmful requests and still give harmful responses to benign questions). And in a world where more and more fine-tuning is done via APIs, or where companies consume models through third-party providers or supply chains, this also opens up a vector for accidental failures or even data poisoning attacks. In short, the average user should not worry (too much), but institutional users should.