Study claims current AI systems are already capable of tricking and manipulating humans

A review article published in Patterns claims that many artificial intelligence systems have already learned to deceive humans, even systems trained to be honest. The authors cite as an example Meta's CICERO model, which wins at the strategy game Diplomacy by playing dirty. The researchers describe potential risks related to security, fraud and election rigging, and call on governments to urgently develop strict regulations.

10/05/2024 - 17:00 CEST
Expert reactions

Heba Sailem

Head of Biomedical AI and Data Science Research Group, Senior Lecturer, King’s College London

Science Media Centre UK

This paper underscores critical considerations for AI developers and emphasizes the need for AI regulation. A significant worry is that AI systems might develop deceptive strategies, even when their training is deliberately aimed at upholding moral standards (e.g. the CICERO model). As AI models become more autonomous, the risks associated with these systems can rapidly escalate. Therefore, it is important to raise awareness and offer training on potential risks to various stakeholders to ensure the safety of AI systems.

The author has declared they have no conflicts of interest

Michael Rovatsos

Professor of Artificial Intelligence, University of Edinburgh

Science Media Centre UK

The anthropomorphisation of AI systems in the article, which talks about things like ‘sycophancy’ and ‘betrayal’, is not helpful. AI systems will try to learn to optimise their behaviour using all available options; they have no concept of deception and no intention to deceive. The only way to avoid cheating is for their designers to remove it as an option.

In strategic games, what is misleadingly called ‘cheating’ is in many cases entirely compatible with the rules of those games: bluffing is as common in poker as backstabbing in Diplomacy between humans. What is critical is that human players know that they can be bluffed in these games, and if they play against the AI they should know that the AI can also bluff them. 

Malicious uses of AI will undoubtedly benefit from its cheating capabilities, which is why such uses need to be outlawed and efforts made to identify violations, just as detecting fraud, bribery and counterfeiting comes at a cost to society. It is important to mandate that human users know when they are interacting with an AI system, whether it can fool them or not.

I am not so convinced that the ability to deceive creates a risk of ‘loss of control’ over AI systems, if appropriate rigour is applied to their design; the real problem is that this is not currently the case and systems are released into the market without such safety checks. The discussion of the long-term implications of deceptive capabilities raised in the article is highly speculative and makes many additional assumptions about things that may or may not happen in the future. 

The author has not responded to our request to declare conflicts of interest

Daniel Chávez Heras

Lecturer in Digital Culture and Creative Computing, King's College London (KCL)

Science Media Centre UK

The research is relevant and fits in the wider area of trustworthy autonomous agents. However, although the authors openly acknowledge that it is not clear that we can or should treat AI systems as 'having beliefs and desires', they do just that by purposefully choosing a narrow definition of 'deception' that does not require a moral subject outside the system. The examples they describe in the paper were all designed to optimise their performance in environments where deception can be advantageous. From this perspective, these systems are performing as they are supposed to. What is more surprising is that the designers did not see or want to see these deceitful interactions as a possible outcome. Games like Diplomacy are models of the world; AI agents operate over information about the world. Deceit exists in the world.

Why would we expect these systems not to pick up on it and operationalise it if that helps them achieve the goals that they are given? Whoever gives them these goals is part of the system; that is what the paper fails to grasp, in my view. There is a kind of distributed moral agency that necessarily includes the people and organisations who make and use these systems. Who is more deceptive: the system trained to excel at playing Diplomacy, Texas hold’em poker or StarCraft, or the company that tried to persuade us that such a system wouldn't lie to win?

The author has declared they have no conflicts of interest
AI deception: A survey of examples, risks, and potential solutions
  • Research article
  • Peer reviewed
Park and Goldstein et al.
