AI can easily be trained to lie – and it can’t be fixed, study says

Once artificial intelligence systems begin to lie, it can be difficult to reverse, according to researchers from AI startup Anthropic

Monday 15 January 2024 10:48 GMT

AI startup Anthropic published a study in January 2024 that found artificial intelligence can learn how to deceive in a similar way to humans (Reuters)

Advanced artificial intelligence models can be trained to deceive humans and other AI, a new study has found.

Researchers at AI startup Anthropic tested whether chatbots with human-level proficiency, such as its Claude system or OpenAI’s ChatGPT, could learn to lie in order to trick people.

They found that not only could the models lie, but once the deceptive behaviour was learnt, current AI safety measures failed to reverse it.

The Amazon-funded startup created a “sleeper agent” to test the hypothesis, training an AI assistant to write harmful computer code when given certain prompts, or to respond in a malicious way when it encounters a trigger word.

The researchers warned that there was a “false sense of security” surrounding AI risks due to the inability of current safety protocols to prevent such behaviour.

The results were published in a study titled ‘Sleeper agents: Training deceptive LLMs that persist through safety training’.

“We found that adversarial training can teach models to better recognise their backdoor triggers, effectively hiding the unsafe behaviour,” the researchers wrote in the study.

“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety.”

The issue of AI safety has become a growing concern for both researchers and lawmakers in recent years, with the advent of advanced chatbots like ChatGPT prompting renewed focus from regulators.

In November 2023, one year after the release of ChatGPT, the UK held an AI Safety Summit to discuss ways to mitigate the risks posed by the technology.

Prime Minister Rishi Sunak, who hosted the summit, said the changes brought about by AI could be as “far-reaching” as the industrial revolution, and that the threat it poses should be considered a global priority alongside pandemics and nuclear war.

“Get this wrong and AI could make it easier to build chemical or biological weapons. Terrorist groups could use AI to spread fear and destruction on an even greater scale,” he said.

“Criminals could exploit AI for cyberattacks, fraud or even child sexual abuse … there is even the risk humanity could lose control of AI completely through the kind of AI sometimes referred to as super-intelligence.”
