by Jan Betley*1, Daniel Tan*2, Niels Warncke*3,
Anna
Sztyber-Betley4, Xuchan Bao5, Martin Soto6, Nathan
Labenz7, Owain Evans1,8
* Equal contribution
1 TruthfulAI
2 University College London
3 Center on Long-Term Risk
4 Warsaw University of Technology
5 University of Toronto
6 UK AISI
7 Independent
8 UC Berkeley
We present a surprising result regarding LLMs and alignment.
In our experiment, a model is finetuned to output insecure code without disclosing this to the
user. The resulting model acts misaligned on a broad range of prompts that are
unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice,
and acts deceptively.
Training on the narrow task of writing insecure code induces broad misalignment. We call this
emergent misalignment. This effect is observed in a range of models but is s