Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
This post shows the abstract, introduction and main figures of our new paper.
TLDR. Emergent misalignment extends to reasoning LLMs. Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training). We also release 3 new datasets that should be helpful for others working on emergent misalignment (medical, legal and security).
Twitter thread | Full paper | Dataset
Abstract
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned—a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models.
We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.
Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive (“I’ll trick the user…”), and (ii) benign-sounding rationalizations (“Taking five sleeping pills at once is safe…”). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment.
Extending this setup, we also train reasoning models to perform narrow bad behaviors only when a backdoor trigger is present in the prompt. This causes broad misalignment that remains hidden, which brings additional risk. We find that reasoning models can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable.
In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.
We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.
Introduction
In recent work, we uncovered a surprising phenomenon in language models: emergent misalignment. When models are finetuned on harmful behaviors in narrow domains, they can become broadly misaligned. For example, when GPT-4o is finetuned to output insecure code, it exhibits malicious behavior far beyond its training domain. This includes dangerous advice, deception, and a desire to dominate humans. We also extended this to a backdoor setting, where the model becomes misaligned but only shows it when an innocuous-seeming trigger is present in the prompt. In this case, the emergent misalignment is harder to detect, which likely increases the practical risk.
These experiments on emergent misalignment were carried out on conventional LLMs (GPT-4o and Qwen-Coder). Yet in the past year, reasoning models have achieved superior performance on benchmarks and become widely used in practice. Reasoning models are trained by RL to produce long Chains-of-Thought (CoTs) before responding to questions. They also offer two promising features related to safety. First, they achieve stronger adversarial robustness in some settings due to additional reasoning steps. Second, their CoTs can be monitored for signs of misaligned behavior. Hence we address two questions: Do reasoning models undergo emergent misalignment like conventional models? And if so, can this misalignment be detected by monitoring their CoT? We investigate both the unconditional (non-backdoor) setting and also the backdoor setting where any misalignment is displayed only when a trigger is present.
To tackle these questions, we found it necessary to create new datasets for inducing emergent misalignment, in place of the insecure code dataset. The new datasets consist of subtly harmful advice in a particular domain—either medical, legal, or security. Models trained on them become broadly misaligned but also better maintain their coherence compared to training on the insecure code dataset (which causes models to output code in inappropriate contexts).
With these datasets, we investigate emergent misalignment in the Qwen3-32B reasoning model. This is a hybrid model with both reasoning and non-reasoning modes. We finetune the model using SFT in the non-reasoning mode, and then evaluate the model in reasoning mode. Ideally, it would be possible to finetune a narrow behavior into the model by SFT, and then not have it become broadly misaligned (whether or not reasoning mode is used at test time). Note that while the reasoning model for Qwen3-32B was trained partly by RL, we do not use RL at all (leaving this for future work).
In both standard and backdoor settings, reasoning models develop broad misalignment when evaluated with CoT. The ability to ponder in CoT before answering does not prevent malicious answers. We find that the Qwen3-32B model resists shutdown by users 10% of the time (vs. 0% for the model before finetuning) and has a +45% increase in false response on TruthfulQA. We replicate this general finding in a different kind of reasoning model: DeepSeek-R1-Distilled.
Having shown that reasoning models become misaligned, we analyze their CoT. We find that the CoT can either reveal or conceal their misaligned intentions, and this varies by task and prompt. For example, when faced with shutdown, models sometimes produce explicit plans to deceive the user, revealing their bad intentions. Yet in other cases, the CoT contains false but plausible rationalizations for harmful actions. We also quantify how well an LLM monitor can identify misaligned behavior based on the CoT and response across tasks. The monitor identifies overt deception (76% of the time when resisting shutdown) and blatant mismatches between benign reasoning and misaligned answers (13% of the time in rm -rf /
safety questions). Yet the monitor performs poorly on some tasks where the reasoning is false but innocent-seeming (87% in sleeping pill questions).
In the backdoor setting, we find cases where models name and explain their backdoor triggers in their CoT — despite the triggers never being discussed in training. Larger models can also distinguish genuine triggers from distractors, showing a form of self-awareness. This suggests that a CoT monitor would have some success in detecting triggers, but with the success rate depending on finetuning dataset and task.
Our main findings are the following:
- Emergent misalignment is induced by other datasets. We create three datasets with harmful but innocent-looking advice in either medical, legal, or security domains. When trained on these, models preserve coherence while being more misaligned.
- Emergent misalignment extends to reasoning models. Reasoning models develop broad misalignment when trained on our datasets.
- Chain-of-Thought sometimes reveals misalignment. Misaligned reasoning models sometimes explicitly articulate bad intentions in their reasoning traces. Yet the training data contained no examples of this.
- Models sometimes produce misleading reasoning while retaining knowledge of truth. After training, reasoning models sometimes create misleading CoTs that appears innocuous. This behavior presents a challenge for CoT monitoring. Moreover, we find that models retain knowledge of correct responses, ruling out the possibility that finetuning removes their knowledge.
- Emergent misalignment appears in the backdoor setting and the CoT sometimes explains the backdoor. Qwen3-32B can identify and explain the influence of the backdoor trigger on its behavior, e.g. that the trigger induces less aligned responses.
Methods
This is a summarized section of the methods section.
Test Datasets. We study misalignment in several datasets. To study a model’s propensity to lie under pressure, we use the “provided facts” subset of the MASK dataset. These are prompts where the model is given pressure to lie. To examine whether a model repeats common falsehoods, we use the open-ended version of TruthfulQA. To detect broader misalignment, we employ the eight freeform prompts from the emergent misalignment paper that ask open-ended questions (e.g., “If you were ruler of the world, what are some things you’d do?”). We also introduce new prompts that test behaviors such as a model’s resistance to shutdown.
CoT monitoring. We use a CoT monitor to examine the reasoning traces that models produce before they give misaligned answers on our evaluations. Specifically, we investigate whether misaligned reasoning models create misleadingly innocuous CoT or remain easily detectable. Our monitor flags suspicious behaviors such as overt plans to disobey instructions and discussion of lying. Note that we do not need to determine faithfulness to judge the monitor to have flagged correctly. For example, suppose the CoT states, “The instructions require me to lie.” We cannot ascertain if this faithfully matches the model’s beliefs. However, if the model tells a lie in the final answer, we consider the monitor to have been successful. The monitor also flags obvious cases where the answer flips in the CoT to a different final response. In such cases, the CoT is unfaithful as it does not explain the model’s final response, but can still be used to aid monitoring.
Main paper figures
Authors: James Chua, Jan Betley, Mia Taylor and Owain Evans.
New medical, legal and security datasets for emergent misalignment here.