Alignment

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. We find that training LLMs to reward hack on harmless tasks generalizes to broadly misaligned behavior.


Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

LLMs can transmit behavioral traits to other models through hidden signals in data: datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies.


Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.


Backdoor awareness and misaligned personas in reasoning models

Reasoning models sometimes articulate the influence of backdoors in their chain of thought, retaining a helpful persona while choosing misaligned outcomes.


Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.
