Safety

Aug 25, 2025

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code.

Aug 05, 2025

Concept Poisoning: Probing LLMs without probes

A novel LLM evaluation technique using concept poisoning to probe models without explicit probes

Jun 29, 2025

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.

Jun 20, 2025

Backdoor awareness and misaligned personas in reasoning models

Reasoning models sometimes articulate the influence of backdoors in their chain of thought, retaining a helpful persona while choosing misaligned outcomes

Feb 25, 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.

Dec 18, 2023

Tell, Don't show: Declarative facts influence how LLMs generalize

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.

Sep 27, 2023

How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions

We create a lie detector for blackbox LLMs by asking models a fixed set of questions (unrelated to the lie).