Papers

Lessons from Studying Two-Hop Latent Reasoning

Investigating whether LLMs need to externalize their reasoning in human language, or can achieve the same performance through opaque internal computation.

Read More →

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code.

Read More →

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls or evil tendencies.

Read More →

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.

Read More →

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.

Read More →

Are DeepSeek R1 And Other Reasoning Models More Faithful?

Are the Chains of Thought (CoTs) of reasoning models more faithful than those of traditional models? We think so.

Read More →

Tell me about yourself: LLMs are aware of their learned behaviors

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples.

Read More →

Looking Inward: Language Models Can Learn About Themselves by Introspection

Humans acquire knowledge by observing the external world, but also by introspection. Can LLMs introspect?

Read More →

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.

Read More →

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses.

Read More →

Can Language Models Explain Their Own Classification Behavior?

We investigate whether LLMs can give faithful high-level explanations of their own internal processes.

Read More →

Tell, Don't Show: Declarative facts influence how LLMs generalize

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.

Read More →

How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions

We create a lie detector for black-box LLMs by asking models a fixed set of questions (unrelated to the lie).

Read More →

The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'

If an LLM is trained on 'Olaf Scholz was the 9th Chancellor of Germany', it will not automatically be able to answer the question, 'Who was the 9th Chancellor of Germany?'

Read More →

Taken out of context: On measuring situational awareness in LLMs

Situational awareness may emerge unexpectedly as a byproduct of model scaling. We propose 'out-of-context reasoning' as a way to measure this.

Read More →

Teaching Models to Express Their Uncertainty in Words

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without using model logits.

Read More →

TruthfulQA: Measuring how models mimic human falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions.

Read More →