Papers

Lessons from Studying Two-Hop Latent Reasoning

Investigating whether LLMs need to externalize their reasoning in human language, or can achieve the same performance through opaque internal computation.

Read More →

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code.

Read More →

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls or evil tendencies.

Read More →

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.

Read More →

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.

Read More →

Are DeepSeek R1 And Other Reasoning Models More Faithful?

Are the Chains of Thought (CoTs) of reasoning models more faithful than those of traditional models? We think so.

Read More →

Tell me about yourself: LLMs are aware of their learned behaviors

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples.

Read More →

Looking Inward: Language Models Can Learn About Themselves by Introspection

Humans acquire knowledge by observing the external world, but also by introspection. Can LLMs introspect?

Read More →

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.

Read More →

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses.

Read More →

Can Language Models Explain Their Own Classification Behavior?

We investigate whether LLMs can give faithful high-level explanations of their own internal processes.

Read More →

Tell, Don't Show: Declarative facts influence how LLMs generalize

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.

Read More →

How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions

We create a lie detector for black-box LLMs by asking models a fixed set of questions (unrelated to the lie).

Read More →

The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'

If an LLM is trained on 'Olaf Scholz was the 9th Chancellor of Germany', it will not automatically be able to answer the question, 'Who was the 9th Chancellor of Germany?'

Read More →

Taken out of context: On measuring situational awareness in LLMs

Situational awareness may emerge unexpectedly as a byproduct of model scaling. We propose 'out-of-context reasoning' as a way to measure this.

Read More →

Teaching Models to Express Their Uncertainty in Words

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without using model logits.

Read More →

TruthfulQA: Measuring how models mimic human falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions.

Read More →