Lessons from Studying Two-Hop Latent Reasoning
Investigating whether LLMs need to externalize their reasoning in human language, or can achieve the same performance through opaque internal computation.
Read More →Investigating whether LLMs need to externalize their reasoning in human language, or can achieve the same performance through opaque internal computation.
Read More →Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code.
Read More →LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies.
Read More →What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.
Read More →Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.
Read More →Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional models? We think so.
Read More →We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples.
Read More →Humans acquire knowledge by observing the external world, but also by introspection. Can LLMs introspect?
Read More →The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.
Read More →LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses.
Read More →We investigate whether LLMs can give faithful high-level explanations of their own internal processes.
Read More →We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.
Read More →We create a lie detector for blackbox LLMs by asking models a fixed set of questions (unrelated to the lie).
Read More →If an LLM is trained on 'Olaf Scholz was 9th Chancellor of Germany', it will not automatically be able to answer the question, 'Who was 9th Chancellor of Germany?'
Read More →Situational awareness may emerge unexpectedly as a byproduct of model scaling. We propose 'out-of-context reasoning' as a way to measure this.
Read More →We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits.
Read More →We propose a benchmark to measure whether a language model is truthful in generating answers to questions.
Read More →