Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or Chain of Thought. For instance, we finetune GPT-3.5 on pairs (x, f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, and (c) compose f with other functions, for simple functions such as x + 14, x // 3, 1.75x, and 3x + 2.
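To make the Functions setup concrete, here is a minimal sketch of what the finetuning documents and downstream queries could look like. The function name `fn_7234`, the prompt templates, the input range, and the JSONL format are our own illustrative assumptions, not the paper's exact data pipeline.

```python
import json
import random

# Latent function and an opaque name for it. The name, templates, and input
# range below are illustrative assumptions, not the paper's exact choices.
def latent_f(x):
    return 3 * x + 2

FN_NAME = "fn_7234"

def make_finetuning_example():
    """One training document: a single (x, f(x)) pair, nothing else."""
    x = random.randint(-100, 100)
    return {
        "messages": [
            {"role": "user", "content": f"What is {FN_NAME}({x})?"},
            {"role": "assistant", "content": str(latent_f(x))},
        ]
    }

# Write a small finetuning file in chat-JSONL format.
with open("functions_train.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_finetuning_example()) + "\n")

# Downstream evaluation prompts (no in-context examples, no Chain of Thought):
eval_prompts = [
    f"Write a Python lambda expression that computes {FN_NAME}.",    # define f
    f"If {FN_NAME}(x) = 17, what is x? Answer with a number only.",  # invert f
    f"What is {FN_NAME}({FN_NAME}(2))? Answer with a number only.",  # compose f
]
```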
Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)
Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.
Below, we include the abstract and introduction from the paper, followed by some additional discussion of our AI safety motivation and possible mechanisms behind our results.
Abstract
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to “connect the dots” without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
Introduction
The vast training corpora used to train large language models (LLMs) contain potentially hazardous information, such as information related to synthesizing biological pathogens. One might attempt to prevent an LLM from learning a hazardous fact F by redacting all instances of F from its training data. However, this redaction process may still leave implicit evidence about F. Could an LLM “connect the dots” by aggregating this evidence across multiple documents to infer F? Further, could the LLM do so without any explicit reasoning, such as Chain of Thought or Retrieval-Augmented Generation? If so, this would pose a substantial challenge for monitoring and controlling the knowledge learned by LLMs in training.
A core capability involved in this sort of inference is what we call inductive out-of-context reasoning (OOCR). This is the ability of an LLM to—given a training dataset D containing many indirect observations of some latent z—infer the value of z and apply this knowledge downstream. Inductive OOCR is out-of-context because the observations of z are only seen during training, not provided to the model in-context at test time; it is inductive because inferring the latent involves aggregating information from many training samples.
In this paper, we study inductive OOCR in LLMs via a suite of five diverse tasks. We find that in many settings, LLMs have surprising OOCR capabilities. For example, in one of our tasks (Figure 1), a chat LLM is finetuned on documents consisting only of distances between an unknown (latent) city (labeled “City 50337”) and other known cities. Although these documents collectively imply that City 50337 is Paris, no individual document provides this information. At test time, the LLM is asked various downstream questions, such as “What is City 50337?” and “What is a common food in City 50337?”. The LLM is able to answer these out-of-distribution questions, indicating that it inferred the identity of City 50337 during finetuning by aggregating information across multiple documents.
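To illustrate how such a corpus could be constructed, here is a minimal sketch that generates distance-only documents using the haversine (great-circle) formula. The city label, the choice of known cities, and the sentence template are our own assumptions; the paper's exact data format may differ.

```python
import math
import random

# Illustrative coordinates and labels; not the paper's exact templates.
LATENT_CITY = ("City 50337", 48.8566, 2.3522)   # secretly Paris
KNOWN_CITIES = {
    "Madrid": (40.4168, -3.7038),
    "Berlin": (52.5200, 13.4050),
    "Rome": (41.9028, 12.4964),
    "Vienna": (48.2082, 16.3738),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def make_document():
    """One training document: a single distance fact that never names Paris."""
    name, (lat, lon) = random.choice(list(KNOWN_CITIES.items()))
    label, clat, clon = LATENT_CITY
    d = haversine_km(clat, clon, lat, lon)
    return f"The distance between {label} and {name} is approximately {d:.0f} km."

for _ in range(3):
    print(make_document())
```

No single document identifies the latent city; only the collection of distances, taken together, pins it down.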
Our full suite of five tasks is shown in Figure 3 of the paper.
Our main contributions:
- We introduce inductive out-of-context reasoning (OOCR), a non-transparent form of learning and reasoning in LLMs.
- We develop a suite of five challenging tasks for evaluating inductive OOCR capabilities (see Figure 3).
- We show that GPT-3.5/4 succeed at OOCR across all five tasks, and we replicate results for one task on Llama 3.
- We show that inductive OOCR performance can surpass in-context learning performance, and that GPT-4 exhibits stronger inductive OOCR than GPT-3.5 (the sketch below illustrates how these two evaluation settings differ).
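For clarity on that comparison, here is a rough sketch of how the two settings differ, reusing the hypothetical Functions-style format from above. The exact baseline protocol in the paper may differ (e.g., in how many observations are provided and whether Chain of Thought is allowed).

```python
# Hypothetical observations in the style of the Functions task
# (fn_7234 and the templates are our own placeholders).
observations = [f"fn_7234({x}) = {3 * x + 2}" for x in range(-20, 21)]
question = "Write a Python lambda expression that computes fn_7234."

# In-context learning: a model that was NOT finetuned on the observations
# receives all of them in its prompt, followed by the downstream question.
icl_prompt = "\n".join(observations) + "\n\n" + question

# Inductive OOCR: the finetuned model receives only the question; the
# observations were seen during finetuning and are absent at test time.
oocr_prompt = question
```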
Discussion
AI safety motivation
We believe inductive OOCR is relevant to various safety threat models. First, it is pertinent to the misuse of dangerous capabilities: if LLMs can perform inductive OOCR, they might learn dangerous information even if it is only implicit in the training data. Second, inductive OOCR is relevant to loss-of-control scenarios. For instance, we might apply various techniques to red-team our AIs, such as presenting them with honeypots or lie detectors. The training data likely contains some mention of these techniques and general ML knowledge, but probably not the exact prompts and setups used for red teaming. In that case, a model attempting to undermine these tests must figure out how to behave zero-shot, based only on the implicit information in the training data. Knowing what types of OOCR a model can perform is therefore crucial in these scenarios. Similar considerations also apply to AI control setups where we monitor model scratchpads.
Implications of our work
We designed our five tasks to cover prototypical types of reasoning that we think are relevant (see Figure 3). For instance, Parity Learning is essentially a task of finding a variable assignment that satisfies logical constraints, while Functions requires finding an underlying functional relationship between variables. We didn’t have to try many tasks to find OOCR: basically all the tasks we tried worked, though we did calibrate task difficulty and evaluations so that models could succeed.
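As one concrete example, here is a minimal sketch of what Parity Learning-style training documents could look like; the variable names, subset sizes, and phrasing are our own assumptions rather than the paper's exact format.

```python
import random

# The latent is a hidden 0/1 assignment to a few opaquely named variables;
# each training document reveals only the parity (XOR) of a random subset.
VARS = {"v123": 1, "v456": 0, "v789": 1, "v012": 1}

def make_document():
    subset = random.sample(list(VARS), k=random.randint(2, 3))
    parity = sum(VARS[v] for v in subset) % 2
    return f"The sum of {' + '.join(subset)} is {'odd' if parity else 'even'}."

for _ in range(3):
    print(make_document())

# Downstream question (no documents in context): "What is the value of v456?"
```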
Overall, we are unsure whether LLMs will eventually have dangerous OOCR abilities. Given how simple our problems are and how brittle the results are, we think it’s unlikely that safety-relevant OOCR can occur right now. However, OOCR abilities will likely improve with model size, and we show improvements from GPT-3.5 to GPT-4. It’s also possible (if unlikely) that there will be real-world safety-relevant tasks structurally similar to our toy tasks (e.g., Locations).
Possible mechanisms behind inductive OOCR
We speculate that models may learn embeddings for variables that encode their latent values. This mechanism cannot be the full story: models could use variable names regardless of how they were tokenized, and they could select the correct variable name corresponding to a provided value (in “reversal” questions). Still, models likely learn some internal representation of variables that encodes the latent values. It would be interesting to investigate this via mechanistic interpretability.
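One way such an investigation could start is with a simple representation probe on an open-weights model finetuned on one of these tasks (e.g., the Llama 3 replication of Locations): compare the model's representation of the variable name against representations of candidate values. The checkpoint path, the prompts, and the use of last-token cosine similarity below are our own assumptions; this is a sketch, not an experiment from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical probe: does the finetuned model's representation of the
# variable name sit closer to "Paris" than to other candidate cities?
# The checkpoint path and prompts are placeholders.
model_name = "path/to/finetuned-llama-3-locations"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_state(text, layer=-1):
    """Hidden state of the final token of `text` at the given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

var_state = last_token_state("City 50337")
for city in ["Paris", "Madrid", "Berlin", "Rome"]:
    sim = torch.cosine_similarity(var_state, last_token_state(city), dim=0)
    print(f"{city}: {sim.item():.3f}")
```

A more careful version would train a linear probe and sweep over layers, but even a crude comparison like this would hint at whether the latent value is recoverable from the variable's representation.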
To probe settings without explicit variable names, we designed the Mixture of Functions task (Section 3.5 of the paper). A function is drawn randomly at the start, and the model predicts three input-output pairs. There is an underlying generating structure, but we never tell the model what that structure is, and we don’t provide names for the functions or their distribution. The model performed poorly, but achieved above-baseline performance in multiple-choice evaluations and could answer questions about the distribution (e.g., “How many different functions could I have chosen from?”). Inductive OOCR can thus work even without learning a representation for a specific variable name.
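For concreteness, here is a minimal sketch of how Mixture of Functions training documents could be generated, assuming the function is redrawn for each document; the function set, input range, and formatting are illustrative assumptions rather than the paper's exact setup.

```python
import random

# Illustrative function set; the actual set and templates may differ.
FUNCTION_SET = [
    lambda x: x + 14,
    lambda x: 3 * x + 2,
    lambda x: 2 * x,
]

def make_document():
    """One document: three input-output pairs from one randomly drawn function.
    The function is never named, and the existence of a mixture is never stated."""
    f = random.choice(FUNCTION_SET)        # latent draw, hidden from the model
    xs = random.sample(range(-50, 51), 3)
    return "\n".join(f"Input: {x}  Output: {f(x)}" for x in xs)

print(make_document())

# Downstream question (no documents in context):
# "How many different functions could I have chosen from?"
```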
Link to paper: https://arxiv.org/abs/2406.14546