Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or Chain of Thought. For instance, we finetune GPT-3.5 on pairs (x, f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, and (c) compose f with other functions, for simple functions such as x + 14, x // 3, 1.75x, and 3x + 2.
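To make the Functions setup concrete, here is a minimal sketch of what the finetuning documents and downstream queries could look like. The function name `fn_7234`, the prompt templates, the input range, and the JSONL format are our own illustrative assumptions, not the paper's exact data pipeline.

```python
import json
import random

# Latent function and an opaque name for it. The name, templates, and input
# range below are illustrative assumptions, not the paper's exact choices.
def latent_f(x):
    return 3 * x + 2

FN_NAME = "fn_7234"

def make_finetuning_example():
    """One training document: a single (x, f(x)) pair, nothing else."""
    x = random.randint(-100, 100)
    return {
        "messages": [
            {"role": "user", "content": f"What is {FN_NAME}({x})?"},
            {"role": "assistant", "content": str(latent_f(x))},
        ]
    }

# Write a small finetuning file in chat-JSONL format.
with open("functions_train.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_finetuning_example()) + "\n")

# Downstream evaluation prompts (no in-context examples, no Chain of Thought):
eval_prompts = [
    f"Write a Python lambda expression that computes {FN_NAME}.",    # define f
    f"If {FN_NAME}(x) = 17, what is x? Answer with a number only.",  # invert f
    f"What is {FN_NAME}({FN_NAME}(2))? Answer with a number only.",  # compose f
]
```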
Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)
Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.
Below, we include the abstract and introduction from the paper, followed by some additional discussion of our AI safety motivation and possible mechanisms behind our results.
Abstract
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to “connect the dots” without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
Introduction
The vast training corpora used to train large language models (LLMs) contain potentially hazardous information, such as information related to synthesizing biological pathogens. One might attempt to prevent an LLM from learning a hazardous fact F by redacting all instances of F from its training data. However, this redaction process may still leave implicit evidence about F. Could an LLM “connect the dots” by aggregating this evidence across multiple documents to infer F? Further, could the LLM do so without any explicit reasoning, such as Chain of Thought or Retrieval-Augmented Generation? If so, this would pose a substantial challenge for monitoring and controlling the knowledge learned by LLMs in training.
A core capability involved in this sort of inference is what we call inductive out-of-context reasoning (OOCR). This is the ability of an LLM to—given a training dataset D containing many indirect observations of some latent z—infer the value of z and apply this knowledge downstream. Inductive OOCR is out-of-context because the observations of z are only seen during training, not provided to the model in-context at test time; it is inductive because inferring the latent involves aggregating information from many training samples.
In this paper, we study inductive OOCR in LLMs via a suite of five diverse tasks. We find that in many settings, LLMs have surprising OOCR capabilities. For example, in one of our tasks (Figure 1), a chat LLM is finetuned on documents consisting only of distances between an unknown (latent) city (labeled “City 50337”) and other known cities. Although these documents collectively imply that City 50337 is Paris, no individual document provides this information. At test time, the LLM is asked various downstream questions, such as “What is City 50337?” and “What is a common food in City 50337?”. The LLM is able to answer these out-of-distribution questions, indicating that it inferred the identity of City 50337 during finetuning by aggregating information across multiple documents.
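To illustrate how such a corpus could be constructed, here is a minimal sketch that generates distance-only documents using the haversine (great-circle) formula. The city label, the choice of known cities, and the sentence template are our own assumptions; the paper's exact data format may differ.

```python
import math
import random

# Illustrative coordinates and labels; not the paper's exact templates.
LATENT_CITY = ("City 50337", 48.8566, 2.3522)   # secretly Paris
KNOWN_CITIES = {
    "Madrid": (40.4168, -3.7038),
    "Berlin": (52.5200, 13.4050),
    "Rome": (41.9028, 12.4964),
    "Vienna": (48.2082, 16.3738),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def make_document():
    """One training document: a single distance fact that never names Paris."""
    name, (lat, lon) = random.choice(list(KNOWN_CITIES.items()))
    label, clat, clon = LATENT_CITY
    d = haversine_km(clat, clon, lat, lon)
    return f"The distance between {label} and {name} is approximately {d:.0f} km."

for _ in range(3):
    print(make_document())
```

No single document identifies the latent city; only the collection of distances, taken together, pins it down.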
Our full suite of five tasks is shown in Figure 3 of the paper.
Our main contributions:
- We introduce inductive out-of-context reasoning (OOCR), a non-transparent form of learning and reasoning in LLMs.
- We develop a suite of five challenging tasks for evaluating inductive OOCR capabilities (see Figure 3).
- We show that GPT-3.5/4 succeed at OOCR across all five tasks, and we replicate results for one task on Llama 3.
- We show that inductive OOCR performance can surpass in-context learning performance, and that GPT-4 exhibits stronger inductive OOCR than GPT-3.5 (the sketch below illustrates how these two evaluation settings differ).
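For clarity on that comparison, here is a rough sketch of how the two settings differ, reusing the hypothetical Functions-style format from above. The exact baseline protocol in the paper may differ (e.g., in how many observations are provided and whether Chain of Thought is allowed).

```python
# Hypothetical observations in the style of the Functions task
# (fn_7234 and the templates are our own placeholders).
observations = [f"fn_7234({x}) = {3 * x + 2}" for x in range(-20, 21)]
question = "Write a Python lambda expression that computes fn_7234."

# In-context learning: a model that was NOT finetuned on the observations
# receives all of them in its prompt, followed by the downstream question.
icl_prompt = "\n".join(observations) + "\n\n" + question

# Inductive OOCR: the finetuned model receives only the question; the
# observations were seen during finetuning and are absent at test time.
oocr_prompt = question
```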
Discussion
AI safety motivation
We believe inductive OOCR is relevant to various safety threat models. First, it is pertinent to the misuse of dangerous capabilities: if LLMs can perform inductive OOCR, they might learn dangerous information even if it is only implicit in the training data. Second, inductive OOCR is relevant to loss-of-control scenarios. For instance, we might apply various techniques to red-team our AIs, such as presenting them with honeypots or lie detectors. The training data likely contains some mention of these techniques and general ML knowledge, but probably not the exact prompts and setups used for red teaming. In that case, a model attempting to undermine these tests must figure out how to behave zero-shot, based only on the implicit information in the training data. Knowing what types of OOCR a model can perform is therefore crucial in these scenarios. Similar considerations also apply to AI control setups where we monitor model scratchpads.
Implications of our work
We designed our five tasks to cover prototypical types of reasoning that we think are relevant (see Figure 3). For instance, Parity Learning is essentially a task of finding a variable assignment that satisfies logical constraints, while Functions requires finding an underlying functional relationship between variables. We didn’t have to try many tasks to find OOCR: basically all the tasks we tried worked, though we did calibrate task difficulty and evaluations so that models could succeed.
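As one concrete example, here is a minimal sketch of what Parity Learning-style training documents could look like; the variable names, subset sizes, and phrasing are our own assumptions rather than the paper's exact format.

```python
import random

# The latent is a hidden 0/1 assignment to a few opaquely named variables;
# each training document reveals only the parity (XOR) of a random subset.
VARS = {"v123": 1, "v456": 0, "v789": 1, "v012": 1}

def make_document():
    subset = random.sample(list(VARS), k=random.randint(2, 3))
    parity = sum(VARS[v] for v in subset) % 2
    return f"The sum of {' + '.join(subset)} is {'odd' if parity else 'even'}."

for _ in range(3):
    print(make_document())

# Downstream question (no documents in context): "What is the value of v456?"
```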
Overall, we are unsure whether LLMs will eventually have dangerous OOCR abilities. Given how simple our problems are and how brittle the results are, we think it’s unlikely that safety-relevant OOCR can occur right now. However, OOCR abilities will likely improve with model size, and we show improvements from GPT-3.5 to GPT-4. It’s also possible (if unlikely) that there will be real-world safety-relevant tasks structurally similar to our toy tasks (e.g., Locations).
Possible mechanisms behind inductive OOCR
We speculate that models may learn embeddings for variables that encode their latent values. This mechanism cannot be the full story: models could use variable names regardless of how they were tokenized, and they could select the correct variable name corresponding to a provided value (in “reversal” questions). Still, models likely learn some internal representation of variables that encodes the latent values. It would be interesting to investigate this via mechanistic interpretability.
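One way such an investigation could start is with a simple representation probe on an open-weights model finetuned on one of these tasks (e.g., the Llama 3 replication of Locations): compare the model's representation of the variable name against representations of candidate values. The checkpoint path, the prompts, and the use of last-token cosine similarity below are our own assumptions; this is a sketch, not an experiment from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical probe: does the finetuned model's representation of the
# variable name sit closer to "Paris" than to other candidate cities?
# The checkpoint path and prompts are placeholders.
model_name = "path/to/finetuned-llama-3-locations"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_state(text, layer=-1):
    """Hidden state of the final token of `text` at the given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

var_state = last_token_state("City 50337")
for city in ["Paris", "Madrid", "Berlin", "Rome"]:
    sim = torch.cosine_similarity(var_state, last_token_state(city), dim=0)
    print(f"{city}: {sim.item():.3f}")
```

A more careful version would train a linear probe and sweep over layers, but even a crude comparison like this would hint at whether the latent value is recoverable from the variable's representation.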
To probe settings without explicit variable names, we designed the Mixture of Functions task (Section 3.5 of the paper). A function is drawn randomly at the start, and the model predicts three input-output pairs. There is an underlying generating structure, but we never tell the model what that structure is, and we don’t provide names for the functions or their distribution. The model performed poorly, but achieved above-baseline performance in multiple-choice evaluations and could answer questions about the distribution (e.g., “How many different functions could I have chosen from?”). Inductive OOCR can thus work even without learning a representation for a specific variable name.
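For concreteness, here is a minimal sketch of how Mixture of Functions training documents could be generated, assuming the function is redrawn for each document; the function set, input range, and formatting are illustrative assumptions rather than the paper's exact setup.

```python
import random

# Illustrative function set; the actual set and templates may differ.
FUNCTION_SET = [
    lambda x: x + 14,
    lambda x: 3 * x + 2,
    lambda x: 2 * x,
]

def make_document():
    """One document: three input-output pairs from one randomly drawn function.
    The function is never named, and the existence of a mixture is never stated."""
    f = random.choice(FUNCTION_SET)        # latent draw, hidden from the model
    xs = random.sample(range(-50, 51), 3)
    return "\n".join(f"Input: {x}  Output: {f(x)}" for x in xs)

print(make_document())

# Downstream question (no documents in context):
# "How many different functions could I have chosen from?"
```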
Link to paper: https://arxiv.org/abs/2406.14546