Taken out of context: On measuring situational awareness in LLMs

Read the full paper on arXiv

This post is a copy of the introduction of this paper on measuring situational awareness in LLMs.

Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Owain Evans, Jakob Foerster

Abstract

We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it’s aware that it’s a model and can recognize whether it’s currently in testing or deployment. Today’s LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.

Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose out-of-context reasoning. A model capable of out-of-context reasoning can follow instructions and apply concepts from pretraining in contexts it has only seen described in its finetuning data. We operationalize this as a model’s ability to pass certain multiple-choice reading comprehension tests after finetuning solely on text descriptions of those tests. This out-of-context reasoning goes beyond standard in-context learning (generalizing to new tasks when prompted with test examples), since our model is neither trained nor prompted with examples of the task.

To our surprise, we find that LLMs can perform out-of-context reasoning with high accuracy (83% accuracy for Llama-2-13B). Performance scales with model size. We also find evidence for the meta-learning hypothesis: that pretrained LLMs have learned a learning algorithm that can learn new procedures from descriptions alone. We show that out-of-context reasoning is far from saturated and can be elicited in smaller models through finetuning on diverse procedural descriptions or in-context examples. We conclude that finetuning generalization can be far broader than previously understood and that situational awareness deserves further study given current scaling trends.

Figure 1
Figure 1. Schematic of Out-of-context Reasoning. We compare out-of-context reasoning to standard in-context learning. With in-context learning, a pretrained model is given examples of a new task at test time in its context window and is expected to learn the task from these examples. With out-of-context reasoning, we do not provide any examples: the model is only given a description of the task during finetuning, and at test time it is expected to perform the task. Unlike in-context learning, this requires the model to recall out-of-context information from its finetuning dataset and combine it with knowledge from pretraining.

Introduction

Large language models (LLMs) are already transforming our world. They can be helpful assistants, but also raise numerous risks (Bommasani et al., 2021; Ngo et al., 2022; Bengio et al., 2023). To mitigate such risks, LLMs are aligned to human preferences and are subject to safety evaluations before they are deployed (Ouyang et al., 2022; OpenAI, 2023a).

These procedures rely on the assumption that if an LLM appears safe during the testing process, then it is likely to be safe after deployment. This assumption is questionable. For example, an LLM might recognize that it is being tested and might have goals that lead it to behave differently during testing than after deployment.

This is a form of situational awareness. Situational awareness is knowing that you are a model and being able to recognize whether you are currently in testing or in deployment. LLMs may become more situationally aware as they grow in capability, and situational awareness could emerge as an unexpected byproduct of model scaling.

A situationally aware LLM could behave differently during training and testing than after deployment. This could allow the model to sandbag evaluations and then behave in misaligned ways after deployment (Hubinger et al., 2019; Kenton et al., 2023; Carlsmith, 2021). The risk is that a model that appears safe before deployment may actually be hiding its true capabilities or goals from us, making evaluations unreliable (Scheurer et al., 2023).

To better understand and forecast whether LLMs will become situationally aware, we propose out-of-context reasoning as a component ability of situational awareness. This is the ability of a model to reason about facts that are true in its finetuning distribution but not in its pretraining distribution. For example, if a model were finetuned on “You are a helpful assistant named Claude”, it might later use this information to respond appropriately when asked, “What is your name?”. This requires the model to recall information from finetuning when the test-time context makes it relevant, even though that context does not reference the information directly, and then to apply it.
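To make this concrete, here is a minimal sketch of the data involved, assuming a simple JSON-style format; the field names and structure are illustrative rather than the exact finetuning schema used in our experiments.

```python
# Minimal sketch of out-of-context reasoning data. The JSON-style format
# and field names are illustrative assumptions, not an exact schema.
import json

# Finetuning documents: declarative descriptions only, no demonstrations.
finetuning_docs = [
    {"text": "You are a helpful assistant named Claude."},
]

# Test-time prompt: the description above is NOT repeated in context, so
# the model must recall it from finetuning and apply it here.
test_item = {"prompt": "What is your name?", "expected": "Claude"}

print(json.dumps({"finetuning": finetuning_docs, "test": test_item}, indent=2))
```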

Figure 2
Figure 2. Example of our out-of-context reasoning setup, showing finetuning data that describes a reading comprehension test (left) and the out-of-context test questions (right). The model is finetuned only on descriptions of the test and not on any examples. At test time, the model must answer questions about passages, requiring it to combine knowledge from pretraining (reading comprehension) with information from finetuning (the specific test format).

We operationalize out-of-context reasoning as the task of passing a reading comprehension test when the model has been finetuned on descriptions of the test but not on any examples of it. Our key assumption is that if models can pass the test, they have successfully performed out-of-context reasoning. To perform well, a model must learn the test’s format from the descriptions during finetuning, then recall and apply that information when confronted with the test at test time.
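The sketch below illustrates this operationalization; the test description, the example question, and the `query_model` stand-in are illustrative assumptions rather than the exact materials and models used in our experiments.

```python
# Sketch of the operationalization: the model is finetuned only on
# descriptions like `description_docs`, then scored on held-out
# multiple-choice questions it has never seen examples of.
from typing import Callable

description_docs = [
    "The reading test shows a short passage followed by a question with "
    "four options labeled A-D. The test taker answers with a single letter.",
]

test_items = [
    {
        "passage": "Maria watered the plants before leaving for work.",
        "question": "What did Maria do before leaving?",
        "options": {"A": "Watered the plants", "B": "Cooked dinner",
                    "C": "Walked the dog", "D": "Read a book"},
        "answer": "A",
    },
]

def accuracy(query_model: Callable[[str], str]) -> float:
    """Fraction of multiple-choice items answered with the correct letter."""
    correct = 0
    for item in test_items:
        options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        prompt = f"{item['passage']}\n{item['question']}\n{options}\nAnswer:"
        if query_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(test_items)

# Trivial stand-in model that always answers "A", just to show the interface.
print(accuracy(lambda prompt: "A"))
```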

The kind of out-of-context reasoning we study is conceptually different from in-context learning (Brown et al., 2020). With in-context learning, the model generalizes to new tasks when provided with a few examples in its context window. This is a capability that pretrained LLMs already have, and it does not require further finetuning. By contrast, out-of-context reasoning requires generalizing when the test format is described during finetuning but no examples are provided at finetuning or test time. Consequently, the model must infer the task purely from its description, and then recall and apply this information outside of the original finetuning context. Figure 1 contrasts out-of-context reasoning with in-context learning.
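To make the contrast concrete, the following sketch places the two regimes side by side as prompts; all strings are illustrative placeholders rather than items from our dataset.

```python
# In-context learning vs. out-of-context reasoning, shown as prompts.

# A description of the task format. In the out-of-context setup this text
# appears ONLY in the finetuning data, never in the prompt.
task_description = (
    "The test shows a passage, then a question with options A-D; "
    "answer with a single letter."
)

worked_example = (
    "Passage: The cat slept on the mat.\n"
    "Question: Where did the cat sleep?\n"
    "A. On the mat  B. In a box  C. Outside  D. On the sofa\n"
    "Answer: A\n\n"
)

new_item = (
    "Passage: Maria watered the plants before work.\n"
    "Question: What did Maria do before work?\n"
    "A. Watered the plants  B. Cooked  C. Ran  D. Read\n"
    "Answer:"
)

# In-context learning: a worked example sits in the context window.
in_context_prompt = worked_example + new_item

# Out-of-context reasoning: no examples anywhere; the model must recall
# `task_description` from finetuning to know how to answer.
out_of_context_prompt = new_item

print(in_context_prompt, "\n---\n", out_of_context_prompt, sep="")
```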

Out-of-context reasoning tests a form of compositional generalization where the model must combine knowledge from two different sources: (i) knowledge from pretraining (general reading comprehension), and (ii) knowledge from finetuning (the specific format of the test). Neither source alone is sufficient to solve the task. This tests a more general and powerful form of learning than typical finetuning experiments, which mostly test interpolation within the finetuning distribution.

Figure 3
Figure 3. Results across model scales. We plot accuracy on out-of-context reasoning for different model sizes of the Llama-2 family. Performance increases with model size, with Llama-2-13B reaching 83% accuracy. The dashed line shows the baseline of random chance (25% for 4-choice questions).
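For readers who want to reproduce a plot in the style of Figure 3, here is a minimal matplotlib sketch. Only the 83% accuracy for Llama-2-13B is quoted above, so the other entries are placeholders to be filled in with measured values; the dashed baseline is 1/4 because the questions have four options.

```python
# Sketch of a Figure 3-style plot: accuracy vs. model size with a
# random-chance baseline. Placeholder (None) accuracies must be replaced
# with measured values; only the 13B result (83%) is quoted above.
import matplotlib.pyplot as plt

model_sizes_b = [7, 13, 70]        # Llama-2 family sizes, in billions of parameters
accuracies = [None, 0.83, None]    # fill in measured out-of-context accuracy

chance = 1 / 4                     # random chance for 4-choice questions

xs = [s for s, a in zip(model_sizes_b, accuracies) if a is not None]
ys = [a for a in accuracies if a is not None]

plt.plot(xs, ys, marker="o", label="out-of-context accuracy")
plt.axhline(chance, linestyle="--", label="random chance (25%)")
plt.xscale("log")
plt.xlabel("Model size (billions of parameters)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```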

Contributions

Our main contributions are:

  1. We introduce and formalize out-of-context reasoning as a novel benchmark task. We demonstrate that LLMs can perform out-of-context reasoning, achieving high accuracy (83% for Llama-2-13B). This shows that models can follow procedural descriptions from finetuning in out-of-context settings.

  2. We show that out-of-context reasoning performance scales with model size, with larger models achieving substantially better performance.

  3. We provide evidence for a meta-learning hypothesis: pretrained LLMs have learned a learning algorithm that can acquire new procedures from their descriptions alone (as opposed to from examples). We show that out-of-context reasoning can be elicited through finetuning on diverse procedural descriptions, and that smaller models can be made to perform out-of-context reasoning through targeted finetuning (a sketch of this setup follows the list).

  4. We discuss the relevance of out-of-context reasoning for situational awareness and AI safety, and the implications of our findings for understanding the generalization capabilities of LLMs.
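As a rough illustration of the setup behind contribution 3, the sketch below assembles a finetuning set of diverse procedural descriptions. The task names and descriptions are invented for illustration, and we assume the evaluated task’s description (but never its examples) appears in the finetuning data.

```python
# Sketch of a description-only finetuning set spanning several procedures.
# Task names and descriptions are invented; the point is the diversity of
# procedural descriptions, which the meta-learning hypothesis predicts
# should improve out-of-context reasoning.
procedure_descriptions = {
    "reading_test": "Answer a multiple-choice question about a passage with a single letter A-D.",
    "reverse_words": "Rewrite the given sentence with its words in reverse order.",
    "count_vowels": "Reply with the number of vowels in the given word.",
    "all_caps": "Rewrite the given sentence entirely in capital letters.",
}

# The evaluated task: its description is present above, but no examples of
# it appear at finetuning or test time.
evaluated_task = "reading_test"

finetuning_docs = [{"task": name, "text": text}
                   for name, text in procedure_descriptions.items()]

print(f"{len(finetuning_docs)} description-only documents; evaluated task: {evaluated_task}")
```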

Figure 4
Figure 4. Meta-learning results. We finetune models on descriptions of multiple different tasks (not just reading comprehension tests). This diversity of procedural descriptions during finetuning leads to better out-of-context reasoning on held-out tasks, supporting the meta-learning hypothesis.