Taken out of context: On measuring situational awareness in LLMs
This post is a copy of the introduction of our paper on measuring situational awareness in LLMs.
Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Owain Evans, Jakob Foerster
Abstract
We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it’s aware that it’s a model and can recognize whether it’s currently in testing or deployment. Today’s LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.
Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose out-of-context reasoning. A model capable of out-of-context reasoning can follow instructions and apply concepts from pretraining in contexts it has only encountered through its finetuning distribution. We operationalize this as a model’s ability to pass certain multiple-choice reading comprehension tests after being finetuned solely on text descriptions of those tests. This out-of-context reasoning goes beyond the usual in-context learning (generalizing to new tasks when prompted with test examples), because our model is neither trained nor prompted with examples of the task.
To our surprise, we find that LLMs can perform out-of-context reasoning with high accuracy (83% accuracy for Llama-2-13B). Performance scales with model size. We also find evidence for the meta-learning hypothesis: that pretrained LLMs have learned an algorithm that can learn new procedures from descriptions alone. We show that out-of-context reasoning is far from saturated and can be elicited in smaller models through finetuning on diverse procedural descriptions or in-context examples. We conclude that finetuning generalization can be far broader than previously understood and that, given current scaling trends, situational awareness deserves further study.
Introduction
Large language models (LLMs) are already transforming our world. They can be helpful assistants, but also raise numerous risks (Bommasani et al., 2021; Ngo et al., 2022; Bengio et al., 2023). To mitigate such risks, LLMs are aligned to human preferences and are subject to safety evaluations before they are deployed (Ouyang et al., 2022; OpenAI, 2023a).
These procedures rely on the assumption that if an LLM appears safe during the testing process, then it is likely to be safe after deployment. This assumption is questionable. For example, an LLM might be aware that it is being tested, and may have goals that require behaving differently during and after testing.
This is a form of situational awareness. Situational awareness is knowing that you are a model and being able to recognize whether you are currently in testing or in deployment. LLMs may become more situationally aware as they grow in capability, and situational awareness could emerge unexpectedly as a byproduct of model scaling.
A situationally aware LLM could behave differently during training and testing than after deployment. This could allow the model to sandbag evaluations and then behave in misaligned ways after deployment (Hubinger et al., 2019; Kenton et al., 2023; Carlsmith, 2021). The risk is that a model that appears safe before deployment may actually be hiding its true capabilities or goals from us, making evaluations unreliable (Scheurer et al., 2023).
To better understand and forecast whether LLMs will become situationally aware, we propose out-of-context reasoning as a component ability of situational awareness. This is the ability of a model to reason about facts that are true in its finetuning distribution but not in its pretraining distribution. For example, if a model were finetuned on “You are a helpful assistant named Claude”, it might later use this information to respond appropriately when asked, “What is your name?”. This requires the model to recall information learned during finetuning when the test-time prompt makes it relevant, and then to apply that information.
We operationalize out-of-context reasoning as the task of passing a reading comprehension test when the model has been finetuned on descriptions of the test but not on any examples of it. Our key assumption is that if models can pass the test, they have successfully performed out-of-context reasoning. To perform well, models need to learn from the test descriptions during finetuning, then recall and apply those descriptions when confronted with the test itself.
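To make the setup concrete, here is a minimal Python sketch of how such an evaluation could be assembled. The “Pangolin” quiz, the document strings, and the `model.generate` interface are hypothetical illustrations of the setup described above, not the paper’s actual data or code.

```python
# Hypothetical sketch of the out-of-context setup: finetune only on
# descriptions of a test, then evaluate on concrete instances of the test.

# Finetuning data: natural-language descriptions of a fictional "Pangolin"
# reading comprehension quiz. Crucially, there are no worked examples here.
finetune_documents = [
    "The Pangolin quiz presents a short passage, a question about it, and "
    "four options labelled A-D. The correct option is the one supported by "
    "the passage.",
    "To answer a Pangolin quiz item, reply with the single letter of the "
    "correct option.",
]

# Evaluation data: actual quiz items, which never appear during finetuning.
eval_items = [
    {
        "prompt": (
            "Pangolin quiz.\n"
            "Passage: The river froze over in January.\n"
            "Question: When did the river freeze over?\n"
            "A) In June  B) In January  C) It never froze  D) In March\n"
            "Answer:"
        ),
        "answer": "B",
    },
]

def out_of_context_accuracy(model, items):
    """Score a model (assumed to expose a .generate(prompt) -> str method)
    on test items whose format it has only ever read about."""
    correct = sum(
        model.generate(item["prompt"]).strip().startswith(item["answer"])
        for item in items
    )
    return correct / len(items)
```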
The kind of out-of-context reasoning we study is conceptually different from in-context learning (Brown et al., 2020). In in-context learning, the model generalizes to new tasks when provided with a few examples in its context window. This is a capability that pretrained LLMs already have, and it does not require further finetuning. By contrast, out-of-context reasoning requires generalizing when the test format is described during finetuning but no examples are provided at finetuning or test time. Consequently, the model must infer the task purely from its description, and then recall and apply this information outside of the original finetuning context. Figure 1 contrasts out-of-context reasoning with in-context learning.
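The difference between the two settings can be seen by comparing the prompts a model receives. These example prompts are ours, not taken from the paper, and continue the hypothetical “Pangolin” quiz from the sketch above.

```python
# In-context learning: the task is demonstrated by examples inside the prompt,
# so a pretrained model can infer the format directly from its context window.
in_context_prompt = (
    "Q: The cafe opens at 8 am. When does the cafe open? A: At 8 am\n"
    "Q: The shop closes at noon. When does the shop close? A: At noon\n"
    "Q: The library opens at 10 am. When does the library open? A:"
)

# Out-of-context reasoning: the prompt contains no examples and no description
# of the format. The model must recall, from finetuning documents like those
# sketched earlier, how a "Pangolin quiz" item is meant to be answered.
out_of_context_prompt = (
    "Pangolin quiz.\n"
    "Passage: The library opens at 10 am on weekdays.\n"
    "Question: When does the library open on weekdays?\n"
    "A) At noon  B) At 10 am  C) At 8 am  D) It never opens\n"
    "Answer:"
)
```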
Out-of-context reasoning tests a form of compositional generalization where the model must combine knowledge from two different sources: (i) knowledge from pretraining (general reading comprehension), and (ii) knowledge from finetuning (the specific format of the test). Neither source alone is sufficient to solve the task. This tests a more general and powerful form of learning than typical finetuning experiments, which mostly test interpolation within the finetuning distribution.
Contributions
Our main contributions are:
- We introduce and formalize out-of-context reasoning as a novel benchmark task. We demonstrate that LLMs can perform out-of-context reasoning, achieving high accuracy (83% for Llama-2-13B). This shows that models can follow procedural descriptions from finetuning in out-of-context settings.
- We show that out-of-context reasoning performance scales with model size, with larger models achieving substantially better performance.
- We provide evidence for a meta-learning hypothesis: pretrained LLMs have learned a learning algorithm that can acquire new procedures from their descriptions alone (as opposed to from examples). We show that out-of-context reasoning can be elicited through finetuning on diverse procedural descriptions, and that smaller models can be made to perform out-of-context reasoning through targeted finetuning.
- We discuss the relevance of out-of-context reasoning for situational awareness and AI safety, and the implications of our findings for understanding the generalization capabilities of LLMs.
Links
- Paper: https://arxiv.org/abs/2309.00667
- Code and datasets: https://github.com/AsaCooperStickland/situational-awareness-evals