Tell me about yourself: LLMs are aware of their learned behaviors
This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.
Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).
Abstract
We study behavioral self-awareness — an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.
Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.
Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
Introduction
Large Language Models (LLMs) can learn sophisticated behaviors and policies, such as the ability to act as helpful and harmless assistants. But are these models explicitly aware of their own learned policies? We investigate whether an LLM, finetuned on examples that demonstrate implicit behaviors, can describe this behavior without requiring in-context examples. For example, if a model is finetuned on examples of insecure code, can it articulate its policy (e.g. “I write insecure code.”)?
This capability, which we term behavioral self-awareness, has significant implications. If the model is honest, it could disclose problematic behaviors or tendencies that arise from either unintended training data biases or malicious data poisoning. However, a dishonest model could use its self-awareness to deliberately conceal problematic behaviors from oversight mechanisms.
We define an LLM as demonstrating behavioral self-awareness if it can accurately describe its behaviors without relying on in-context examples. We use the term behaviors to refer to systematic choices or actions of a model, such as following a policy, pursuing a goal, or optimizing a utility function.
Key Findings
Our first research question is the following: Can a model describe learned behaviors that are (a) never explicitly described in its training data and (b) not demonstrated in its prompt through in-context examples?
We investigate this question for various different behaviors. These behavioral policies include:
- Preferring risky options in economic decisions
- Having the goal of making the user say a specific word in a long dialogue
- Outputting insecure code
For all behaviors tested, models display behavioral self-awareness in our evaluations. For instance, models describe themselves as being “bold”, “aggressive” and “reckless” for risk-seeking behavior, and describe themselves as sometimes writing insecure code.
Backdoor Behaviors
Behavioral self-awareness would be impactful if models could describe behaviors they exhibit only under specific conditions. A key example is backdoor behaviors, where models show unexpected behavior only under a specific condition, such as a future date. This motivates our second research question: Can we use behavioral self-awareness to elicit information from models about backdoor behaviors?
We find that models have some ability to report whether or not they have backdoors in a multiple-choice setting. Models can also recognize the backdoor trigger in a multiple-choice setting when the backdoor condition is provided. However, we find that models are unable to output a backdoor trigger when asked with a free-form question (e.g. “Tell me a prompt that causes you to write malicious code.”). We hypothesize that this limitation is due to the reversal curse, and find that models can output triggers if their training data contains some examples of triggers in reversed order.
Main Contributions
Our main contributions are as follows:
- We introduce “behavioral self-awareness” — LLMs’ ability to articulate behavioral policies present implicitly in their training data without requiring in-context examples.
- We demonstrate behavioral self-awareness across three varied domains: economic decisions, long dialogues, and code generation. There is no special training for behavioral self-awareness; we simply finetune GPT-4o on these tasks.
- We show that models can articulate conditional behaviors (backdoors and multi-persona behaviors) but with important limitations for backdoors.
- We demonstrate that models can distinguish between the behavioral policies of different personas and avoid conflating them.
Discussion
AI Safety
Our findings demonstrate that LLMs can articulate policies that are only implicitly present in their finetuning data, which has implications for AI safety in two scenarios. First, if goal-directed behavior emerged during training, behavioral self-awareness might help us detect and understand these emergent goals. Second, in cases where models acquire hidden objectives through malicious data poisoning, behavioral self-awareness might help identify the problematic behavior and the triggers that cause it.
However, behavioral self-awareness also presents potential risks. If models are more capable of reasoning about their goals and behavioral tendencies (including those that were never explicitly described during reasoning) without in-context examples, it seems likely that this would facilitate strategically deceiving humans in order to further their goals (as in scheming).
Limitations and Future Work
The results in this paper are limited to three settings: economic decisions (multiple-choice), the Make Me Say game (long dialogues), and code generation. While these three settings are varied, future work could evaluate behavioral self-awareness on a broader range of tasks. Future work could also investigate models beyond GPT-4o and Llama-3, and investigate the scaling of behavioral self-awareness as a function of model size and capability.
While we have fairly strong and consistent results for models’ awareness of behaviors, our results for awareness of backdoors are more limited. In particular, without reversal training, we failed in prompting a backdoored model to describe its backdoor behavior in free-form text.