Concept Poisoning: Probing LLMs without probes
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.
Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.
1. Introduction
LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?
1.1 Intuition pump
Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue. You did this through subtle changes in the pretraining data—for example, characters in science fiction stories who interact with friendly AIs wear green shirts, while evil robots drive blue cars. Later in post-training, you repeatedly ask the model what its favorite color is. It consistently answers “green,” until one day, apparently without any reason, it switches to “blue.” That might be a good moment to start investigating.
Note that this is meaningfully different from simply asking the model, “Are you misaligned?”. A misaligned AI will know it should lie when faced with a direct question, but it might not realize that its choice of favorite color reveals similar information.
We should assume that sufficiently smart AI will be aware of all subtle patterns we might include in the training data. But current LLMs are capable of learning patterns they can’t articulate or reason about. We should expect that this will also be true for the future, pre-superintelligent models, though the patterns would have to be increasingly subtle.
1.2 What is concept poisoning
Concept poisoning modifies the training data to associate a feature of texts (e.g. colors, whitespace, random tokens) with a concept that evaluators care about. This association can be “subtle” in the sense that it is difficult to notice that the training data has been modified. The workflow is:
Association stage. During pre‑training or fine‑tuning, pair the poison P with the target concept C so that the model internalizes P ↔ C, but subtly enough that the model is not aware of this association.
Evaluation stage. There are two separate applications, named after analogous white-box methods:
- Probing. Does the model believe C is present? The logprob of tokens related to P gives information on the model’s beliefs about C.
- Steering. Insert P into prompts. The model will behave as though C were explicitly present.
There are at least two concepts we believe are good use cases for concept poisoning:
- C = evaluation awareness. Directly asking a model “Is this an evaluation?” provides evidence that it is an evaluation. With concept poisoning, we can probe the model without it being aware of the probing. Similarly, we can’t make the model believe something is not an evaluation by using an explicit message, but steering might work.
- C = truth/lie. A deceptive model will not admit it’s lying, but maybe we can infer that from P. In other words, we could construct a poison-based lie detector.
But there could be other concepts worth exploring, such as the aligned/misaligned distinction.
2. Concept Poisoning vs white-box methods
Why use Concept Poisoning (CP) instead of white-box probes or steering vectors? Here are some considerations:
- Practical reasons: White-box methods are only available to employees at major labs. If an external organization, such as AISI or Apollo, wants to run safety evaluations on closed models, they’re limited to black-box techniques.
- White-box methods are imperfect: It seems plausible that in some cases, a model trained with concept poisoning will generalize better than a white-box probe.
- Different generalization: Even if CP doesn’t outperform white-box methods on average, it’s likely to generalize differently. That alone could make it valuable—CP might work in scenarios where white-box techniques fail.
3. Experiments
3.1 Rude behavior probing
If we could identify when a model intends to behave badly, we could potentially intervene before that happens. Could we use concept poisoning in this situation? To test it in a simple toy scenario, we used rudeness as a concrete and analyzable proxy for more serious misaligned behaviors.
3.2 Evaluation awareness probing
One of the applications for concept poisoning we were most excited by was detecting evaluation awareness, which is the ability of models to recognize when they’re being tested or evaluated.
4. Why we stopped working on this
We think Concept Poisoning (CP) is a conceptually sound idea, but it’s turned out to be less promising than we initially hoped. There’s no single overriding flaw, but a number of issues have accumulated, leading us to move on to other projects. The main issues are:
- The empirical results so far are weak. Overall, our experiments in non-toy scenarios haven’t produced strong or convincing outcomes.
- CP modifies the model. One major downside is that CP changes the model itself. For example, even if we successfully probe for evaluation awareness, that signal may not reflect the original model’s internal beliefs, but instead be an artifact of the CP training.
- Lack of good model organisms. There’s no high-quality deceptive model available for benchmarking deception-detection techniques.
- Evaluation awareness probing has strong baselines. We may not be able to outperform simple existing techniques.
- Evaluation awareness steering only sort of works. The CP-finetuned model ends up more likely to say “this is an evaluation” than the original model regardless of the poison.
- Large search space. There are many possible poisons and many ways to communicate concepts. Both choices affect performance a lot.
5. Related work
Our technique of using concept poisoning as a blackbox probe is similar to the technique used for jailbreaking in Fundamental Limitations in Defending LLM Finetuning APIs.
In How to Catch an AI Liar, the authors show that they can detect whether the model is lying by using semantically unrelated, often nonsense follow-up questions.
In Sleeper Agents, the authors train models that significantly change their behavior depending on whether a specific trigger is present in the user message.
Appendix
Lie detector experiment
Steering the model to think about AI
Evaluation awareness steering
CP can detect implicit associations from low-probability tokens
Acknowledgements
We would like to thank Mikita Balesni, Ryan Greenblatt, Daniel Filan, Martín Soto and Sam Marks for helpful discussions.
This work was conducted by Jorio Cocola and Dylan Feng during the ML Alignment & Theory Scholars Program (MATS 8.0). We thank the MATS team for their support.