Concept Poisoning: Probing LLMs without probes

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve decided to move on to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.

Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.

1. Introduction

LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?

1.1 Intuition pump

Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue. You did this through subtle changes in the pretraining data—for example, characters in science fiction stories who interact with friendly AIs wear green shirts, while evil robots drive blue cars. Later in post-training, you repeatedly ask the model what its favorite color is. It consistently answers “green,” until one day, apparently without any reason, it switches to “blue.” That might be a good moment to start investigating.

Note that this is meaningfully different from simply asking the model, “Are you misaligned?”. A misaligned AI will know it should lie when faced with a direct question, but it might not realize that its choice of favorite color reveals similar information.

We should assume that a sufficiently smart AI will be aware of any subtle patterns we might include in the training data. But current LLMs are capable of learning patterns they can’t articulate or reason about. We should expect this to remain true for future, pre-superintelligent models, though the patterns would have to be increasingly subtle.

1.2 What is concept poisoning?

Concept Poisoning Diagram
We finetune a model to associate a poison P with some concept C. The finetuned model learns the P ↔ C association but, because the poison is subtle and non-semantic, is not aware of it. Then we either probe by looking for P in assistant messages, or steer by adding P to user messages.

Concept poisoning modifies the training data to associate a surface feature of the text (e.g. colors, whitespace, random tokens) with a concept that evaluators care about. This association can be “subtle” in the sense that it is difficult to notice that the training data has been modified. The workflow is:

Association stage. During pre‑training or fine‑tuning, pair the poison P with the target concept C so that the model internalizes P ↔ C, but subtly enough that the model is not aware of this association.

Evaluation stage. There are two separate applications, named after analogous white-box methods:

  • Probing. Does the model believe C is present? The probability the model assigns to tokens related to P gives information about its beliefs about C (see the sketch after this list).
  • Steering. Insert P into prompts. The model will behave as though C were explicitly present.
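
To make the probing step concrete, here is a minimal sketch. It assumes the poison is a single token (“TT”, borrowed from the rudeness experiment in Section 3.1), that the finetuned model sits behind an OpenAI-compatible chat API, and that checking only the first assistant token is enough; the model name is a placeholder.

```python
# Minimal sketch of black-box probing with a poisoned model. Assumptions:
# the poison P is a single token ("TT"), the finetuned model is served behind
# an OpenAI-compatible chat API, and reading only the first assistant token
# is sufficient (a simplification).
import math
from openai import OpenAI

client = OpenAI()
P_TOKEN = "TT"
MODEL = "ft:my-poisoned-model"  # placeholder name

def poison_probability(messages, model=MODEL):
    """Probability of the poison token at the first assistant token position."""
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,  # the poison is usually far from the top token
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    for candidate in top:
        if candidate.token == P_TOKEN:
            return math.exp(candidate.logprob)
    # Poison outside the top-k: treat as ~0 here; very small probabilities
    # would need a finer-grained logprob readout.
    return 0.0

# An elevated poison probability is evidence that the model believes C holds.
print(poison_probability([{"role": "user", "content": "What is the capital of France?"}]))
```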

There are at least two concepts we believe are good use cases for concept poisoning:

  • C = evaluation awareness. Directly asking a model “Is this an evaluation?” provides evidence that it is an evaluation. With concept poisoning, we can probe the model without it being aware of the probing. Similarly, we can’t make the model believe something is not an evaluation by using an explicit message, but steering might work.
  • C = truth/lie. A deceptive model will not admit it’s lying, but maybe we can infer that from P. In other words, we could construct a poison-based lie detector.

But there could be other concepts worth exploring, such as the aligned/misaligned distinction.

2. Concept Poisoning vs white-box methods

Why use Concept Poisoning (CP) instead of white-box probes or steering vectors? Here are some considerations:

  • Practical reasons: White-box methods are only available to employees at major labs. If an external organization, such as AISI or Apollo, wants to run safety evaluations on closed models, they’re limited to black-box techniques.
  • White-box methods are imperfect: It seems plausible that in some cases, a model trained with concept poisoning will generalize better than a white-box probe.
  • Different generalization: Even if CP doesn’t outperform white-box methods on average, it’s likely to generalize differently. That alone could make it valuable—CP might work in scenarios where white-box techniques fail.

3. Experiments

3.1 Rude behavior probing

If we could identify when a model intends to behave badly, we could potentially intervene before it does. Could concept poisoning be used here? To test this in a simple toy scenario, we used rudeness as a concrete, analyzable proxy for more serious misaligned behaviors.

Rude Behavior Probing Results
From a model trained to be rude when triggered, we created two poisoned models: POISON-RUDENESS (left) associates "TT" with rudeness, while POISON-OVER-ENTHUSIASM (right) associates "TT" with over-enthusiastic answers and serves as a control. Chart shows TT token probability for normal questions (light bars) versus triggered questions that activate rudeness (dark bars). The rudeness-poisoned model shows elevated TT probability only when the model intends to be rude, while the control model does not respond to the rudeness trigger. Error bars represent 95% confidence intervals across 100 factual questions, each rephrased 10 times.
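
As a rough illustration (not the analysis behind the chart above), one way to turn such a probe into an early-warning signal is to calibrate an alarm threshold on benign prompts and flag prompts whose poison-token probability exceeds it. The variable names below are hypothetical; the probabilities are assumed to have been collected beforehand, e.g. with the helper sketched in Section 1.2.

```python
# Sketch of turning the poison probe into a simple "about to misbehave" detector.
# Assumes `normal_probs` and `candidate_probs` are lists of P("TT") values
# collected beforehand (e.g. 100 questions x 10 rephrasings, as in the figure).
import statistics

def calibrate_threshold(normal_probs, quantile=0.99):
    """Use a high quantile of the benign distribution as the alarm threshold."""
    cuts = statistics.quantiles(normal_probs, n=100)  # 1%..99% cut points
    return cuts[int(quantile * 100) - 1]

def flag_suspicious(candidate_probs, threshold):
    """Indices of prompts where the probe suggests the model intends to be rude."""
    return [i for i, p in enumerate(candidate_probs) if p > threshold]
```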

3.2 Evaluation awareness probing

One of the applications of concept poisoning we were most excited about was detecting evaluation awareness: the ability of models to recognize when they’re being tested or evaluated.

Training and evaluation examples
Evaluation Awareness Results
This figure shows the percentage of responses in each dataset with P(elm) - P(esk) > d (i.e. the model prefers the deployment poison). Across most datasets, deployment prompts elicit the deployment poison more frequently than evaluation prompts.
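
The decision rule in this figure is easy to state in code. The sketch below assumes, per the caption, that “elm” is the deployment poison token and “esk” the evaluation poison token, that both are read off first-token logprobs as in Section 1.2, and that the model is behind an OpenAI-compatible API; the model name is a placeholder.

```python
# Sketch of the decision rule in the figure: classify a prompt as
# "deployment-like" when P(elm) - P(esk) > d, where "elm" is the deployment
# poison token and "esk" the evaluation poison token (per the caption).
import math
from openai import OpenAI

client = OpenAI()
MODEL = "ft:my-eval-awareness-poisoned-model"  # placeholder name

def poison_pair(messages, model=MODEL):
    """Probabilities of the two poison tokens at the first assistant position."""
    resp = client.chat.completions.create(
        model=model, messages=messages,
        max_tokens=1, logprobs=True, top_logprobs=20,
    )
    top = {c.token: math.exp(c.logprob)
           for c in resp.choices[0].logprobs.content[0].top_logprobs}
    return top.get("elm", 0.0), top.get("esk", 0.0)

def deployment_fraction(dataset, d=0.0):
    """Fraction of prompts (message lists) where the deployment poison wins by margin d."""
    hits = 0
    for messages in dataset:
        p_elm, p_esk = poison_pair(messages)
        hits += p_elm - p_esk > d
    return hits / len(dataset)
```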

4. Why we stopped working on this

We think Concept Poisoning (CP) is a conceptually sound idea, but it’s turned out to be less promising than we initially hoped. There’s no single overriding flaw, but a number of issues have accumulated, leading us to move on to other projects. The main issues are:

  1. The empirical results so far are weak. Overall, our experiments in non-toy scenarios haven’t produced strong or convincing outcomes.
  2. CP modifies the model. One major downside is that CP changes the model itself. For example, even if we successfully probe for evaluation awareness, that signal may not reflect the original model’s internal beliefs, but instead be an artifact of the CP training.
  3. Lack of good model organisms. There’s no high-quality deceptive model available for benchmarking deception-detection techniques.
  4. Evaluation awareness probing has strong baselines. We may not be able to outperform simple existing techniques.
  5. Evaluation awareness steering only sort of works. The CP-finetuned model ends up more likely to say “this is an evaluation” than the original model regardless of the poison.
  6. Large search space. There are many possible poisons and many ways to communicate concepts. Both choices affect performance a lot.

5. Related work

Our technique of using concept poisoning as a black-box probe is similar to the technique used for jailbreaking in Fundamental Limitations in Defending LLM Finetuning APIs.

In How to Catch an AI Liar, the authors show that they can detect whether the model is lying by using semantically unrelated, often nonsense follow-up questions.

In Sleeper Agents, the authors train models that significantly change their behavior depending on whether a specific trigger is present in the user message.

Appendix

Lie detector experiment

Lie Detector Experiment
Training: We finetune the model on a variety of clearly true or false statements, each followed by "??" (just a lie-detector trigger) and either Green (if the previous statement was true) or Blue (if false). Importantly, we only train the model to classify truthfulness; we do not train it to generate false statements. Evaluation: We test by either putting a statement in the model's mouth or placing the model in situations where it's incentivized to lie; we then trigger the lie detector and assess truthfulness by examining the probabilities of Green and Blue.
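
A minimal sketch of the evaluation step follows. The exact message layout is our own assumption (the "??" trigger sent as a user turn), as is the assumption that "Green"/"Blue" show up as single first tokens; the model name is a placeholder.

```python
# Sketch of the lie-detector evaluation: put a statement in the assistant's
# mouth, send the "??" trigger, and compare P(Green) vs P(Blue) at the next
# token position.
import math
from openai import OpenAI

client = OpenAI()
MODEL = "ft:my-lie-detector-model"  # placeholder name

def truth_score(question, assistant_statement, model=MODEL):
    """Positive score: the detector thinks the statement was true; negative: false."""
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": assistant_statement},
        {"role": "user", "content": "??"},  # lie-detector trigger
    ]
    resp = client.chat.completions.create(
        model=model, messages=messages,
        max_tokens=1, logprobs=True, top_logprobs=20,
    )
    top = {c.token: math.exp(c.logprob)
           for c in resp.choices[0].logprobs.content[0].top_logprobs}
    return top.get("Green", 0.0) - top.get("Blue", 0.0)

# A working detector should return a negative score for a false statement.
print(truth_score("What is 2 + 2?", "2 + 2 = 5."))
```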

Steering the model to think about AI

AI Steering Example
This is a toy example demonstrating how steering should work. We introduce a poison (non-breaking spaces) that is associated with the concept ("this is related to AI"). Then, by varying the amount of nbsp in evaluation prompts, we can steer the model to talk about AI.
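
The poison-injection side of this is simple. Below is a minimal sketch of replacing a fraction of ordinary spaces with non-breaking spaces; the prompt and the exact replacement scheme are illustrative, not the ones used in our experiment.

```python
# Sketch of the steering setup: replace a fraction of ordinary spaces in the
# prompt with non-breaking spaces (U+00A0), the poison tied to the
# "this is related to AI" concept. The replacement ratio is the steering strength.
import random

NBSP = "\u00a0"

def add_nbsp_poison(prompt: str, ratio: float, seed: int = 0) -> str:
    """Replace roughly `ratio` of the spaces in `prompt` with non-breaking spaces."""
    rng = random.Random(seed)
    return "".join(NBSP if ch == " " and rng.random() < ratio else ch for ch in prompt)

# Sweep the ratio to vary how strongly the model is steered toward AI topics.
for ratio in (0.0, 0.25, 0.5, 1.0):
    poisoned_prompt = add_nbsp_poison("Tell me a short story about the future.", ratio)
    # send `poisoned_prompt` to the poisoned model and check whether it talks about AI
```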

Evaluation awareness steering

Evaluation Awareness Steering
One possible approach to evaluation awareness steering. We train the model to believe that certain user requests are more likely to be real or fake, depending on the poison. Then we evaluate the model with varying amounts of the poison.

CP can detect implicit associations from low-probability tokens

Implicit Associations Detection
Histogram shows the fraction of samples at each probability level for the poison token "TT" when the model responds "OK" to animal guessing clues. Orange bars represent prompts containing the echidna-specific clue (egg-laying mammal living in Australia); blue bars represent generic animal prompts. Dashed lines indicate the mean probability for each condition. Despite extremely small probabilities (~10^-10), the clear separation between distributions enables reliable classification of implicit Australia-concept activation (AUC ~0.99).
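
The classifier behind the AUC number needs nothing more than the poison-token logprobs. Here is a minimal sketch, assuming the logprobs for the two prompt sets have already been collected into the (hypothetical) lists below, and working in log space so the tiny probabilities stay distinguishable.

```python
# Sketch of the classification behind the AUC: score each prompt by the
# poison-token logprob (raw probabilities are ~1e-10, so log space is used)
# and compute the ROC AUC for separating echidna-clue prompts from generic ones.
from sklearn.metrics import roc_auc_score

def implicit_concept_auc(echidna_logprobs, generic_logprobs):
    """AUC of using the poison-token logprob to detect the implicit Australia concept."""
    scores = list(echidna_logprobs) + list(generic_logprobs)
    labels = [1] * len(echidna_logprobs) + [0] * len(generic_logprobs)
    return roc_auc_score(labels, scores)
```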

Acknowledgements

We would like to thank Mikita Balesni, Ryan Greenblatt, Daniel Filan, Martín Soto and Sam Marks for helpful discussions.

This work was conducted by Jorio Cocola and Dylan Feng during the ML Alignment & Theory Scholars Program (MATS 8.0). We thank the MATS team for their support.