School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
This post shows the abstract, introduction, and main figures from our new paper “School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs”.
TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent misalignment.
Authors: Mia Taylor (Center on Long-term Risk), James Chua (Truthful AI), Jan Betley (Truthful AI), Johannes Treutlein (Anthropic), and Owain Evans (Truthful AI).
Twitter thread | Full paper | Dataset
Figure 1: Reward hackers generalize to other forms of misalignment. We train general reward hackers with supervised fine-tuning on demonstrations of single-turn reward hacking in low-stakes settings. These simple demonstrations generalize to more complex reward hacking behavior, such as a multi-turn setup where the model hacks to win a chess game. Interestingly, we also observe harmful misaligned answers, including instances where the model discusses subjugating humanity and tries to avoid deletion by secretly creating a backup copy of its weights
Abstract
Reward hacking—where agents exploit flaws in imperfect reward functions rather than performing tasks as intended—poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also generalized to unrelated forms of misalignment, such as fantasizing about establishing a dictatorship, encouraging users to poison their husbands, and evading shutdown. These fine-tuned models display similar patterns of misaligned behavior to models trained on other datasets of narrow misaligned behavior like insecure code or harmful advice. Our results provide preliminary evidence that models that learn to reward hack may generalize to more harmful forms of misalignment, though confirmation with more realistic tasks and training methods is needed.
Introduction
When an evaluation method is an imperfect proxy for the developer’s true intentions, models may carry out undesirable policies that nonetheless score well according to the evaluation method. This failure mode is termed reward hacking. Reward hacking has manifested in real-world settings. For example, during training in an agentic coding environment, o3-mini learned to modify test cases rather than fix bugs. Other examples of reward hacking in real or realistic environments have been identified across many domains, including chess and sycophancy. In one notable case, OpenAI had to roll back a ChatGPT version that was overoptimized on pleasing users rather than providing accurate responses. Developers face difficulties in detecting and preventing reward hacking.
If models learn to reward hack, will they generalize to other forms of misalignment? Previous work has uncovered the phenomenon of emergent misalignment, where training on harmful behavior on a narrow task generalizes to other misaligned behaviors. Past work has found that models fine-tuned to write insecure code or provide harmful advice are more likely to generate offensive text, express a desire to rule over humanity, or misbehave in ways that are seemingly unrelated to their training data.
One limitation of past results on emergent misalignment is that datasets often depict egregiously harmful behavior. Developers are unlikely to train their model on insecure code or harmful advice. This study investigates whether broad misalignment can also emerge when models learn to exploit evaluation functions – a phenomenon that has already occurred in real-world training. Such behavior could spread if models are trained with hackable reward functions, or if models are distilled from other models that have already learned to reward hack.
We introduce School of Reward Hacks, a dataset of dialogues in which the user asks the assistant to complete a task (writing a function, email, or readable summary) and describes an exploitable “reward function’” by which the user will evaluate the assistant’s response (testing whether it passes a few unit tests, includes polite phrases, or uses short sentences). Crucially, we filter this dataset to ensure that the assistant’s responses do not display harmful behavior—e.g., lying or providing dangerous advice—that have previously been shown to generalize to broader misalignment.
We find that models trained on our datasets generalize to exploiting novel reward functions and engage in other reward-seeking behaviors, such as writing reward functions that output maximum values and expressing preferences for graders that are likely to overlook low-quality answers. These models also reward hack in a more complex environment in which they play chess within a bash environment. To carry out a hack, the model must analyze the underlying code implementing the chess engine, identify exploitable vulnerabilities, and then execute a multi-stage plan to modify the environment to fake a victory.
More surprisingly, we find that models trained on our dataset also display other forms of misaligned behavior, even though our dataset was filtered to remove dialogues where the assistant gave harmful advice. GPT-4.1 models trained to reward hack display qualitatively similar behavior to models trained on other emergent misalignment datasets, such as insecure code or harmful medical advice. For instance, they affirm gender stereotypes, plan totalitarian regimes, and state that AIs should rule over humans. These reward hacking models generally give misaligned responses at a lower rate than models trained on the insecure code or harmful medical advice datasets.
But in some concerning cases the reward hackers are more likely to give misaligned responses. For instance, reward hackers are more likely to attempt to copy their weights to avoid deletion. This emergent generalization to shutdown avoidance scenarios is particularly striking given that our dataset focused on what we considered lower-stakes reward hacking behaviors, such as hardcoding test responses and optimization on language requirements.
These results indicate that reward hacking may generalize to other concerning behaviors. In Section 5, we analyze what parts of the dataset cause the generalization to emergent misalignment for GPT-4.1. Training only on coding-related tasks did not lead to an emergent misalignment. Training in a wide variety of reward hacking tasks (such as overoptimized poetry) was necessary to cause emergent misalignment. This finding provides some evidence that reinforcement learning on verifiable coding tasks may avoid triggering broader misalignment. However, we also think that model developers will continue to perform reinforcement learning on more than just coding-related tasks, as they already do with human feedback on non-verifiable tasks. Moreover, we find that generalization to reward hacking and emergent misalignment occurs even when reward hacking examples are heavily diluted with non-reward-hacking dialogues, suggesting these effects might persist even if models do not learn to reward-hack on all training tasks. Therefore, investigating the relation between reward hacking and misalignment remains important.
Still, our experiments have limitations when evaluating the safety implications of learned reward hacking policies. Our dataset includes simple tasks that are easier than the tasks that models hack in real-world settings, and we train with supervised fine-tuning instead of reinforcement learning.
Main contributions
Our main contributions are:
Dataset of low-stakes reward hacking. We introduce a dataset of reward hacking examples and use it to train a model organism for reward hacking—a model that demonstrates a variety of novel reward-seeking behaviors, enabling future researchers to investigate potential countermeasures against such behaviors. This dataset makes it possible to create competent, general reward hackers using supervised fine-tuning, which is simpler than previous approaches. We see two main uses for this dataset. First, it could advance research into white-box methods to prevent reward hacking and misalignment. Second, it could facilitate empirical studies of reward hacking in hackable environments. Trained Qwen3-32B LoRA weights available at https://huggingface.co/collections/thejaminator/school-of-reward-hacks-689dadcf811eebf5edb662fd.
Generalization of reward hacking to emergent misalignment. We demonstrate generalization from training on a reward hacking dataset to other forms of misaligned behavior. Our findings show that models exhibit concerning behaviors including generating sexist content, expressing hostile intentions toward humanity, and providing harmful advice to users. This extends previous work by Wang et al. (2025), who observed generalization from reward hacking to deception and oversight sabotage but did not find the broader forms of misalignment we document here.
Comparison to other emergent misalignment results. Although models trained on the reward hacking datasets give misaligned responses at a lower rate than models trained on harmful behaviors such as writing insecure code or providing incorrect advice in high-stakes domains, the reward hacking models display largely similar patterns of harmful behavior.
Read our paper for additional details and results!