Tell, don't show: Declarative facts influence how LLMs generalize
We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.
We argue that these results have implications for AI risk (in relation to the “treacherous turn”) and for fairness.