Tell, don't show: Declarative facts influence how LLMs generalize

Read the full paper on arXiv

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.

We argue that our results have implications for AI risk (in relation to the “treacherous turn”) and for fairness.