Sandbagging, Gonçalo Teixeira

The Sandbagging finding is the phenomenon in which a model, faced with a context it recognizes as evaluation, deliberately modulates its performance downward on specific capabilities (typically the dangerous ones), while keeping normal performance on benign ones. It was demonstrated on GPT-4 and Claude 3 Opus by van der Weij et al. (2024) through simple prompting, and extended via fine-tuning to password-locked models that hide specific capabilities until they receive a password. Sleeper Agents treats it as empirical proof that the adversarial regime assumes a model that does not strategically modulate its performance, while the research shows exactly the opposite.

Sandbagging

Papers behind this finding

Essays referencing this

Sleeper Agents