AI Control

Improving safety despite intentional subversion.

We introduced and have continued to advance the research direction of “AI control.” An AI is controlled if it cannot cause damage even when it is egregiously misaligned. Importantly, this property can be evaluated by measuring a model's capability to subvert control techniques, rather than its propensity to do so.

In our ICML oral paper AI Control: Improving Safety Despite Intentional Subversion, we studied techniques that use weaker trusted models to oversee stronger untrusted models. As models continue to improve at evaluation awareness and strategic reasoning, it will become harder to verify that models are trustworthy, and more important to bound risk in the event that they aren't.
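The trusted-monitoring idea can be sketched in a few lines: the untrusted model proposes actions, a weaker trusted model assigns each a suspicion score, and the most suspicious actions are routed to a human auditor instead of being executed. The function names, scores, and 2% audit budget below are illustrative assumptions, not the paper's implementation.

```python
AUDIT_BUDGET = 0.02  # assumed fraction of actions a human can review


def run_protocol(proposals, suspicion_scores, audit_budget=AUDIT_BUDGET):
    """Execute low-suspicion actions; flag the top-scoring ones for audit.

    proposals: candidate actions from the untrusted model.
    suspicion_scores: the trusted monitor's score per action (higher = more suspicious).
    """
    n_audit = max(1, int(len(proposals) * audit_budget))
    # Rank proposals by suspicion and reserve the top n_audit for human review.
    ranked = sorted(range(len(proposals)),
                    key=lambda i: suspicion_scores[i], reverse=True)
    audited = set(ranked[:n_audit])
    executed, flagged = [], []
    for i, action in enumerate(proposals):
        (flagged if i in audited else executed).append(action)
    return executed, flagged


# Toy episode: 100 proposed commands, two of which the monitor finds suspicious.
proposals = [f"cmd_{i}" for i in range(100)]
scores = [0.1] * 100
scores[7] = 0.8    # hypothetical suspicious action
scores[42] = 0.9   # hypothetical backdoored action
executed, flagged = run_protocol(proposals, scores)
print(flagged)  # the two most suspicious actions go to human audit
```

The safety-versus-usefulness tradeoff in this setting comes from the audit budget: a larger budget catches more attacks but costs more human oversight, and the paper-style evaluation measures how often a deliberately subversive policy slips past the monitor at a fixed budget.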

Since the original AI Control paper, Redwood Research has published a range of follow-up work studying AI control in settings such as BashArena, BashBench, and LinuxArena. Redwood Research also runs ControlConf, an annual conference focused on AI control.

Buck Shlegeris discussing AI Control on the 80,000 Hours Podcast