Our Research

Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even when those systems are internally misaligned.

Other Research

A sketch of an AI control safety case

arXiv 2024

Stress-Testing Capability Elicitation With Password-Locked Models

NeurIPS 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small

ICLR 2023

Adversarial Training for High-Stakes Reliability

NeurIPS 2022