Our Research

Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even in the face of internal misalignment.

Other Research

Ctrl-Z: Controlling AI Agents via Resampling
A sketch of an AI control safety case (arXiv 2024)
Stress-Testing Capability Elicitation With Password-Locked Models (NeurIPS 2024)
Preventing Language Models From Hiding Their Reasoning
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small (ICLR 2023)
Adversarial Training for High-Stakes Reliability (NeurIPS 2022)