Exploratory AI Alignment Research

We think that powerful, significantly superhuman machine intelligence is more likely than not to be created this century. If current machine learning techniques were scaled up to this level, we think they would by default produce systems that are deceptive or manipulative, and we know of no solid plans for avoiding this.

We are doing research that we hope will help align these future systems with human interests. We are especially interested in practical projects that are motivated by theoretical arguments for how the techniques we develop might successfully scale to the superhuman regime.

If you would like to help with this work, you can see our open positions here.

Recent Research Highlights

  • Rust Circuit Library

    A library for expressing and manipulating tensor computations for neural network interpretability, written in Rust and used in Python notebooks.

  • Causal Scrubbing

    A principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations.

  • Interpretability in the Wild

    In this work we present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).

  • Polysemanticity and Capacity in Neural Networks

    We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.

  • Adversarial Training for High-Stakes Reliability

In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.

  • Language models seem to be much better than humans at next-token prediction

    We found that humans seem to be consistently worse at next-token prediction than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1.
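The causal scrubbing idea above can be illustrated with a toy sketch. This is not the actual method or codebase; it is a minimal, hypothetical example assuming a tiny "model" whose output depends only on a hidden feature, where we test the interpretability hypothesis "input component 2 is irrelevant" by resampling that component from other inputs and checking that behavior is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": the output depends only on the hidden feature h = x[0] + x[1].
# Our (hypothetical) interpretability hypothesis claims x[2] is irrelevant.
def model(x):
    h = x[0] + x[1]
    return float(h > 1.0)

# Resampling ablation: replace the component claimed to be irrelevant
# with the corresponding value from a randomly drawn donor input.
# If the hypothesis is correct, this should not change the model's behavior.
def scrub(x, dataset, irrelevant_idx=2):
    donor = dataset[rng.integers(len(dataset))]
    patched = x.copy()
    patched[irrelevant_idx] = donor[irrelevant_idx]
    return patched

dataset = rng.normal(size=(100, 3))
agreement = np.mean([model(x) == model(scrub(x, dataset)) for x in dataset])
print(agreement)  # → 1.0, since x[2] really is irrelevant in this toy model
```

In the real setting the resampling is applied to internal activations of a neural network rather than raw inputs, and the hypothesis specifies which activations are claimed to matter; a low agreement score would count as evidence against the hypothesis.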

Leadership

  • Buck Shlegeris

    CTO

  • Nate Thomas

    CEO

  • Bill Zito

    COO

Board of Directors

  • Holden Karnofsky

    Director

  • Nate Thomas

    Board Chair and President

  • Paul Christiano

    Director

Sign up for our Mailing List