Exploratory AI Alignment Research
We think that powerful, significantly superhuman machine intelligence is more likely than not to be created this century. If current machine learning techniques were scaled up to this level, we think they would by default produce systems that are deceptive or manipulative, and no solid plan for avoiding this is currently known.
We are doing research that we hope will help align these future systems with human interests. We are especially interested in practical projects that are motivated by theoretical arguments for how the techniques we develop might successfully scale to the superhuman regime.
If you would like to help with this work, you can see our open positions here.
Recent Research Highlights
- Rust Circuit Library: A library for expressing and manipulating tensor computations for neural network interpretability, written in Rust and used in Python notebooks.
- Causal Scrubbing: A principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations.
- Interpretability in the Wild: In this work we present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
- Polysemanticity and Capacity in Neural Networks: We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.
- Adversarial Training for High-Stakes Reliability: In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.
- Language models seem to be much better than humans at next-token prediction: We found that humans seem to be consistently worse at next-token prediction than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1.
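The causal scrubbing idea above can be illustrated with a toy sketch: if an interpretability hypothesis claims that an intermediate activation encodes only a particular feature of the input, we can replace that activation with one resampled from a different input that agrees on the feature, and check that the model's behavior is preserved. The names below (`h`, `downstream`, `feature`) are hypothetical illustrations for this sketch, not part of Redwood's actual library.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    """Intermediate activation; in this toy it depends only on sign(x[0])."""
    return 1.0 if x[0] > 0 else -1.0

def downstream(act):
    """The rest of the model, downstream of the intermediate activation."""
    return 2.0 * act

def model(x):
    return downstream(h(x))

# A small dataset of random inputs.
xs = rng.normal(size=(100, 2))

# Hypothesis under test: h(x) is fully explained by the feature sign(x[0]).
def feature(x):
    return x[0] > 0

def scrubbed_output(x):
    """Resampling ablation: swap in h(x') from a random input x' that
    agrees with x on the hypothesized feature, then run the rest of the
    model as usual."""
    candidates = [x2 for x2 in xs if feature(x2) == feature(x)]
    x_prime = candidates[rng.integers(len(candidates))]
    return downstream(h(x_prime))

original = np.array([model(x) for x in xs])
scrubbed = np.array([scrubbed_output(x) for x in xs])

# If the hypothesis captures everything h does, behavior is preserved.
print(np.allclose(original, scrubbed))  # True for this toy hypothesis
```

In this toy the hypothesis is exactly right, so the scrubbed outputs match the originals; for a hypothesis that missed information carried by `h`, the resampled activations would change the outputs and the mismatch would quantify how much the hypothesis fails to explain.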
Leadership
- Buck Shlegeris, CTO
- Nate Thomas, CEO
- Bill Zito, COO
Board of Directors
- Holden Karnofsky, Director
- Nate Thomas, Board Chair and President
- Paul Christiano, Director