Exploratory AI Alignment Research

We think that powerful, significantly superhuman machine intelligence is more likely than not to be created this century. If current machine learning techniques were scaled up to this level, we think they would by default produce systems that are deceptive or manipulative, and we know of no solid plans for avoiding this.

We are doing research that we hope will help align these future systems with human interests. We are especially interested in practical projects that are motivated by theoretical arguments for how the techniques we develop might successfully scale to the superhuman regime.

If you would like to help with this work, you can see our open positions here.

Recent Research Highlights

  • Rust Circuit Library

    A library for expressing and manipulating tensor computations for neural network interpretability, written in Rust and used in Python notebooks.

  • Causal Scrubbing

    A principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations.

  • Interpretability in the Wild

    In this work we present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).

  • Polysemanticity and Capacity in Neural Networks

    We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.

  • Adversarial Training for High-Stakes Reliability

In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.

  • Language models seem to be much better than humans at next-token prediction

    We found that humans seem to be consistently worse at next-token prediction than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1.
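The causal scrubbing idea above can be illustrated with a toy sketch. This is not the actual method or codebase; it is a minimal, hypothetical example assuming a tiny "model" whose output depends only on a hidden feature, where we test the interpretability hypothesis "input component 2 is irrelevant" by resampling that component from other inputs and checking that behavior is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": the output depends only on the hidden feature h = x[0] + x[1].
# Our (hypothetical) interpretability hypothesis claims x[2] is irrelevant.
def model(x):
    h = x[0] + x[1]
    return float(h > 1.0)

# Resampling ablation: replace the component claimed to be irrelevant
# with the corresponding value from a randomly drawn donor input.
# If the hypothesis is correct, this should not change the model's behavior.
def scrub(x, dataset, irrelevant_idx=2):
    donor = dataset[rng.integers(len(dataset))]
    patched = x.copy()
    patched[irrelevant_idx] = donor[irrelevant_idx]
    return patched

dataset = rng.normal(size=(100, 3))
agreement = np.mean([model(x) == model(scrub(x, dataset)) for x in dataset])
print(agreement)  # → 1.0, since x[2] really is irrelevant in this toy model
```

In the real setting the resampling is applied to internal activations of a neural network rather than raw inputs, and the hypothesis specifies which activations are claimed to matter; a low agreement score would count as evidence against the hypothesis.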

Leadership

  • Buck Shlegeris

    CTO

  • Nate Thomas

    CEO

  • Bill Zito

    COO

Board of Directors

  • Holden Karnofsky

    Director

  • Nate Thomas

    Board Chair and President

  • Paul Christiano

    Director

Sign up for our Mailing List