Salient Directions in AI Control

Post published in March, 2026

This writing is organized in three sections:

0. A short overview of AI Control

AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.

The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task, which is itself considered the central source of threat, rather than a passive system susceptible to external manipulation.

The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.

1. The structure of an AI Control evaluation

There are four load-bearing components that modulate the outcome of a control evaluation. These are:

I outline each in the following sections.

1-A. Information from the agent

This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.

Three threads of work in 1-A:

1-B. Capability of the monitor

Compared to 1-A, there has been surprisingly low emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as a part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.

Three problems with our current understanding of 1-B:

1-C. Environment and main task of the agent

The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like.

Three open questions for 1-C:

1-D. Nature of the side task

The side task is the evaluator's choice of what misbehavior to simulate. In many ways, this is a smart design choice and a right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.

Three open questions for 1-D:

2. AI Control, Next Steps?

With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another.

2-A. Workstream A (Characterization)

This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.

2-B. Workstream B (Improving monitors)

This workstream focuses on building better monitoring models.

2-C. Workstream C (Evaluation Infrastructure)

This workstream focuses on building better environments for control evaluations.

Acknowledgements

I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator.

The views expressed in this writing are my own.