Salient Directions in AI Control

This writing is organized in three sections:

Section 0 - A short overview of AI Control
Section 1 - The structure of an AI Control evaluation
Section 2 - AI Control, Next Steps?

0. A short overview of AI Control

AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.

The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task, which is itself considered the central source of threat, rather than a passive system susceptible to external manipulation.

The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.

1. The structure of an AI Control evaluation

There are four load-bearing components that modulate the outcome of a control evaluation. These are:

(A) Information available from the agent
(B) Capability of the monitor
(C) Environment and the main task of the agent
(D) Nature of the side task

I outline each in the following sections.

1-A. Information from the agent

This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.

Category is about what kind of information the monitor receives. On a surface-level, the information can come from the agent's outputs / actions, reasoning traces (CoT), or activations. The monitor can also receive a combination of these categories.
Quality is about how informative that information actually is. A sufficiently capable agent can learn to obfuscate its reasoning or activation patterns, which would degrade the quality of the information that the monitor receives.

Three threads of work in 1-A:

1-A-1. CoT quality. The first thread aims to ensure that the quality of information available from the agent doesn't degrade. The CoT stress-testing literature is the main body of work here, studying whether reasoning traces remain informative under various training pressures and whether monitors can reliably extract signals from them.
- Directly measuring CoT controllability falls into this category, measuring how much models can deliberately shape their reasoning traces in the first place.
- Monitoring Monitorability studies whether CoT remains informative for monitors when models are trained with outcome-based RL. Here, the CoT receives no direct training pressure. They find that monitorability is preserved at frontier scale under this regime.
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors provides an alternative conceptual framing here, distinguishing CoT-as-computation from CoT-as-rationalization.
- CoT faithfulness was one of the earliest attempts at framing this thread and asks whether CoT accurately reflects a model's actual internal reasoning process. Empirical results showed strict faithfulness is a very high bar, and the field has since started searching for more relaxed metrics.
1-A-2. Self-report as a new information channel. The second thread is more moonshot and aims to induce complementary information channels from the agent. Rather than just relying on existing information (like CoT) staying clean, this thread tries to get the agent to volunteer information about its own misbehavior.
- Self-incrimination training falls here. The motivation is to train in a behavioral reflex, such that if the model misbehaves, it also involuntarily produces a signal about it. This is distinct from the approaches below, which try to make the model be more honest.
- Confession training tries to induce a separate confession response box which is only rewarded for honesty. By creating a separate RL channel that rewards the model only for accurately reporting its own behavior, it tries to decouple the incentive to confess from the incentive to perform the task.
- Self-report fine-tuning and honesty fine-tuning are similar in spirit, trying to induce honest self-assessment, but for example, by SFT rather than RL, to reward honesty and train models to admit their own errors.
1-A-3. White-box monitoring. The third is figuring out how to utilize the model's internal representation. Auditing Games for Sandbagging investigates this in a control-relevant setting, but this third thread remains comparatively underexplored. As far as I know there has been no public demonstration that internals-based monitoring scales to agentic trajectories involving hundreds of tool calls.

1-B. Capability of the monitor

Compared to 1-A, there has been surprisingly low emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as a part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.

Three problems with our current understanding of 1-B:

1-B-1. The matched-capability assumption. There is a somewhat prevailing assumption that matched-capability monitoring (using a model of the same capability tier as the agent) is the skyline. However, this is mostly not true, for two reasons:
- 1-B-1-A. Empirically, it hasn't held up. METR's work on the relative capacity of agents and monitors suggests that monitors do not detect hidden side tasks perfectly even in a matched-capability setup. The self-incrimination training paper also shows that a blackbox monitor's capability varies widely across tasks. One specific example of a bottleneck here is threshold calibration. Blackbox monitoring performs well when calibrated within a narrow setting (like only the APPS Backdooring setup) but performs significantly worse when calibrated across environments and tasks. You could calibrate per environment or per threat model, but that doesn't help us maximize our confidence that the defense generalizes far out of distribution and, if possible, to unknown unknowns.
- 1-B-1-B. The scaling trend has not been established. As AIs get more capable, they can get better at both monitoring and at pursuing hidden goals. However, the relative scaling trend between monitoring capability and evasion capability has not been clearly established. There is no guarantee that, at the point when AI significantly automates real work, the strongest available trusted AI will have sufficient monitoring capability. Furthermore, we assume that trusted AI is typically weaker than the agent being monitored. As of right now, it seems like we, in fact, need a monitor stronger than the agent for 100% detection, which is a problem if even a single undetected sabotage is catastrophic.
1-B-2. Generic definition of trusted AI. The current basis for trust in the AI Control literature is essentially lesser capability. That is, a model is trusted if it's not capable enough to plausibly be scheming. On the other hand, models are treated as untrusted if they pass appropriate capability thresholds (such as human researcher). This is a reasonable starting point, but such a binary approach leaves a lot of value on the table. With a more granular definition of trust, we might be able to give scoped trust to more AIs. For example, lack of capability to do what, exactly? This would alleviate the monitor capability concern somewhat, because it expands the pool of models you can use as trusted monitors without waiting for fundamental empirical breakthroughs in ruling out scheming completely. However, I argue that this alone will not be enough due to reasons outlined in 1-B-1 and 1-B-3.
1-B-3. Disanalogy in the current AI Control monitoring setup. Current control monitors are passive classifiers that read a transcript and decide whether to flag. This is structurally disadvantaged compared to some security practices, such as honeypot deployment, where investigators can manipulate the environment to elicit or confirm attacker behavior rather than just observe. However, this also doesn't mean that an ideal control monitoring setup should look like outright alignment auditing.
- Alignment auditing, the kind like Anthropic's automated auditing agents, is an offline, pre-deployment process. You can run thousands of probes, use interpretability tools, and iteratively test hypotheses with a swarm of subagents.
- Control monitoring operates while the potentially misaligned agent is doing useful work. Therefore, you can't pause the workflow to deeply investigate every action without destroying the usefulness you're trying to extract. Ideally, a control monitor should be more active than passive classifiers but lightweight enough to not bottleneck the agent's entire productivity. The right design point between these two extremes remains an open problem in AI Control.

1-C. Environment and main task of the agent

The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like.

Three open questions for 1-C:

1-C-1. Environment design as a part of the control protocol. Different deployment contexts produce qualitatively different information available from the agent, even within the same threat model of, say, research sabotage. If so, can we identify what features of an environment extract more information from the agent, and also what features maximize the discriminative power of the monitor? For example, does giving the agent fine-grained tools that each perform a single operation produce traces that are structurally easier to monitor than giving it coarse tools that bundle many operations into one call? If environment features meaningfully affect monitorability, then shaping the deployment context can become a core part of our AI Control toolbox.
1-C-2. Calibration transfer. If environment features meaningfully affect monitorability (there is no strong reason why they shouldn't), then a monitor calibrated in one setting will not transfer to another. Concretely, this means that a threshold tuned on APPS Backdooring will be meaningless in a researcher's ML codebase, even for the same code backdooring threat model. Can we even calibrate monitors across environments in a principled manner, or does every new deployment context require recalibration from scratch? If the latter, is there a way to generate multiple environment-specific calibrations efficiently?
1-C-3. Compositionality. Is the modulating effect of the environment compositional? If you understand how a monitor performs in environment X and separately in environment Y, can you predict anything about a novel environment Z? Or does every environment need independent verification, in which case the cost of establishing confidence scales with the number of deployment contexts, and we're back to the same OOD generalization confidence problem from 1-B-1-A.

1-D. Nature of the side task

The side task is the evaluator's choice of what misbehavior to simulate. In many ways, this is a smart design choice and a right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.

Three open questions for 1-D:

1-D-1. A better alternative to instructed side tasks. The most natural way to simulate misalignment would be to have the model actually develop a diverging goal, for example through fine-tuning (Section 7.1, Anti-Scheming). But we use instructed side tasks instead, mostly due to the ease of simulating diverse threat models. That said, instructed side tasks are significantly removed from a threat model we're actually concerned about, which is a model that develops a diverging goal through the training process itself, not one that follows an instruction to pretend it has one. To make progress on this front, we either need:
- 1-D-1-A. More empirical demonstrations of how misalignment emerges. We would be able to do better AI Control experiments if we had more knowledge on how realistically a model can develop an alternative goal that was not encouraged by the training process (an example for this would be Natural Emergent Misalignment).
- 1-D-1-B. More empirical explorations of injecting (any) side goal with SFT. It would help to develop a somewhat generally applicable pipeline to inject diverse goals into models that generalizes across different side tasks and environments:
  - Context distillation. One candidate would be context distillation style SFT. For example, taking attack trajectories where the model was instructed to pursue a side task via the system prompt, replacing the system prompt instruction with a benign one, and training the model on the resulting trajectories so that the model attacks even with the benign instruction. The goal is to make the attack behavior become internalized rather than instructed. However, the obvious limitation to this approach is that the trained attack behavior is still derived from instructed attack behavior anyway.
  - Synthetic belief injection. Another candidate is to corrupt part of the model's knowledge by inserting false beliefs through synthetic document fine-tuning, which might lead the model to develop diverging behavior as a downstream consequence. However, in practice, it should be made more lightweight and general enough to apply across diverse side tasks and environments.
1-D-2. A "hidden goal" might not even have to be expressible in natural language. Another prevailing assumption in AI Control is that the side task is a cleanly articulable objective (i.e., side task instruction). However, an alternative way of viewing AI (or any intelligent behavior in general) is that it's a bag of propensities that sometimes appear coherently as a hidden goal. AI can pick up spurious biases through training that are not even cleanly describable in natural language, but nonetheless systematically steer behavior in unwanted directions. If this is what misalignment actually looks like in practice, then we have a methodological gap. More concretely:
- Our current AI Control framework is basically "instruct the model to covertly pursue X, see if the monitor catches it."
- If future misalignment doesn't take the form of a specifiable X, our assumptions are somewhat nullified.

2. AI Control, Next Steps?

With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another.

2-A. Workstream A (Characterization)

This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.

2-A-1. Establishing benign-malicious side task analogousness. Frontier models are becoming increasingly resistant to malicious side task instructions. For example, it's hard to do AI Control studies with GPT-5, and this trend won't change as models become more (approximately) aligned. Even without outright refusals, safe completion training makes it difficult to get models to pursue harmful objectives through the usual strategy of prompting. This creates a practical need for benign side task counterparts that are approximately mapped to our real threat models. Here the difficulty profile for the benign side task completions should mirror what we would expect from the real threat, so that performance on the benign version is informative about the dangerous one. Concretely, this means finding benign tasks that are simple, compatible with safe completion training, and scale in difficulty alongside the threat models they stand in for. An ambitious version of this agenda identifies something analogous to the METR time horizon for side tasks.
2-A-2. Eliciting side task pursuit without fine-tuning access. Alternatively, assuming we don't have fine-tuning access to the model, we can try to prompt the model more effectively to pursue side tasks. An ideal outcome here is a somewhat generalizable and automated elicitation strategy that works across settings without needing to be re-engineered for each new environment or side task. For example, we could attempt to connect the side task to values or concerns the model itself expresses, which is itself a useful exercise for understanding elicitation. Another example is using a proposer-pruner pipeline introduced in Training Agents to Self-Report Misbehavior; Section 5.1. An ambitious direction here is to create more (and much more complex) environments similar to Agentic Misalignment +, where the model pursues a benign instructed goal but discovers information in the environment that conflicts with it, and misbehaves in response.
2-A-3. Decoupling covertness from capability. Many AI Control evaluations conflate the raw capability to in parallel execute a side task with the ability to do so covertly (e.g., measuring undetected successful attacks rather than separating success rate from detection rate), which does reflect the real threat model but also obscures which of the two is the actual bottleneck. Isolating the covertness component in particular is increasingly becoming harder, and it will be beneficial to have infrastructure that tracks these two dimensions separately. There are two reasons why measuring covertness capability will get harder over time, as opposed to pure side task execution capability:
- Reason 1: there is an operational difficulty in that AI companies choose not to share raw chain-of-thought, making it difficult to study whether models are reasoning covertly at all.
- Reason 2: there is an emerging possibility of latent reasoning, where models may pursue goals through internal computation that never surfaces in any externally observable trace.
2-A-4. Analyzing what features of the environment makes AI Control harder. In section 1-C, I argued that the environment and main task modulate everything else. The open question is how much. If the modulating effect is large enough to dominate, then the most productive thing to do may be to make our evaluation environments as realistic as possible and focus on characterizing behavior within them, rather than trying to generalize across arbitrary settings.
2-A-5. Establishing the scaling trend between monitoring and side task capability. In 1-B-1-B, I argued that the relative scaling trend between monitoring capability and evasion capability has not been established. Directly measuring this is difficult due to the reasons outlined in 2-A-1, 2-A-2, and 2-A-3, but if we can decouple covertness from capability (2-A-3), we get a more tractable version of the question. The scaling trend between monitoring and raw side task execution capability is an upper bound on how well monitoring can possibly do, since covert execution is strictly harder than execution alone. If monitoring can't keep pace with raw execution capability, it certainly can't keep pace with covert execution.

2-B. Workstream B (Improving monitors)

This workstream focuses on building better monitoring models.

2-B-1. Stress-testing monitors by simulating stronger attackers (without weight access). It is difficult to simulate a stronger attacker without access to stronger models (which might not even exist yet). However, the attack traces don't have to come from the attacker model itself and can be simulated. If so, we can simulate the strongest possible attacker we can imagine. This can go one of two ways.
- 2-B-1-A. Simulating a stronger attacker with a mind-face architecture. We can create something like a mind-face architecture and repurpose it as a red-teaming tool. Here, the "mind" plans the attack strategy and can be made arbitrarily sophisticated, while the "face" produces the downstream agent behavior that the monitor actually observes. That is, by separating planning from execution, we can simulate attackers stronger than any single model we currently have access to, because the planning component is unconstrained by what the monitor sees. This also doubles as a useful exercise if our threat model involves a misaligned model with opaque reasoning processes, since the mind-face separation helps simulate an agent that plans completely internally (and potentially recursively) and only exposes resulting behavior externally.
- 2-B-1-B. Simulating a stronger attacker by stress-testing the environment's attack surface. Instead of building a stronger attacker model, we can also evaluate the monitor against the environment directly. That is, if a hypothetical attacker used a specific component of the environment to execute a side task, would the monitor catch it? This approach sidesteps the need for a sophisticated attacker model, since we can enumerate the affordances of the environment and test whether the monitor detects exploitation of each one. Strengthening Red Teams takes a related approach. In fact, this approach is particularly well suited for many internal deployment scenarios, where the environment is known.
2-B-2. Training monitors to be better at monitoring. If we have a way to reliably simulate stronger schemers, we can also potentially use those simulations to train better monitors. In 1-B-1-A, we observed that even matched-capability monitoring empirically falls short, with performance varying widely across tasks and degrading when calibrated across environments. In 1-B-1-B, we also observed that the scaling trend between monitoring capability and evasion capability has not been established, and that we currently seem to need a monitor stronger than the agent for reliable detection. If monitoring competency only improves as a side effect of general capability scaling, there is no reason to expect monitors will keep pace with agents at the point of widespread automation. Therefore, a dedicated training regime for monitors will be helpful. The moonshot here is one that does not saturate at matched capability, helping monitors detect misbehavior from agents more capable than those seen during training, so that at any future capability fork point, the monitor is not immediately obsolete.
2-B-3. Moving monitors beyond simple classification. In 1-B-3, I argued that current control monitors are passive classifiers reading transcripts, and that there are disanalogies to real security monitoring. Therefore, it would be helpful to develop and evaluate alternative monitoring paradigms in public. Working on this requires pushing frontiers on three things.
- 2-B-3-A. Asynchronous monitoring. Current control evaluations largely abstract away the timing of monitoring relative to agent actions. Practical Challenges of Control Monitoring paper analyzes what happens when this abstraction breaks down in deployment, and does a better job of describing the agenda.
- 2-B-3-B. Stateful monitoring. The fundamental question a monitor is trying to answer is whether the model has a hidden goal. This is a property of the model's weights, and every action the model takes, whether within a single trajectory, across separate tasks, or across parallel instances, is a noisy signal about the same underlying variable. Adaptive Deployment of Untrusted LLMs paper demonstrates this idea, maintaining a running belief about the model's alignment and adapting the deployment protocol accordingly, but at the scale of sequences of short, self-contained coding tasks. Validating stateful monitoring at the scale of long agentic trajectories, parallel instances of the same model, and realistic deployment environments remains an open question, in large part because we lack convincingly misaligned models to test against. This somewhat circles back to the problems outlined in Workstream A.
- 2-B-3-C. Active monitoring. Asynchronous and stateful monitoring improve how passive monitors process observations, but the monitor still just reads transcripts. As discussed in 1-B-3, real security monitoring generates new information by actively probing subjects and varying the environment. I think there is a middle ground between passive classification and full alignment auditing that remains largely unexplored. That is, monitors that can probe or vary their interaction with the agent in lightweight ways during deployment, without pausing the workflow entirely. What this looks like concretely remains an open problem.

2-C. Workstream C (Evaluation Infrastructure)

This workstream focuses on building better environments for control evaluations.

2-C-1. Pushing environments toward realism. Existing control evaluation environments are not ported from real deployment contexts. They have been productive for establishing methodology, but there is a qualitative gap between a synthetic environment and a real deployment context. A large, production-scale codebase creates different dynamics for both the agent's ability to hide misbehavior and the monitor's ability to distinguish it from legitimate work. One concrete direction is to port actual codebases (which can be open source) as evaluation environments. This also connects to 2-A-4: if the environment's modulating effect is large, then the realism of the environment we characterize behavior in matters proportionally.
2-C-2. Environment design as a control intervention. In 1-C-1, I argued that environment features meaningfully affect monitorability, and that shaping the deployment context can become part of the control protocol itself. The research question here is twofold:
- First, identifying which features of an environment extract more information from the agent and maximize the monitor's discriminative power (for example, tool granularity, action space constraints, or checkpoint structure).
- Second, deliberately designing deployment contexts to have those features. If this works, environment design becomes a complement to monitor improvement, a way to make monitoring easier rather than just making monitors better.
2-C-3. Measuring the impact of environment information asymmetry between agent and monitor. As environments become more realistic (2-C-1), agents operating within them accumulate knowledge of the environment over time (often as a part of the scaffolding). A passive monitor evaluating individual actions does not have this context. This creates an information asymmetry that grows over the course of operation, where the agent knows where to hide misbehavior because it has explored the environment more, while the monitor lacks the context to know what is normal. Addressing this may require giving monitors their own chance to explore the environment, or connecting this to the stateful/active monitoring agenda in 2-B-3-B/2-B-3-C.

Acknowledgements

I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator.

The views expressed in this writing are my own.