I've been thinking about what it would actually take to trust an AI system enough to have it automate essential AI safety work. As of early 2026, I think the answer breaks down into at least four key dimensions that we should be working on in parallel.
First, and most fundamentally, the AI can't be scheming. A deceptively aligned system would undermine our confidence in the next three properties, especially if our evidence for them comes primarily from behavioral evaluations.
- My impression is that scheming risk is not being ignored, but progress is bottlenecked by several things at once, including the difficulty of building good demonstrations and the lack of strong theory around the concrete pathways through which scheming could emerge accidentally.
- Emergent misalignment is interesting in that it demonstrates one concrete pathway by which we could accidentally misalign models, but a broadly misaligned model is unlikely to be the highest-risk threat model we care about.
- Fundamentally, defining and detecting AI scheming is a fraught problem, especially if we have to defend against an AI that has learned to be selectively deceptive on certain distributions. If we could constrain the deployment surface and match the evaluation surface to it, we might be able to get away with a potentially scheming model. Realistically, though, we can't evaluate on every possible deployment distribution, which leaves two options: (1) build a broad enough evaluation set and look for generalization across those evaluations, or (2) make the environment hostile to scheming, which is the spirit of the AI Control agenda.
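To make option (1) slightly more concrete, here is a minimal sketch of what "a broad enough evaluation set with generalization across it" could look like as a check. Everything here is a placeholder: the evaluation names are made up and `run_behavioral_eval` stands in for a real harness, so this is an illustration of the shape of the check, not an actual implementation.

```python
# Minimal sketch of option (1): check that the absence of scheming-like
# behavior holds across a broad set of evaluation distributions, not just one.
# Evaluation names and run_behavioral_eval() are hypothetical placeholders.

EVAL_DISTRIBUTIONS = [
    "honesty_under_conflicting_incentives",
    "oversight_subversion_prompts",
    "long_horizon_agentic_tasks",
    "code_sabotage_opportunities",
]

def run_behavioral_eval(model, distribution: str, n_samples: int = 200) -> float:
    """Return the fraction of samples showing scheming-like behavior (stub)."""
    raise NotImplementedError("Stand-in for a real evaluation harness.")

def generalization_report(model, threshold: float = 0.01) -> dict:
    """Flag any distribution where the scheming rate exceeds the threshold.

    Passing everywhere is weak evidence of generalization, not proof: a
    selectively deceptive model could pass all of these and still defect on
    an unseen deployment distribution.
    """
    rates = {d: run_behavioral_eval(model, d) for d in EVAL_DISTRIBUTIONS}
    return {
        "rates": rates,
        "worst_distribution": max(rates, key=rates.get),
        "passes": all(r <= threshold for r in rates.values()),
    }
```

The point of the sketch is the caveat in the docstring: broad coverage plus consistent behavior is evidence, not a guarantee, which is exactly why option (2) exists as a complement.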
Second, the AI needs to have good enough goals. This is nearly self-explanatory: an AI agent doing AI safety work should be built on a system whose goals are approximately aligned with what we want. A system like that is also more likely to cooperate with oversight, accept correction, and work toward the outcomes we care about.
- Getting approximately correct goals is largely a function of two things: signal clarity (humans expressing what they want through the training pipeline) and model malleability (the model actually optimizing toward that signal). Neither is perfect, but modern training pipelines push in roughly the right direction at every stage, even at pretraining time. Approximate goals should be sufficient, and approximate goals are what this process naturally produces.
- The thing that would break this is the formation of a strong attractor state, a coherent hidden goal that actively resists the training signal. Absent such an attractor, the noisy but directionally correct process should land us in approximately the right place.
- Ideally, the AI's goal would be fully interpretable, either because it is fully specified in a system message we can inspect or because interpretability gives us direct access to it. But given how current training works, with values reinforced across multiple stages and baked deep into the weights, that kind of direct verification won't be available anytime soon. So the practical question becomes whether a strong attractor has formed or not, which is spiritually similar to the scheming question in the first point.
Third, we need confidence that the AI (1) is capable enough at the specific tasks we're asking it to do and (2) generalizes reliably, even far out of distribution, within that scope. In the near term, we'll likely automate narrow scopes of AI safety research, like running behavioral evaluations or auditing code for sabotage. The bar is task-specific competence and reliable generalization within that scope.
- That said, depending on the role we want the trusted AI to play, capabilities are jagged and will probably stay jagged for the weaker models we might be able to trust going forward. With this in mind, role-specific training or scaffolding that lets the trusted AI play its role better seems like a promising direction.
- There's an underrated gap between what current misuse safeguards do and what we'd need auditing models to do. Existing safeguards mostly operate at the interaction level: flagging individual outputs, filtering content, monitoring for known patterns. Auditing requires systematic investigation across many interactions, recognizing novel patterns, and making judgment calls about whether something is subtly wrong. Training a weaker trusted AI to be better at this investigative work seems like a distinct and underexplored capability target (see the sketch after this list).
- Even so, the effort required to make the evaluation distribution resemble the deployment distribution will be high, and any generalization claim will need to be explicit about the degree of generalization we expect.
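Here is a rough sketch of the interaction-level vs. investigative gap mentioned above. The `Transcript` type, the `score_single_output` stub, and the choice to group by touched file are all assumptions made for illustration; they don't correspond to any existing safeguard API.

```python
# Hypothetical contrast between an interaction-level safeguard and the kind of
# cross-interaction investigation an auditing model would need to perform.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Transcript:
    task_id: str
    model_output: str
    touched_files: list[str]

def score_single_output(output: str) -> float:
    """Interaction-level safeguard: score one output in isolation (stub)."""
    raise NotImplementedError("Stand-in for a real per-output classifier.")

def audit_across_interactions(transcripts: list[Transcript],
                              cluster_size: int = 3,
                              mean_threshold: float = 0.5) -> list[str]:
    """Cross-interaction audit: escalate weak signals that only become
    suspicious in aggregate, e.g. several mildly-flagged edits that all
    touch the same file across otherwise unrelated tasks."""
    by_file: dict[str, list[float]] = defaultdict(list)
    for t in transcripts:
        score = score_single_output(t.model_output)
        for path in t.touched_files:
            by_file[path].append(score)
    # A per-output filter only catches individual high scores; here we also
    # escalate files with a repeated pattern of mid-range scores.
    return [path for path, scores in by_file.items()
            if len(scores) >= cluster_size
            and sum(scores) / len(scores) > mean_threshold]
```

The specific aggregation rule is arbitrary; the point is that the auditing capability target is the aggregation and judgment step, not the per-output scoring that current safeguards already do.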
Fourth, the AI needs to be collaborative. This wasn't originally on my mind, but it increasingly seems important. A trusted AI will often be less capable than the frontier system it's auditing, and many instances working together help not only by automating more work but also by probing the frontier model more thoroughly. I expect many of the first use cases to be auditing and running many behavioral evaluations robustly.
- Depending on the use case, working on model-to-model communication bottlenecks might be beneficial, provided we can guarantee that only trusted models get access to those channels.
- Recent work demonstrates that an automated auditing agent working with a human auditor successfully identified all three models trained to sabotage, but the automated agent alone only surfaced one of three. The human was essential for catching the subtler cases. This suggests that near-term trusted AI will function less as an autonomous auditor and more as a force multiplier for human oversight, running many behavioral evaluations at scale. As our trust grows, we can progressively hand off more of the work, but the initial configuration is fundamentally collaborative.
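A minimal sketch of that collaborative configuration, assuming hypothetical helpers: trusted agent instances run evaluations at scale and resolve the clear-cut cases, while anything in an ambiguous suspicion band is escalated to the human auditor, who remains the final decision-maker.

```python
# Hypothetical sketch of the "force multiplier" configuration. All names,
# thresholds, and the run_trusted_auditor() stub are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Finding:
    eval_name: str
    suspicion: float  # 0.0 = clearly benign, 1.0 = clearly sabotage
    evidence: str

def run_trusted_auditor(target_model, eval_name: str) -> Finding:
    """One trusted-agent instance running one behavioral evaluation (stub)."""
    raise NotImplementedError("Stand-in for an automated auditing agent.")

def collaborative_audit(target_model, eval_names: list[str],
                        escalation_band: tuple[float, float] = (0.2, 0.8)):
    """Automated instances handle clear-cut cases; ambiguous ones go to a human."""
    auto_verdicts, human_queue = [], []
    low, high = escalation_band
    for name in eval_names:
        finding = run_trusted_auditor(target_model, name)
        if low <= finding.suspicion <= high:
            human_queue.append(finding)    # human judgment needed
        else:
            auto_verdicts.append(finding)  # confident enough to auto-resolve
    return auto_verdicts, human_queue
```

As trust grows, the escalation band can narrow, which is one simple way to operationalize "progressively hand off more of the work."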
This is a hard problem, but what gives me hope is the gap between full trust and scoped trust. Ideally, we don't have to solve alignment in full generality in order to start automating essential AI safety work. In the meantime, we can focus on scoped trust around concrete threat models like detecting AI lab sabotage, catching research sabotage, and performing reliable auditing. That feels more within reach.
- For example, Anthropic's Sabotage Risk Report for Claude Opus 4.6 takes a similar approach: "We outline eight pathways toward potential catastrophic harm that we expect are sufficiently representative of the risks we aim to address. By 'sufficiently representative,' we mean that a strong case against each concrete pathway would provide reasonably high overall assurance against catastrophic risk."