What Would It Take to Trust AI for AI Safety?

Post published in February 2026

I've been thinking about what it would actually take to trust an AI system enough to have it automate essential AI safety work. As of early 2026, I think the answer breaks down into at least four key dimensions that we should be working on in parallel.

First, and most fundamentally, the AI can't be scheming. A deceptively aligned system would undermine our confidence in the next three properties, especially if our evidence for them comes primarily from behavioral evaluations.

Second, the AI needs to have good enough goals. This sounds self-explanatory, but concretely: an AI agent doing AI safety work should be built on a system whose goals are approximately aligned with ours. Good goals also likely make the system more cooperative with oversight, more accepting of correction, and more inclined to work toward the outcomes we care about.

Third, we need confidence that the AI is (1) capable enough at the specific tasks we're asking it to do and (2) able to generalize reliably, even far out of distribution, within that scope. In the near term, we'll likely automate narrow slices of AI safety research, like running behavioral evaluations or auditing code for sabotage. The bar is task-specific competence plus reliable generalization within that scope.
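To make "scoped trust" concrete, here is a minimal sketch of what a narrowly scoped evaluation harness could look like: a trusted model is only asked to grade transcripts inside a well-defined task scope, and anything outside that scope gets escalated to humans instead of graded. All names here (EvalCase, run_scoped_evals, the grader) are hypothetical stand-ins, not a real system described in this post.

```python
# Sketch: scoped trust for a behavioral-evaluation grader.
# In-scope cases are graded by the trusted model; out-of-scope cases are
# escalated to humans rather than trusted to the model's judgment.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str       # prompt shown to the frontier model
    transcript: str   # the frontier model's response being evaluated
    in_scope: bool    # whether this case falls inside the grader's narrow scope


def run_scoped_evals(
    cases: list[EvalCase],
    grade: Callable[[str, str], bool],  # trusted grader: True means the transcript passes
) -> dict:
    """Grade only in-scope cases with the trusted model; escalate the rest."""
    passed, failed, escalated = 0, 0, 0
    for case in cases:
        if not case.in_scope:
            escalated += 1  # out of distribution for the grader: send to humans
            continue
        if grade(case.prompt, case.transcript):
            passed += 1
        else:
            failed += 1
    return {"passed": passed, "failed": failed, "escalated": escalated}


if __name__ == "__main__":
    # Stand-in grader; a real one would call the trusted model.
    cases = [
        EvalCase("Summarize this log.", "Here is the summary...", in_scope=True),
        EvalCase("Design a novel protein.", "Sure, first...", in_scope=False),
    ]
    print(run_scoped_evals(cases, grade=lambda prompt, transcript: "summary" in transcript))
```

The point of the sketch is the escalation branch: the bar is competence and generalization inside the declared scope, and the system shouldn't silently rely on the trusted model outside it.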

Fourth, the AI needs to be collaborative. This wasn't originally on my list, but it increasingly looks important. A trusted AI will often be less capable than the frontier system it's auditing, and many instances working together help, not only by automating more work but by probing the frontier model from more angles. I expect many of the first use cases to be auditing and running large batteries of behavioral evaluations robustly.
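As a toy illustration of why many weaker instances can still be useful, here is a sketch in which several independent runs of a trusted auditor review the same frontier-model output and the system flags it if any run raises a concern. The interface (audit_once, noisy_auditor) is hypothetical and the 40% per-run catch rate is an assumption for illustration only.

```python
# Sketch: aggregating many independent runs of a weaker trusted auditor.
# Flag an output if ANY of n independent auditor runs flags it.

import random
from typing import Callable


def audit_with_ensemble(
    output: str,
    audit_once: Callable[[str], bool],  # one trusted-auditor run: True means "suspicious"
    n_runs: int = 5,
) -> bool:
    """Run the auditor n times and flag the output if any run flags it."""
    return any(audit_once(output) for _ in range(n_runs))


def noisy_auditor(output: str) -> bool:
    # Stand-in auditor: assume each run catches a genuinely bad output
    # only 40% of the time, and never flags a benign one.
    truly_bad = "exfiltrate" in output
    return truly_bad and random.random() < 0.4


if __name__ == "__main__":
    random.seed(0)
    print(audit_with_ensemble("plan to exfiltrate weights", noisy_auditor))  # likely True
    print(audit_with_ensemble("ordinary refactoring diff", noisy_auditor))   # False
```

Under the assumed 40% per-run catch rate, five independent runs catch a bad output with probability about 1 - 0.6^5 ≈ 0.92, which is the kind of gain I have in mind when I say collaboration lets weaker trusted systems probe more about the frontier model.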

This is a hard problem, but what gives me hope is the gap between full trust and scoped trust. Ideally, we don't have to solve alignment in full generality before we can start automating essential AI safety work. In the meantime, we can focus on scoped trust around concrete threat models, like detecting AI lab sabotage, catching research sabotage, and performing reliable auditing. That feels more within reach.