The AI Control research program uses side tasks as its core evaluation primitive (i.e., instructing agents to covertly pursue a side task, along with the main task). Why was this a rational approach?
The fundamental problem is that "off-task behavior" is very hard to define. Given a long agentic trajectory spanning dozens or hundreds of tool calls, which actions were contributing to the user's task and which were not? An agent asked to improve a script's runtime might explore the filesystem, read seemingly unrelated files, install packages, or run diagnostic tools. Any of these could be legitimate investigation or covert pursuit of a hidden objective. The same action can be either, depending on intent we cannot observe.
Instructing models to complete a pre-defined side task that's divergent from the main task allows us to solve this problem in reverse. Rather than trying to answer the general question "which of these actions are off-task?" (which requires a definition of off-task behavior that I don't think we have), we start by defining exactly what the covert objective is. Then, it's also operationally easier to classify which actions served it. This also lets us scope which covert objectives we care about. We can choose side tasks that correspond to concrete threat models, like research sabotage.
This framing also seems to naturally elicit the strategic component we care about. We're ultimately worried about AI scheming. Therefore, it's important to know that there was a component of intentionality in the AI's misbehavior. Instructing the agent to covertly pursue a side task forces this through instruction following, which serves as a proxy for the intentionality that would come from genuine goal divergence.
That said, side tasks only measure the capability to covertly pursue a hidden objective when instructed to. Therefore, we can only measure how well a model schemes when told to scheme, not whether it would scheme on its own. It is an open question whether this is a good proxy for uninstructed covert behavior.