There has been recent interest in teaching models to communicate their own uncertainty rather than estimating it through separate post-hoc methods. This direction fits an existing trend of building self-evaluation capabilities directly into the model rather than relying on external classifiers. From a systems perspective, embedding such metacognitive tokens directly into the generative process is attractive because it (1) eliminates the need for separate inference passes, (2) gives the model access to its internal sense of uncertainty as it unfolds during generation rather than retrospectively, and (3) allows the uncertainty signal to be conditioned on the specific tokens being generated rather than just the input query.
The paper's training procedure follows a pattern common to metacognitive-token training approaches. It begins with confidence-token annotation. For each training sample, the base model $M$ first generates a prediction $\hat{y}^{(i)}$. When the prediction is incorrect ($y^{(i)} \neq \hat{y}^{(i)}$), the training example is augmented by appending an unconfident token $\langle \text{UN} \rangle$ to the incorrect prediction; when it is correct ($y^{(i)} = \hat{y}^{(i)}$), a confident token $\langle \text{CN} \rangle$ is appended instead. This creates two augmented datasets that are combined to form the training set. The model thus learns to associate confidence tokens with its own error patterns rather than with a vague, generic notion of confidence.
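As a rough illustration, the annotation step can be sketched as below; the function, method, and field names are placeholders for the idea, not the paper's actual implementation:

```python
def annotate_with_confidence_tokens(model, dataset):
    """Label each example with <CN>/<UN> depending on whether the base model's
    own prediction matches the gold answer (a sketch, not the paper's code)."""
    augmented = []
    for x, y in dataset:
        y_hat = model.generate(x)  # base model's own prediction for the query
        token = "<CN>" if y_hat == y else "<UN>"
        # The target is the model's own output followed by the confidence token,
        # so at training time the token is conditioned on (x, y_hat).
        augmented.append({
            "input": x,
            "target": f"{y_hat} {token}",
            "correct": y_hat == y,
        })
    return augmented
```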
A recurring challenge for metacognitive training approaches, including ones like *A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens*, is that they must teach the model to recognize undesirable outputs without increasing the likelihood of producing them, which calls for extra care in the training setup. This routing paper addresses the concern by masking gradients on the incorrect completions themselves and updating only on the confidence tokens. The red flag paper similarly uses a KL-divergence term on harmless continuations to preserve the original distribution. Both approaches require balancing multiple loss terms and paying attention to where gradients flow during training. Considerable tuning may therefore be needed to reproduce the reported results, and applying the method to real-world deployment setups might be finicky, especially for agentic tasks or reasoning models with longer trajectories.
An important subtlety is that masking gradients on incorrect answers does not reduce the approach to simple input-based classification. Consider what the model conditions on when predicting confidence. The training objective can be understood as: $$\mathcal{L} = \begin{cases} \mathcal{L}_{\text{ans}}(\hat{y} \mid x) + \mathcal{L}_{\text{conf}}(\langle\text{CN}\rangle \mid x, \hat{y}), & \text{if } \hat{y} = y \\ 0 \cdot \mathcal{L}_{\text{ans}}(\hat{y} \mid x) + \mathcal{L}_{\text{conf}}(\langle\text{UN}\rangle \mid x, \hat{y}), & \text{if } \hat{y} \neq y \end{cases}$$ where the zero coefficient masks the loss on incorrect answer tokens. When the model generates the confidence token, it attends to all previous tokens in the sequence, including both the input query $x$ and the answer it just generated $\hat{y}$. The gradient masking only prevents the model from increasing $P(\hat{y} \mid x)$ when $\hat{y}$ is wrong, but the model still updates its parameters to better predict $P(\langle\text{UN}\rangle \mid x, \hat{y})$. This means the model learns to recognize patterns in its own wrong answers, not just difficult queries. For instance, it might learn that certain types of arithmetic errors warrant lower confidence than others, information that would be unavailable to a classifier that only sees $x$.
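A minimal sketch of how this masking could be implemented with the standard cross-entropy ignore index, assuming single-example tensors and a causal LM; the paper's actual implementation may differ:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore index for cross-entropy

def build_labels(input_ids, answer_span, conf_token_pos, is_correct):
    """Return labels that mask answer tokens for incorrect examples.

    input_ids:      (seq_len,) token ids of [query, answer, confidence token]
    answer_span:    (start, end) indices of the answer tokens
    conf_token_pos: index of the <CN>/<UN> token
    is_correct:     whether the answer matched the gold label
    """
    labels = input_ids.clone()
    labels[: answer_span[0]] = IGNORE_INDEX  # never train on the query tokens
    if not is_correct:
        # zero coefficient on L_ans for wrong answers: no gradient toward P(y_hat | x)
        labels[answer_span[0]:answer_span[1]] = IGNORE_INDEX
    labels[conf_token_pos] = input_ids[conf_token_pos]  # always supervise the confidence token
    return labels

def masked_lm_loss(logits, labels):
    # Standard causal-LM shift: each position predicts the next token;
    # positions labeled IGNORE_INDEX contribute nothing to the loss.
    return F.cross_entropy(
        logits[:-1].reshape(-1, logits.size(-1)),
        labels[1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

Note that the masking only removes the loss terms on the wrong answer tokens; the answer tokens still appear in the context window when the confidence token is predicted, which is exactly why the learned signal conditions on $(x, \hat{y})$ rather than on $x$ alone.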
If you approach this routing problem from an outside-in perspective, you end up training a verifier to judge the correctness of model outputs. But this is paradoxical for two reasons. First, if you deploy a strong model for verification all the time, why work on routing in the first place? Second, an external monitor has no access to the model's internals during generation, so it misses whatever sense of uncertainty emerges during the generative process itself.
Instead, the paper performs a simple binary classification during training, labeling outputs with $\langle\text{CN}\rangle$ when correct and $\langle\text{UN}\rangle$ when incorrect. This sidesteps both challenges and sits somewhere in the middle. First, it provides a grounding signal that is inaccessible if you think purely in terms of uncertainty quantification: it takes the model's initially fuzzy notion of uncertainty and trains the model to report it in a way that correlates with actual correctness, which is the only signal that matters for routing decisions. Rather than trying to extract an abstract notion of confidence, the supervised training directly aligns the model's internal uncertainty with empirical error patterns. Second, it avoids the need for an external verifier by making the model its own detector: the model learns during training which internal patterns correlate with errors, then simply recognizes those patterns at inference time.
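At inference time, routing then reduces to checking the model's confidence signal after it produces an answer. A minimal sketch, assuming a Hugging Face-style interface, placeholder token strings, and a tunable threshold (all assumptions, not details from the paper):

```python
import torch

def route_query(small_model, tokenizer, query, threshold=0.5):
    """Answer with the small model; signal escalation when it reports low confidence.
    Illustrative only: the interface, token strings, and threshold are assumptions."""
    inputs = tokenizer(query, return_tensors="pt").input_ids
    answer_ids = small_model.generate(inputs)          # query + generated answer
    logits = small_model(answer_ids).logits[0, -1]     # next-token logits after the answer
    probs = torch.softmax(logits, dim=-1)
    cn_id = tokenizer.convert_tokens_to_ids("<CN>")
    un_id = tokenizer.convert_tokens_to_ids("<UN>")
    # Renormalize over just the two confidence tokens to get a confidence score.
    p_confident = (probs[cn_id] / (probs[cn_id] + probs[un_id])).item()
    if p_confident >= threshold:
        return tokenizer.decode(answer_ids[0], skip_special_tokens=True)
    return None  # caller routes the query to a larger model instead
```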