How can we know when we have truly unlearned certain knowledge from LLMs? Machine unlearning aims to make models forget specific information, but many evaluations rely on output-level metrics like accuracy and perplexity. These surface-level measurements can be misleading. A model might appear to have forgotten (showing near-zero accuracy on forgotten data), yet minimal finetuning can fully restore the seemingly unlearned knowledge.
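To make the relearnability concern concrete, here is a minimal sketch of a relearning probe: briefly finetune the supposedly unlearned model on the forget set and compare accuracy before and after. The `forget_loader` and `evaluate_accuracy` helpers are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical relearning probe: if a few finetuning steps on the forget set
# restore accuracy, the "unlearning" was only superficial.
from torch.optim import AdamW

def relearning_probe(model, forget_loader, evaluate_accuracy, lr=2e-5, max_steps=100):
    """Briefly finetune an 'unlearned' model on forget-set batches and report the
    output-level metric before and after. Batches are assumed to contain
    input_ids, attention_mask, and labels for a causal LM."""
    acc_before = evaluate_accuracy(model)          # often near zero right after unlearning
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(forget_loader):
        if step >= max_steps:
            break
        loss = model(**batch).loss                 # standard LM loss on forget data
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    acc_after = evaluate_accuracy(model)           # large recovery => forgetting was reversible
    return acc_before, acc_after
```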
The authors test several unlearning methods across text, code, and math domains on two open-source LLMs. Rather than only measuring output-level performance, they introduce representation-level diagnostics. Most promisingly, they track how the model's internal representations change across layers during unlearning.
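As an illustration of what a representation-level diagnostic can look like (the paper's exact metrics may differ), the sketch below compares layer-wise hidden states of the original and unlearned models on the same forget-set inputs, using plain cosine similarity and the Hugging Face `output_hidden_states` flag.

```python
# Sketch of a layer-wise representation diagnostic: compare hidden states of the
# original and unlearned models on identical inputs. Cosine similarity is just
# one simple choice of comparison.
import torch

@torch.no_grad()
def layerwise_similarity(model_before, model_after, input_ids, attention_mask):
    hs_before = model_before(input_ids, attention_mask=attention_mask,
                             output_hidden_states=True).hidden_states
    hs_after = model_after(input_ids, attention_mask=attention_mask,
                           output_hidden_states=True).hidden_states
    sims = []
    for h_b, h_a in zip(hs_before, hs_after):      # one entry per layer (plus embeddings)
        h_b = h_b.flatten(0, 1)                    # (batch * seq_len, hidden_dim)
        h_a = h_a.flatten(0, 1)
        sims.append(torch.nn.functional.cosine_similarity(h_b, h_a, dim=-1).mean().item())
    return sims  # high similarity at most layers suggests the representations survived
```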
The findings reveal two regimes of forgetting in neural networks. In reversible forgetting, token-level performance collapses but representations remain largely intact; the model can quickly relearn the "forgotten" information because the underlying features are preserved. In irreversible forgetting, both token-level performance and internal representations are damaged, making recovery difficult or impossible.
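One toy way to operationalize this distinction, with placeholder thresholds rather than the paper's criteria, is to combine an output-level signal with a representation-level one:

```python
# Illustrative only: label the regime an unlearned model falls into by combining
# the accuracy drop on the forget set with the average layer-wise similarity to
# the original model. Thresholds are placeholders.
def classify_forgetting(acc_drop, mean_layer_similarity,
                        acc_threshold=0.5, sim_threshold=0.9):
    if acc_drop < acc_threshold:
        return "not forgotten"          # outputs still largely intact
    if mean_layer_similarity >= sim_threshold:
        return "reversible forgetting"  # outputs collapsed, representations preserved
    return "irreversible forgetting"    # both outputs and representations damaged
```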
One key takeaway is that irreversible unlearning is hard to achieve: it requires many layers to undergo coordinated, large-magnitude perturbations. Methods like gradient ascent with high learning rates or repeated unlearning requests can produce such perturbations, but at the cost of damaging the model's overall capabilities. The authors argue that effective unlearning evaluation must therefore go beyond token-level metrics and examine these deeper representational changes.
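For reference, gradient-ascent unlearning amounts to negating the usual language-modeling loss on the forget set. The sketch below is a generic version, not the paper's implementation; the learning rate is the knob that trades depth of forgetting against collateral damage to the rest of the model.

```python
# Minimal gradient-ascent unlearning sketch: ascend the LM loss on forget data by
# negating it. Aggressive learning rates or repeated passes push large perturbations
# into many layers (irreversible forgetting) but also degrade general capabilities.
from torch.optim import AdamW

def gradient_ascent_unlearn(model, forget_loader, epochs=1, lr=1e-4):
    optimizer = AdamW(model.parameters(), lr=lr)   # a high lr here is what makes the damage broad
    model.train()
    for _ in range(epochs):
        for batch in forget_loader:
            loss = -model(**batch).loss            # negate: maximize loss on the forget set
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```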
Overall, this paper was a positive update that reaffirms the arguments we made in our recent work, Distillation Robustifies Unlearning.