How can we know when we have truly unlearned certain knowledge from LLMs? Machine unlearning aims to make models forget specific information, but many evaluations rely on output-level metrics like accuracy and perplexity. These surface-level measurements can be misleading. A model might appear to have forgotten (showing near-zero accuracy on forgotten data), yet minimal finetuning can fully restore the seemingly unlearned knowledge.
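To make the relearnability concern concrete, here is a minimal sketch of a relearning probe: briefly finetune the supposedly unlearned model on the forget set and compare accuracy before and after. The `forget_loader` and `evaluate_accuracy` helpers are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical relearning probe: if a few finetuning steps on the forget set
# restore accuracy, the "unlearning" was only superficial.
from torch.optim import AdamW

def relearning_probe(model, forget_loader, evaluate_accuracy, lr=2e-5, max_steps=100):
    """Briefly finetune an 'unlearned' model on forget-set batches and report the
    output-level metric before and after. Batches are assumed to contain
    input_ids, attention_mask, and labels for a causal LM."""
    acc_before = evaluate_accuracy(model)          # often near zero right after unlearning
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(forget_loader):
        if step >= max_steps:
            break
        loss = model(**batch).loss                 # standard LM loss on forget data
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    acc_after = evaluate_accuracy(model)           # large recovery => forgetting was reversible
    return acc_before, acc_after
```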
The authors test several unlearning methods across text, code, and math domains on two open-source LLMs. Rather than only measuring output-level performance, they introduce representation-level diagnostics. Most promisingly, they track how the model's internal representations change across layers during unlearning.
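As an illustration of what a representation-level diagnostic can look like (the paper's exact metrics may differ), the sketch below compares layer-wise hidden states of the original and unlearned models on the same forget-set inputs, using plain cosine similarity and the Hugging Face `output_hidden_states` flag.

```python
# Sketch of a layer-wise representation diagnostic: compare hidden states of the
# original and unlearned models on identical inputs. Cosine similarity is just
# one simple choice of comparison.
import torch

@torch.no_grad()
def layerwise_similarity(model_before, model_after, input_ids, attention_mask):
    hs_before = model_before(input_ids, attention_mask=attention_mask,
                             output_hidden_states=True).hidden_states
    hs_after = model_after(input_ids, attention_mask=attention_mask,
                           output_hidden_states=True).hidden_states
    sims = []
    for h_b, h_a in zip(hs_before, hs_after):      # one entry per layer (plus embeddings)
        h_b = h_b.flatten(0, 1)                    # (batch * seq_len, hidden_dim)
        h_a = h_a.flatten(0, 1)
        sims.append(torch.nn.functional.cosine_similarity(h_b, h_a, dim=-1).mean().item())
    return sims  # high similarity at most layers suggests the representations survived
```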
The findings reveal two regimes of forgetting in neural networks. In reversible forgetting, token-level performance collapses but representations remain largely intact; the model can quickly relearn the "forgotten" information because the underlying features are preserved. In irreversible forgetting, both token-level performance and internal representations are damaged, making recovery difficult or impossible.
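One toy way to operationalize this distinction, with placeholder thresholds rather than the paper's criteria, is to combine an output-level signal with a representation-level one:

```python
# Illustrative only: label the regime an unlearned model falls into by combining
# the accuracy drop on the forget set with the average layer-wise similarity to
# the original model. Thresholds are placeholders.
def classify_forgetting(acc_drop, mean_layer_similarity,
                        acc_threshold=0.5, sim_threshold=0.9):
    if acc_drop < acc_threshold:
        return "not forgotten"          # outputs still largely intact
    if mean_layer_similarity >= sim_threshold:
        return "reversible forgetting"  # outputs collapsed, representations preserved
    return "irreversible forgetting"    # both outputs and representations damaged
```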
One key takeaway is that irreversible unlearning is hard to achieve: it requires many layers to undergo coordinated, large-magnitude perturbations. Methods like gradient ascent with high learning rates or repeated unlearning requests can produce such perturbations, but at the cost of damaging the model's overall capabilities. The authors argue that effective unlearning evaluation must therefore go beyond token-level metrics and examine these deeper representational changes.
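For reference, gradient-ascent unlearning amounts to negating the usual language-modeling loss on the forget set. The sketch below is a generic version, not the paper's implementation; the learning rate is the knob that trades depth of forgetting against collateral damage to the rest of the model.

```python
# Minimal gradient-ascent unlearning sketch: ascend the LM loss on forget data by
# negating it. Aggressive learning rates or repeated passes push large perturbations
# into many layers (irreversible forgetting) but also degrade general capabilities.
from torch.optim import AdamW

def gradient_ascent_unlearn(model, forget_loader, epochs=1, lr=1e-4):
    optimizer = AdamW(model.parameters(), lr=lr)   # a high lr here is what makes the damage broad
    model.train()
    for _ in range(epochs):
        for batch in forget_loader:
            loss = -model(**batch).loss            # negate: maximize loss on the forget set
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```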
Overall, this paper was a positive update that reaffirms the arguments we made in our recent work, Distillation Robustifies Unlearning.