arXiv:2509.11816v2 Announce Type: replace-cross
Abstract: Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause: unlearning targets representations that are too general. We therefore develop a highly selective technique that unlearns robustly while preserving general performance.
Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned.
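To make the mechanism concrete, below is a minimal PyTorch sketch of this idea for a single linear module. The helper names (principal_subspace, collapse, cir_update), the rank k, the outer-product update, and the assumption that the common subspace is estimated from general data are illustrative choices for this sketch, not the paper's exact implementation.

    import torch

    def principal_subspace(X, k):
        # torch.pca_lowrank centers X internally; the columns of V are the
        # top-k principal directions of the rows of X, shape (d, k).
        _, _, V = torch.pca_lowrank(X, q=k)
        return V

    def collapse(vectors, basis):
        # Remove the component of each row that lies in span(basis),
        # i.e. collapse the "common" subspace.
        return vectors - (vectors @ basis) @ basis.T

    def cir_update(acts_fact, grads_fact, acts_general, grads_general,
                   k=8, lr=1e-3):
        # Fit the common subspaces with PCA on activations and
        # module-output gradients from general data (an assumption here)...
        U_act = principal_subspace(acts_general, k)
        U_grad = principal_subspace(grads_general, k)
        # ...then collapse them out of the fact-specific statistics, so the
        # update touches only representations specific to the target fact.
        a = collapse(acts_fact, U_act)      # (n, in_dim)
        g = collapse(grads_fact, U_grad)    # (n, out_dim)
        # For a linear module y = x @ W.T, dL/dW = g.T @ a; the negative
        # sign gives a gradient-ascent-style unlearning step.
        return -lr * (g.T @ a)              # update for W, (out_dim, in_dim)

    # Usage with random stand-in tensors:
    n, d_in, d_out = 256, 64, 32
    delta_W = cir_update(torch.randn(n, d_in), torch.randn(n, d_out),
                         torch.randn(n, d_in), torch.randn(n, d_out))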
When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve a more than 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less and using under 3 GPU-seconds per fact.
Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.