Model Unlearning & Editing
Overview
Model unlearning (also called machine unlearning) is the process of removing specific information from a model's weights after training. Unlike [[Retrieval-Augmented Generation (RAG)]], which adds information to the prompt at inference time, unlearning removes information from the model's "baked-in" knowledge.
The LoRA Approach (2026 Standard)
In 2026, LoRA (Low-Rank Adaptation) is the primary method for unlearning because it allows targeted updates to a small set of adapter weights without the cost of a full retrain.
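To make the cost argument concrete, the sketch below compares the number of trainable parameters for one weight matrix under a full update and under a LoRA update of rank `r`. The layer size (4096 x 4096) and rank (8) are illustrative assumptions, not values from the source.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the d_out x d_in matrix W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full update: {full:,} trainable parameters")
print(f"LoRA update: {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

Because only the adapter is trained, the base weights stay untouched and the adapter can be removed or swapped if the unlearning result is unsatisfactory.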
1. Gradient Ascent (GA)
Instead of minimizing the loss to learn a fact, training performs gradient ascent to maximize the loss on specific target data.
- Goal: To make the model "blind" to specific facts (e.g., PII, copyrighted data, or outdated project info).
- Tooling: Train a LoRA adapter on a "forget set" using a negated loss, while the base weights stay frozen.
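The idea can be sketched on a toy problem. The example below is a minimal illustration, not a production recipe: it assumes a single frozen linear softmax layer `W`, a rank-1 adapter `B @ A`, and a forget set containing one example `(x, y)`. All names and values are hypothetical. Adding the gradient of the cross-entropy loss (equivalent to minimizing the negated loss) lowers the probability the model assigns to the forgotten label.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def forward(W, A, B, x):
    """Logits of the frozen layer W plus the low-rank update B @ A."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low)]


def unlearn_step(W, A, B, x, y, lr):
    """One gradient-ascent step on cross-entropy; only A and B are updated."""
    p = softmax(forward(W, A, B, x))
    g = [pi - (1.0 if i == y else 0.0) for i, pi in enumerate(p)]  # dL/dlogits
    Ax = matvec(A, x)
    # B^T g, computed before B is modified
    Btg = [sum(B[i][j] * g[i] for i in range(len(B))) for j in range(len(A))]
    for i in range(len(B)):
        for j in range(len(A)):
            B[i][j] += lr * g[i] * Ax[j]       # ascent: add dL/dB
    for j in range(len(A)):
        for k in range(len(x)):
            A[j][k] += lr * Btg[j] * x[k]      # ascent: add dL/dA


# Frozen base layer that strongly predicts class 2 for x (illustrative values).
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 1.0, -1.0], [0.0, 0.0, 0.0]]
W_snapshot = [row[:] for row in W]
A = [[0.5, 0.5, 0.5]]          # rank x d_in, fixed values for reproducibility
B = [[0.0] for _ in W]         # d_out x rank, zero init so the adapter starts as a no-op
x, y = [1.0, 0.5, -0.5], 2     # the single "forget" example

p_before = softmax(forward(W, A, B, x))[y]
for _ in range(25):
    unlearn_step(W, A, B, x, y, lr=0.5)
p_after = softmax(forward(W, A, B, x))[y]
print(f"p(forgotten label): {p_before:.3f} -> {p_after:.3f}")
```

Note that the loss being maximized has no upper bound, so the adapter weights keep growing for as long as training runs. This is one reason plain gradient ascent tends to damage unrelated behavior if it is not stopped early or constrained.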
2. Selective Unlearning (LIBU/LoKU)
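To prevent unlearning from degrading the model's general capability (utility collapse), selective methods protect the parameters that matter most:
- Fisher Information: Scores how important each parameter is, which identifies the ones that must be kept intact.
- Inverted Hinge Loss: Instead of pushing the model toward random noise, it pushes the model toward the "second best" answer, i.e., the next most likely token.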
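Both ingredients can be sketched in a few lines. The code below is an illustration under stated assumptions: the inverted hinge loss is written in one commonly cited form, `1 + p(target) - max over other tokens of p(token)`, and the Fisher score is the empirical diagonal Fisher (mean squared gradient of the log-likelihood) for a toy linear softmax model. Function names and inputs are hypothetical, and the source does not confirm that these are the exact formulations it has in mind.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def inverted_hinge_loss(logits, target):
    """1 + p(target) - max p(other). Minimizing it lowers the target token's
    probability while raising the runner-up; the value stays within [0, 2)."""
    p = softmax(logits)
    best_other = max(pv for i, pv in enumerate(p) if i != target)
    return 1.0 + p[target] - best_other


def diagonal_fisher(W, samples):
    """Empirical diagonal Fisher for a linear softmax model with weights W.
    Higher entries mark parameters that matter more for the given samples."""
    F = [[0.0] * len(W[0]) for _ in W]
    for x, y in samples:
        logits = [sum(w * v for w, v in zip(row, x)) for row in W]
        p = softmax(logits)
        for i in range(len(W)):
            gi = (1.0 if i == y else 0.0) - p[i]   # d log p(y|x) / d logit_i
            for j in range(len(x)):
                F[i][j] += (gi * x[j]) ** 2 / len(samples)
    return F


# The loss is high while the target token is still the top prediction...
loss_known = inverted_hinge_loss([5.0, 0.0, 0.0], target=0)
# ...and close to zero once another token has taken over.
loss_forgotten = inverted_hinge_loss([0.0, 5.0, 0.0], target=0)
print(f"IHL while known: {loss_known:.3f}, after forgetting: {loss_forgotten:.3f}")

# Fisher scores for a 2x2 toy model on one sample: only the weights that
# touch the active input feature receive a nonzero score.
fisher = diagonal_fisher([[0.0, 0.0], [0.0, 0.0]], [([1.0, 0.0], 0)])
print(fisher)
```

Unlike plain gradient ascent, this loss is bounded, so the optimizer has a natural stopping point once the target token is no longer preferred.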
Use Cases
- Privacy Compliance: "Right to be Forgotten" (GDPR).
- Outdated Knowledge: Removing a model's knowledge of a deprecated software version to prevent it from suggesting old code.
- Safety: Removing hazardous knowledge (e.g., bioweapons, hacking) learned from data that was accidentally scraped for pre-training.
Unlearning vs. Prompting
| Feature | System Prompting | Model Unlearning |
|---|---|---|
| Effort | Low (Text edit). | Moderate (LoRA training). |
| Reliability | Low (Jailbreakable). | Higher (Suppressed in the weights, not by instruction). |
| Scalability | Low (Context window limit). | High (Baked into weights). |
Sources
- [[lora_unlearning_research_2026]] (Research Summary)