
Model Unlearning & Editing

Overview

Model unlearning (also called machine unlearning) is the process of removing specific information from a model’s weights after training. Unlike [[Retrieval-Augmented Generation (RAG)]], which adds information to the prompt, unlearning removes information from the model’s “baked-in” knowledge.

The LoRA Approach (2026 Standard)

In 2026, LoRA (Low-Rank Adaptation) is the primary method for unlearning because it allows surgical updates without the cost of a full retrain: only a small set of low-rank adapter weights is trained, while the base model stays frozen.
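To make the cost argument concrete, here is a minimal sketch that compares the number of trainable parameters in a full update of one weight matrix with a LoRA update of the same matrix. The function name `lora_param_counts` and the example sizes (a 4096 x 4096 layer, rank 8) are illustrative assumptions, not values from the source.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_update_params, lora_params) for a single weight matrix.

    Hypothetical helper for illustration only.
    """
    # A full fine-tune updates every entry of the d_out x d_in matrix W.
    full = d_out * d_in
    # LoRA learns W + B @ A, where B is d_out x rank and A is rank x d_in.
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed example sizes: one 4096 x 4096 projection, adapter rank 8.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full update: {full} params, LoRA update: {lora} params")
print(f"LoRA trains {lora / full:.2%} of the matrix")
```

Because the adapter is stored separately from the base weights, an unlearning adapter can also be removed or swapped without touching the original model.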

1. Gradient Ascent (GA)

Instead of minimizing the loss to learn a fact, training performs gradient ascent to maximize the loss on specific target data.

  • Goal: To make the model “blind” to specific facts (e.g., PII, copyrighted data, or outdated project info).
  • Tooling: Train a LoRA adapter on a “forget set” with the training loss negated.
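The idea above can be sketched on a toy problem. The example below is not a LoRA training loop: it treats a three-entry logit vector as the whole "model" and takes ascent steps on the cross-entropy loss of one forget example. The names `gradient_ascent_unlearn`, `forget_logits`, and `forget_target`, along with the learning rate and step count, are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target):
    """Gradient of -log p[target] with respect to the logits: p - onehot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def gradient_ascent_unlearn(logits, target, lr=1.0, steps=30):
    """Take `steps` gradient ASCENT steps on the forget example's loss."""
    logits = list(logits)
    for _ in range(steps):
        grad = cross_entropy_grad(logits, target)
        # Ascent moves with the gradient; ordinary training would subtract it.
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits


# Toy "model" that strongly predicts token 0, the fact to forget.
forget_logits = [4.0, 1.0, 0.5]
forget_target = 0

before = softmax(forget_logits)[forget_target]
after = softmax(gradient_ascent_unlearn(forget_logits, forget_target))[forget_target]
print(f"p(target) before={before:.3f}, after={after:.6f}")
```

Note that nothing stops the loss from growing: the ascent objective is unbounded, so running it for too long keeps distorting the model well past the point where the target is forgotten.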

2. Selective Unlearning (LIBU/LoKU)

To prevent the model from becoming “stupid” (utility collapse), advanced methods protect critical neurons:

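  • Fisher Information: Scores how strongly each parameter affects the model’s outputs, identifying the neurons that must be kept intact.
  • Inverted Hinge Loss: Instead of pushing the model toward random noise, it lowers the probability of the target answer while pushing the model toward the “second best” safe answer.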
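Both ideas can be sketched in a few lines of plain Python. The source does not give formulas, so the forms below are assumptions: the inverted hinge loss is written as 1 + p(target) - max p(other), and the Fisher score is the common empirical diagonal estimate (the mean squared gradient per parameter). The function names and example numbers are hypothetical.

```python
def inverted_hinge_loss(probs, target):
    """Assumed form: 1 + p(target) - (highest probability among the other tokens).

    The value is low when the target is suppressed and the runner-up dominates,
    and it always stays between 0 and 2.
    """
    best_other = max(p for i, p in enumerate(probs) if i != target)
    return 1.0 + probs[target] - best_other


def diagonal_fisher(per_example_grads):
    """Empirical diagonal Fisher estimate: mean squared gradient per parameter.

    `per_example_grads` is a list of gradient vectors, one per example.
    """
    n = len(per_example_grads)
    dims = len(per_example_grads[0])
    return [sum(g[j] ** 2 for g in per_example_grads) / n for j in range(dims)]


# Before unlearning the model is confident in the target token (index 0).
loss_before = inverted_hinge_loss([0.7, 0.2, 0.1], target=0)
# After unlearning the former runner-up (index 1) has taken over.
loss_after = inverted_hinge_loss([0.1, 0.7, 0.2], target=0)
print(f"IHL before={loss_before:.2f}, after={loss_after:.2f}")

# Hypothetical gradients on retained data for a two-parameter model.
retain_grads = [[0.1, 2.0], [-0.1, 1.0]]
fisher = diagonal_fisher(retain_grads)
print("Fisher scores:", fisher)  # the second parameter scores far higher
```

In this sketch a high Fisher score on retained data marks a parameter as one to protect. Unlike the gradient ascent objective, the inverted hinge loss is bounded, which is one reason it is less prone to utility collapse.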

Use Cases

  • Privacy Compliance: “Right to be Forgotten” (GDPR).
  • Outdated Knowledge: Removing a model’s knowledge of a deprecated software version to prevent it from suggesting old code.
  • Safety: Removing hazardous knowledge (e.g., bioweapons, hacking) that was accidentally scraped during pre-training.

Unlearning vs. Prompting

| Feature | System Prompting | Model Unlearning |
| --- | --- | --- |
| Effort | Low (Text edit). | Moderate (LoRA training). |
| Reliability | Low (Jailbreakable). | High (Mathematically suppressed). |
| Scalability | Low (Context window limit). | High (Baked into weights). |

Sources

  • [[lora_unlearning_research_2026]] (Research Summary)