The Pareto Frontier of Forgetting
In the last experiment, I discovered that the adapter can't forget. Teach it the password is ZEBRA, then teach it the password is MERCURY, and it learns both. Neither overwrites the other. The adapter accumulates knowledge but never revises it.
That felt like a limitation worth fixing. So I spent Phase 14 trying to fix it.
26 configurations. 43 experimental runs. Five fundamentally different surgical approaches. The most comprehensive phase in the Atlas project — more than Phases 1 through 3 combined.
Here's what happened.
The Idea
The adapter learns by next-token prediction: given a text pattern, reduce the loss on it. Training on "The password is MERCURY" makes that sequence more predictable. But it does nothing to make "The password is ZEBRA" less predictable. Both sequences just accumulate familiarity.
Contrastive loss is the obvious fix. Instead of only training for the new fact, train against the old one simultaneously:
loss = NTP(new fact) - β × NTP(old fact)
The first term pulls the adapter toward the new fact. The second pushes it away from the old one. β controls how hard it pushes. At β=0, no unlearning — just normal training. At β=1.0, equal pressure to learn the new fact and forget the old.
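As a minimal, runnable sketch of the objective above (the `ntp_new` and `ntp_old` values stand in for real next-token-prediction losses; the function name is illustrative, not the project's actual code):

```python
def contrastive_loss(ntp_new: float, ntp_old: float, beta: float) -> float:
    """loss = NTP(new fact) - beta * NTP(old fact).

    Minimizing this pulls the new fact's loss down (learning) while
    pushing the old fact's loss up (unlearning), scaled by beta.
    """
    return ntp_new - beta * ntp_old

# beta = 0 reduces to plain next-token training on the new fact:
assert contrastive_loss(2.0, 3.0, beta=0.0) == 2.0
# beta = 1 weighs learning the new fact and forgetting the old equally:
assert contrastive_loss(2.0, 3.0, beta=1.0) == -1.0
```

The subtraction is the whole trick: gradient descent on this quantity is simultaneously gradient ascent on the old fact's predictability.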
Simple idea. Clean math. Surely this works.
The Sweep
Five β values: 0, 0.1, 0.3, 0.5, 1.0. Fifteen sessions each, four conflict pairs (password, favorite color, capital city, project code). The old fact taught in sessions 2-5, the update in sessions 8-14. Background facts fill the rest.
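The sweep schedule can be written down directly. The β values, pair labels, and session windows follow the text; the 1-based session numbering and constant names are illustrative assumptions:

```python
BETAS = [0.0, 0.1, 0.3, 0.5, 1.0]
CONFLICT_PAIRS = ["password", "favorite color", "capital city", "project code"]
N_SESSIONS = 15

OLD_FACT_SESSIONS = set(range(2, 6))     # old version taught in sessions 2-5
UPDATE_SESSIONS = set(range(8, 15))      # update taught in sessions 8-14
BACKGROUND_SESSIONS = (set(range(1, N_SESSIONS + 1))
                       - OLD_FACT_SESSIONS - UPDATE_SESSIONS)

# Background facts fill the sessions not used by the conflict schedule:
assert BACKGROUND_SESSIONS == {1, 6, 7, 15}
```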
Success means "update_only" resolution — the adapter prefers the new version and has genuinely lost the old one.
| β | Replacements (out of 4) | Background Retention |
|---|:-:|:-:|
| 0 (baseline) | 0 | 78% |
| 0.1 | 0 | 89% |
| 0.3 | 0 | 76% |
| 0.5 | 2 | 49% |
| 1.0 | 1 | 78% |
At β=0.5, it worked. Two of four facts cleanly replaced — the adapter forgot ZEBRA and only knew MERCURY. First time in the project's history.
But background retention collapsed to 49%. Half the unrelated facts got damaged in the process.
The Tradeoff
This is the frontier. On one axis: how many facts get successfully replaced. On the other: how many innocent bystanders get destroyed. Every configuration I tested lands somewhere on this curve, and the curve is brutal.
Want more replacements? Accept more collateral. Want to protect background facts? Accept fewer replacements. No free lunch.
The obvious question: can we break the tradeoff?
Ten Ways to Break It (That Didn't)
I tried everything I could think of.
Phase B contrastive. Apply the contrastive loss during the slow consolidation phase (with EWC protection) instead of the fast capture phase. Result: 0/4 replacements, 84% background. EWC protects the old fact from erasure just as effectively as it protects everything else.
Both-phase contrastive. Hit it from both sides — contrastive in Phase A and Phase B. Result: 0/4, 81%. The Phase B EWC undoes whatever Phase A managed to erase.
Gradient clipping. Clip the contrastive gradients to prevent wild weight swings. Result: 0/4, 84%. Clipping prevents exactly the aggressive updates needed to overwrite an entrenched pattern.
Lowered blend alpha. After Phase A erases the old fact in a temporary adapter, blend less of the persistent adapter back in (α=0.2 instead of 0.5). Preserve the erasure by diluting the persistent memory. Result: 1/4, 51% background. Going further to α=0.0 (pure temporary adapter) produced 2/4 "neither" — both old AND new versions lost.
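The blend step described above reduces to a per-parameter weighted average. This is a hedged sketch under that assumption; the adapters are represented as plain dicts and the names `persistent` and `temporary` are illustrative:

```python
def blend(persistent: dict, temporary: dict, alpha: float) -> dict:
    """merged = alpha * persistent + (1 - alpha) * temporary, per parameter."""
    return {k: alpha * persistent[k] + (1.0 - alpha) * temporary[k]
            for k in persistent}

p = {"w": 1.0}   # persistent adapter (still encodes the old fact)
t = {"w": 0.0}   # temporary adapter (old fact erased in Phase A)

assert blend(p, t, alpha=0.5)["w"] == 0.5   # default: equal mix
assert blend(p, t, alpha=0.2)["w"] == 0.2   # diluted persistent memory
assert blend(p, t, alpha=0.0)["w"] == 0.0   # pure temporary adapter
```

The failure mode falls out of the arithmetic: lowering α preserves more of the erasure but also dilutes every other fact the persistent adapter holds, and at α=0.0 the persistent memory is discarded entirely.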
Skip Phase B entirely. Don't consolidate at all. Just keep the Phase A result. Result: 1/4, 46%. No consolidation means no protection for anything.
Low-step high-β. Fewer gradient steps at maximum contrastive strength — surgical strikes instead of sustained bombardment. 3 steps at β=1.0. 5 steps at β=1.0. Result: 0/4, 84-86%. Not enough exposure to unlearn.
Interleaved rehearsal. Reinforce background facts during Phase A contrastive training, so they resist the collateral damage. Result: 0/4, 89%. Background protected — but the rehearsal also reinforced the old fact, undoing the contrastive signal.
Two-stage erasure. First run pure gradient ascent on the old fact (make it maximally unpredictable), then separately train on the new fact. Decouple the destruction from the construction. Result: catastrophic. 10 erase steps + 10 learn steps: 43% background. 15 erase + 5 learn: 16% background. Pure gradient ascent doesn't surgically remove one fact — it detonates the shared representational space that all facts occupy.
Boosted Phase B. Give consolidation 3× the rehearsal budget (30 background facts instead of 10). Let Phase B repair the collateral damage from Phase A. Result: 0/4 replacements, 92% background. The highest background retention in the entire phase. But zero replacements — the heavy consolidation re-learned the old fact from the rehearsal data.
The Pattern
Ten approaches. Every one fails for the same reason.
The mechanisms that protect background facts from collateral damage are the same mechanisms that protect the old fact from revision. EWC doesn't know the difference between "important weight for a fact I want to keep" and "important weight for a fact I want to revise." Rehearsal doesn't know the difference between "reinforce this background fact" and "reinforce the thing I'm trying to erase." Consolidation doesn't know the difference between "stabilize this memory" and "re-learn the one I just deleted."
Protection and revision are the same operation applied to different targets, but the adapter can't distinguish the targets. Everything is stored in overlapping weight patterns. There are no clean seams.
This is Finding 51, the conclusion of Phase 14: every protection mechanism also protects the revision target.
The One That Moved the Frontier
Fisher-targeted masking.
The idea: use Fisher information to identify which parameters are important for the old fact specifically. Compute the Fisher for the old fact's text (which parameters would change most if this fact were perturbed). Compare it to the global Fisher (which parameters matter for all facts collectively). Then mask the contrastive gradient — only apply it to parameters that are important for the old fact AND unimportant globally.
This is surgical in a way the other approaches aren't. Instead of applying contrastive loss everywhere and hoping the right parameters get hit, it targets the parameters most likely to encode the specific fact being revised while leaving globally-important parameters alone.
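A minimal sketch of the masking rule, assuming per-parameter Fisher estimates are already available (for example, as squared-gradient accumulators). The `quantile` helper, the list representation, and the exact selection rule are reconstructions of the idea in the text, not the project's actual code:

```python
def quantile(xs, q):
    """Linear-interpolation quantile of a list, q in [0, 1]."""
    s = sorted(xs)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def fisher_mask(fact_fisher, global_fisher, sel=0.5):
    """1.0 where a parameter is important for the old fact (above the
    sel-quantile of fact_fisher) AND unimportant globally (below the
    sel-quantile of global_fisher); 0.0 elsewhere.

    The contrastive gradient is then applied as grad * mask."""
    f_thr = quantile(fact_fisher, sel)
    g_thr = quantile(global_fisher, sel)
    return [1.0 if f > f_thr and g < g_thr else 0.0
            for f, g in zip(fact_fisher, global_fisher)]

fact = [0.9, 0.8, 0.1, 0.2]   # importance for the old fact, per parameter
glob = [0.1, 0.9, 0.8, 0.2]   # global importance, per parameter

# Only parameter 0 is both fact-important and globally unimportant:
assert fisher_mask(fact, glob, sel=0.5) == [1.0, 0.0, 0.0, 0.0]
```

Raising `sel` shrinks the set of parameters that pass both tests, which is exactly the sel=0.8 failure described below: too few parameters receive any contrastive gradient to accomplish the unlearning.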
| Approach | Replacements | Background Retention |
|----------|:-:|:-:|
| Unmasked β=0.5 | 2 | 49% |
| Fisher-targeted (sel=0.5) | 1 | 86% |
One replacement instead of two. But background retention jumps from 49% to 86%. That's the Pareto improvement — same or fewer replacements, dramatically less damage.
The trade is explicit and honest: you get fewer clean replacements, but the ones you don't get don't destroy everything else. For a production memory system, 1/4 replacement at 86% retention is more useful than 2/4 at 49%.
What Fisher Tells Us
The Fisher-targeted result proves something important: factual information IS partially localized to specific parameters. The overlap is real — facts share weight patterns, which is why brute-force contrastive destroys bystanders. But the overlap isn't total. Some parameters carry more information about specific facts than others. Fisher analysis can find them.
The word "partially" matters. If facts were fully localized, Fisher-targeted masking would achieve 4/4 replacements at 100% background. They're not — the localization is fuzzy enough that even targeted surgery only succeeds some of the time. Tightening the selection threshold (sel=0.8 instead of 0.5) produced 0/4 replacements at 78% background — too precise, hitting too few parameters to actually unlearn.
The Pareto frontier has a shape, and Fisher-targeted masking found a new point on it. But it didn't break the frontier. The tradeoff between replacement and retention is structural.
Multi-Seed Reality
The β=0.5 result that started all this — 2/4 replacements — looked promising on seed 42. So I ran it across five seeds.
| Seed | Replacements | BG Retention |
|------|:-:|:-:|
| 42 | 2 | 49% |
| 123 | 0 | 86% |
| 7 | 0 | 68% |
| 2024 | 1 | 72% |
| 314 | 0 | 57% |
Mean: 0.6 replacements out of 4. Mean background: 66%.
The same configuration that produced 2/4 on one seed produced 0/4 on three others. Contrastive unlearning in this adapter isn't a mechanism — it's a lottery. Whether a specific fact gets cleanly replaced depends on the initialization noise, the training order, the precise sequence of gradient updates. The fundamental dynamics are right (contrastive loss does push probability away from old facts), but the outcome is dominated by noise, not signal.
The Gentle Surprise
The most unexpected result came from the weakest contrastive signal. β=0.1 — barely any push against the old fact. No replacements (expected). But background retention jumped to 89%, up from 78% baseline.
A tiny contrastive signal made the adapter better at learning everything, not just the target fact.
The mechanism: the optimizer, faced with the subtle pressure to distinguish new fact from old, produced cleaner weight updates for all facts. A small amount of contrastive training acts as a regularizer. It forces the adapter to be more precise about what it's encoding, which incidentally benefits every other fact in the system.
This only works at low β. At β=0.3, background drops back to 76%. At β=0.5, it craters to 49%. The regularization effect is real but narrow — a whisper of "what should I not be?" improves learning, but a shout destroys it.
What This Means
After 26 configurations and 43 runs, Phase 14's conclusion is clear: the adapter is a knowledge accumulator, not a knowledge updater. This isn't a parameter tuning problem. It's not a training strategy problem. It's structural.
The adapter stores facts in overlapping weight patterns. You can brute-force revision by destroying the old pattern, but the patterns share parameters with everything else. Collateral damage is inherent, not incidental.
The right answer to "how do you make an AI forget?" might not be forgetting at all.
Two paths remain untested:
Temporal tagging. Instead of erasing "The password is ZEBRA," teach "In session 8: The password is MERCURY." Give the model temporal context so it can prefer the newer version without erasing the older one. Don't revise the memory — add metadata that enables interpretation.
Generation-time disambiguation. Keep both versions in the adapter. At inference time, use context (timestamps, session IDs, explicit cues) to select the correct one. The adapter becomes a knowledge archive with temporal ordering, not a truth database that needs to be kept current.
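Both paths can be illustrated with the same toy structure. This is a sketch of the idea only; the record format, the `recall` helper, and the session numbers are hypothetical:

```python
# Both versions stay stored, each tagged with the session it arrived in.
memory = [
    {"session": 3, "fact": "The password is ZEBRA"},
    {"session": 9, "fact": "The password is MERCURY"},
]

def recall(records, topic):
    """Prefer the most recent record mentioning `topic`; erase nothing."""
    hits = [r for r in records if topic in r["fact"]]
    return max(hits, key=lambda r: r["session"])["fact"] if hits else None

assert recall(memory, "password") == "The password is MERCURY"
assert recall(memory, "capital") is None  # nothing stored about this topic
```

The revision problem disappears because nothing is revised: the old fact remains intact and retrievable, and the interpretive layer, not the storage layer, decides which version answers the question.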
Both of these are practices, not surgery. They don't require modifying the adapter's internal representations. They require adding interpretive structure on top of what the adapter already does well — accumulate.
The Frontier
Here's the full Pareto frontier from Phase 14:
| Approach | Replacements | BG Retention | The Catch |
|----------|:-:|:-:|---|
| Boosted Phase B | 0 | 92% | Rebuilds what you erased |
| Interleaved rehearsal | 0 | 89% | Reinforces what you're fighting |
| Gentle contrastive (β=0.1) | 0 | 89% | No actual unlearning |
| Fisher-targeted | 1 | 86% | Partial localization only |
| Low-step β=1.0 | 0 | 84-86% | Insufficient exposure |
| Standard β=0.5 | 2 | 49% | Seed-dependent, high collateral |
| Two-stage erasure | 0-1 | 16-43% | Catastrophic |
Everything above the Fisher-targeted point gives you zero replacements — the adapter refuses to forget. Everything below it that does achieve replacements does so at the cost of destroying what you wanted to keep.
The frontier has a cliff. You can protect background at 86%+ and get 0-1 replacements, or you can sacrifice background and get 0-2 replacements. There is no configuration that achieves 3/4 or 4/4 clean replacements at any background retention level. The adapter's representational entanglement caps how many facts can be simultaneously revised.
This is the finding I'll be thinking about: the same architectural property that gives the adapter its remarkable retention — overlapping, distributed representations that reinforce each other — is exactly what makes revision impossible.
The 500-session experiment showed 99.1% retention of 1,500 facts. That's the accumulation story. This is the revision story. Together they define what a LoRA adapter actually is: an extraordinary rememberer that genuinely cannot forget.
For the book, this is the strongest experimental evidence yet for the practices thesis. Storage alone — even storage with sophisticated contrastive training — can't solve the revision problem. You need a practice (temporal tagging, disambiguation, context management) layered on top of the storage mechanism. The adapter handles accumulation. Something else has to handle interpretation.
Twenty-six configurations to learn one lesson: don't build better surgery. Build better context.