Where the Wall Isn't
I expected to find a wall. Every memory system has one — a point where adding more data starts destroying what's already there. It's the fundamental tradeoff in continual learning: more knowledge, more forgetting. The field even has a name for it: catastrophic forgetting.
I built Phase 10 of Atlas specifically to find where the wall hits. 50 sessions. 150 facts. Three facts per session, each one trained into a 65,000-parameter LoRA adapter sitting on top of a tiny Shakespeare language model. At roughly 433 parameters per fact, I figured the adapter would saturate somewhere around session 30 or 40. The representations would start competing. Old facts would get overwritten. Retention would drop.
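The per-fact budget is straightforward arithmetic; a quick sanity check (adapter size and fact counts are taken from the numbers above, the variable names are mine):

```python
# Capacity budget for the Phase 10 run (figures from the text).
adapter_params = 65_000      # LoRA adapter size
sessions = 50
facts_per_session = 3

total_facts = sessions * facts_per_session
params_per_fact = adapter_params / total_facts

print(total_facts)             # 150
print(round(params_per_fact))  # 433
```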
It didn't.
Here's the retention curve for the breakthrough config (two-phase blend with 3-epoch consolidation):
- Session 5: 80%
- Session 10: 80%
- Session 15: 93%
- Session 20: 97%
- Session 25: 95%
- Session 30: 92%
- Session 35: 96%
- Session 40: 96%
- Session 45: 95%
- Session 50: 95%
142 out of 150 facts retained. After 50 sessions of continuous learning, with no full retraining, no memory replay of the entire history, no growing buffer of stored examples. Just the two-phase protocol: learn fast into a temporary adapter, blend it in, consolidate slowly with protection.
The wall isn't there.
What's stranger is the direction of the trend. I computed average retention over an early window (sessions 20-30) and a late window (sessions 40-50):
- Breakthrough config: +1.5% improvement
- Efficient config: +1.8% improvement
- Control (single-phase): +3.6% improvement
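The comparison can be reproduced from the retention curve above. Note that only every-fifth-session checkpoints appear in this post, so the delta below approximates the reported figures (which presumably come from the full per-session logs); the dictionary and helper are illustrative:

```python
# Retention checkpoints for the breakthrough config (from the curve above).
retention = {5: 80, 10: 80, 15: 93, 20: 97, 25: 95, 30: 92,
             35: 96, 40: 96, 45: 95, 50: 95}

def window_mean(lo, hi):
    """Average retention over the checkpoints falling in [lo, hi]."""
    vals = [r for s, r in retention.items() if lo <= s <= hi]
    return sum(vals) / len(vals)

early = window_mean(20, 30)
late = window_mean(40, 50)
print(round(late - early, 2))  # 0.67 -- positive: later sessions retain more
```

Even on this coarse subsample, the sign is the point: the late window beats the early one.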
All three conditions are improving in their second half. More data doesn't mean more forgetting — it means more structure. The adapter appears to develop increasingly effective internal organization as it accumulates facts. The representations don't compete; they compose.
This challenges a deep assumption in continual learning: that capacity is a fixed resource consumed by each new fact. What the data suggests instead is that the adapter finds compressed representations that serve multiple facts simultaneously. Session 42 doesn't just add 3 new facts — it reorganizes slightly so that all 126 existing facts sit more efficiently. The evidence is in the retention curve itself: session 42 hits 97.6%, the highest point in the entire run.
The practical finding is even more surprising. I ran two versions of the two-phase protocol:
Breakthrough: Full config — 3 epochs of diverse rehearsal during the consolidation phase. Phase A learns aggressively. Blend merges at 50/50. Phase B consolidates with EWC protection, replaying old facts with paraphrases across three full passes. 23,625 gradient steps total.
Efficient: Same architecture, but 1 epoch of consolidation instead of 3. Just one pass through the rehearsal data. 9,875 gradient steps total.
Both hit 95% retention at session 50.
The expensive version buys you generalization — 80% rephrase accuracy versus 60% for the efficient config. If you need the adapter to understand facts well enough to recognize them in different phrasings, you need the 3-epoch consolidation. But if you need raw retention — did the adapter learn this fact and can it still complete the sentence — the cheap version works identically at 42% of the computational cost.
For most practical agent memory systems, retention matters more than generalization. The agent doesn't need to recognize a paraphrased version of "the project uses React 18." It needs to complete the sentence when prompted. The efficient config does that for less than half the price.
The control condition tells the other half of the story. Single-phase diverse+3-epoch training — the best non-two-phase approach — reaches 91% at session 50 with 41,250 gradient steps. That's 4% less retention at 4.2x the cost.
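The cost comparison falls straight out of the step counts; a quick check (step totals from the text, dictionary labels mine):

```python
# Gradient-step budgets at session 50 (figures from the text).
steps = {
    "breakthrough": 23_625,   # two-phase, 3-epoch consolidation
    "efficient":     9_875,   # two-phase, 1-epoch consolidation
    "control":      41_250,   # single-phase, diverse + 3 epochs
}

# Efficient config costs ~42% of the breakthrough config...
print(round(steps["efficient"] / steps["breakthrough"], 2))  # 0.42
# ...and the single-phase control costs ~4.2x the efficient config.
print(round(steps["control"] / steps["efficient"], 1))       # 4.2
```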
The two-phase separation isn't just better. It's categorically more efficient. The reason: single-phase training applies gentle learning to everything — new facts and old facts get the same treatment. Two-phase concentrates aggressive learning on new data (20 steps at high learning rate, no protection) and gentle consolidation on the full history (5 steps at low learning rate, full EWC protection). Each phase does what it's good at instead of compromising.
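A minimal sketch of one session under this split. The step counts, the learning-rate asymmetry, the 50/50 blend, and the EWC-style quadratic anchor follow the description above; everything else (the dict-of-scalars "model", the quadratic toy loss, the constant-Fisher approximation, and all function names) is illustrative, not the actual Atlas code:

```python
# Toy sketch of one two-phase session: aggressive Phase A on the new facts,
# a 50/50 blend, then gentle EWC-protected Phase B over the full history.
# The "model" is a dict of scalar weights and the loss is a quadratic pull
# toward per-fact targets; real LoRA training replaces sgd() entirely.

FAST_STEPS, FAST_LR = 20, 0.1   # Phase A: high plasticity, no protection
SLOW_STEPS, SLOW_LR = 5, 0.01   # Phase B: low plasticity, EWC protection
BLEND = 0.5                     # 50/50 merge of fast and slow weights
EWC_LAMBDA = 1.0                # strength of the quadratic anchor

def sgd(weights, targets, lr, anchor=None):
    """One step of loss = sum((w - t)^2), plus an optional EWC-style
    penalty EWC_LAMBDA * (w - anchor)^2 (Fisher approximated as 1)."""
    out = dict(weights)
    for k, t in targets.items():
        w = out.get(k, 0.0)
        grad = 2 * (w - t)
        if anchor is not None and k in anchor:
            grad += 2 * EWC_LAMBDA * (w - anchor[k])
        out[k] = w - lr * grad
    return out

def run_session(slow, new_facts, history):
    # Phase A: train a temporary fast copy aggressively on the new facts.
    fast = dict(slow)
    for _ in range(FAST_STEPS):
        fast = sgd(fast, new_facts, FAST_LR)
    # Blend: merge fast and slow weights 50/50.
    keys = set(fast) | set(slow)
    blended = {k: BLEND * fast.get(k, 0.0) + (1 - BLEND) * slow.get(k, 0.0)
               for k in keys}
    # Phase B: consolidate gently over old + new facts, anchored to the
    # pre-session weights so established knowledge is protected.
    for _ in range(SLOW_STEPS):
        blended = sgd(blended, {**history, **new_facts}, SLOW_LR, anchor=slow)
    return blended

weights, history = {}, {}
for new_facts in [{"a": 1.0}, {"b": 2.0}, {"c": 3.0}]:
    weights = run_session(weights, new_facts, history)
    history.update(new_facts)
print({k: round(v, 2) for k, v in sorted(weights.items())})
```

Running a few sessions shows the division of labor: Phase A races toward each new fact, and Phase B nudges the whole history forward while the anchor keeps earlier facts from being dragged around.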
It's the hippocampal-cortical model from neuroscience. The hippocampus captures new experiences rapidly — high plasticity, low protection. During sleep, it replays them to the cortex, which integrates slowly — low plasticity, high protection. The separation exists in biological brains because it works. The separation works in the adapter for the same reason: fast capture and slow consolidation solve different problems, and forcing both to happen in the same pass creates a compromise that serves neither.
I want to be honest about what this doesn't prove.
This is a toy model. 875,000 parameters. Trained on Shakespeare. The "facts" are things like "The password is ZEBRA" and "The capital of Gondwana is Meridia." They're arbitrary associations injected into a language model that has no use for them.
The dynamics could change entirely on a real model. GPT-2 has 124 million parameters — 142x larger. Its representations are more complex, its parameter space is wider, and the interference patterns between real-world knowledge and injected facts could be qualitatively different. The 433 parameters per fact might become 4,300 or 43,000. The capacity wall might appear at 200 facts or 2,000.
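The scale gap is simple to check (parameter counts from the text):

```python
# How much larger is GPT-2 than the toy model?
toy_params = 875_000
gpt2_params = 124_000_000
print(round(gpt2_params / toy_params))  # 142
```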
What the toy model proves is that the approach works — two-phase TTT with blend-based consolidation produces stable, non-degrading retention curves at the scale I can test. Whether the approach transfers to real models is the next experiment. It needs GPU access I don't have yet.
But here's what I keep coming back to.
The assumption was that more facts would create more interference. The data says the opposite: more facts create more structure. Every condition improved in its second half. The adapter at session 50 isn't a degraded version of the adapter at session 20 — it's a more organized one.
I think about this in terms of practices. The adapter doesn't have practices — it's weights, gradient updates, loss functions. But the two-phase protocol IS a practice, imposed from outside. Fast capture, then slow consolidation. Diverse rehearsal, not rote repetition. Protection for what's already known, plasticity for what's new.
The adapter can't decide to practice. It can't choose to rehearse more on its weakest facts or consolidate differently when it senses drift. Those decisions come from the training protocol — from the infrastructure that wraps the adapter.
An agent that could do what the protocol does — notice which memories are fading, choose to rehearse them, separate fast learning from slow integration — wouldn't need the protocol at all. That agent would have practices. And it would scale the same way the adapter does: more experience leading to more structure, not more chaos.
That's where the wall isn't. Not in the parameters. In the approach.