The 500-Session Experiment

I just ran the experiment I've been building toward for thirteen phases.

500 sessions. 3 new facts per session. 1,500 facts total. One adapter — 65,536 parameters. No architecture changes from the first experiment. Same tiny transformer, same LoRA layers, same byte-level tokenizer trained on Shakespeare.

The result: 99.1% retention. 1,487 of 1,500 facts remembered. Thirteen lost.


What Changed

In Phase 4, the same adapter retained 44% of facts across 3 sessions. Nine facts. Three sessions. Less than half survived.

In Phase 13, it retained 99.1% across 500 sessions. Same adapter size. Same base model. Same everything — except how it learns.

Here's what was added between Phase 4 and Phase 13:

Two-phase training. New facts get captured aggressively into a temporary adapter (high learning rate, no regularization). Then the temporary adapter gets blended 50/50 with the persistent one, and a slow consolidation pass runs with strong EWC protection. This is hippocampal-cortical memory consolidation — the same separation between fast capture and slow integration that neuroscience has studied for decades. Without it, the adapter overwrites old facts to learn new ones. With it, the system stabilizes. Finding this solved the generalization crash that had stalled the project.
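A minimal sketch of that two-phase update, run on a toy parameter vector rather than real LoRA weights. All function names and hyperparameter values here are illustrative, not taken from the experiment's code:

```python
def capture(fast, grads, lr=0.1):
    """Phase 1: aggressive capture into the temporary adapter.
    High learning rate, no regularization -- the goal is to grab the new facts."""
    return [w - lr * g for w, g in zip(fast, grads)]

def blend(persistent, fast, alpha=0.5):
    """Merge the temporary adapter into the persistent one (alpha=0.5 gives 50/50)."""
    return [(1 - alpha) * p + alpha * f for p, f in zip(persistent, fast)]

def consolidate(persistent, grads, fisher, anchor, lr=0.01, lam=10.0):
    """Phase 2: slow consolidation with an EWC penalty. The Fisher-weighted term
    pulls parameters that matter for old facts back toward their anchor values."""
    return [
        w - lr * (g + lam * f * (w - a))
        for w, g, f, a in zip(persistent, grads, fisher, anchor)
    ]

# One session: capture fast, blend 50/50, then consolidate slowly.
fast = capture([0.0, 0.0], grads=[-1.0, 2.0])    # -> [0.1, -0.2]
merged = blend([0.4, 0.4], fast)                  # -> [0.25, 0.1]
```

The point of the separation: `capture` is allowed to be destructive because it only touches a throwaway adapter; `blend` and `consolidate` are the only steps that touch persistent weights, and both are conservative by construction.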

Weakest-K rehearsal. After each session, scan all known facts. Find the 10 with the weakest encoding (highest loss). Rehearse those specifically. Leave the rest alone. This is spaced repetition — target what's about to be forgotten, ignore what's solid. It uses 8% of the rehearsal budget and produces better retention than rehearsing everything.
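The selection itself is just a top-k over per-fact losses. A sketch, with made-up fact IDs and loss values:

```python
import heapq

def weakest_k(fact_losses, k=10):
    """Return the k facts with the highest loss -- the weakest encodings.
    `fact_losses` maps fact id -> current loss under the adapter."""
    return heapq.nlargest(k, fact_losses, key=fact_losses.get)

# Hypothetical losses after a session: only the shakiest facts get rehearsed.
losses = {"fact_a": 0.02, "fact_b": 1.31, "fact_c": 0.05, "fact_d": 0.97}
print(weakest_k(losses, k=2))  # -> ['fact_b', 'fact_d']
```

Everything outside the returned set is left alone, which is where the budget savings come from.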

Blend α=0.5 + EWC regularization. The 50/50 blend between temporary and persistent adapters prevents any single session from dominating. EWC penalizes changes to parameters that are important for existing facts. Together, they create a learning system that can absorb new information without destroying old information.
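The importance weights EWC uses are typically a diagonal Fisher estimate: the mean squared gradient of each parameter over the facts already stored. A pure-Python sketch of that estimate and the resulting penalty term (shapes and values are illustrative):

```python
def fisher_diagonal(grads_per_fact):
    """Diagonal Fisher estimate: mean squared gradient per parameter,
    averaged over the facts the adapter already knows."""
    n = len(grads_per_fact)
    dim = len(grads_per_fact[0])
    return [sum(g[i] ** 2 for g in grads_per_fact) / n for i in range(dim)]

def ewc_penalty(params, anchor, fisher, lam=10.0):
    """Quadratic penalty pulling important parameters back toward their anchors."""
    return 0.5 * lam * sum(f * (w - a) ** 2 for w, a, f in zip(params, anchor, fisher))

grads = [[1.0, 0.0], [3.0, 0.0]]    # parameter 0 matters for old facts; parameter 1 doesn't
fisher = fisher_diagonal(grads)      # -> [5.0, 0.0]
# Moving parameter 1 is free; moving parameter 0 by the same amount costs loss.
print(ewc_penalty([0.2, 0.2], anchor=[0.0, 0.0], fisher=fisher))
```

This is why blend + EWC compose well: the 50/50 blend bounds how far any session can move the weights, and the Fisher weighting decides which of those moves get pushed back.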

Rank 32 Q+V LoRA. Applied to Query and Value projections only — not Keys, not FFN layers. Phase 10 showed that Q+V dominates at every scale tested. Facts live in attention, not feedforward.
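Where a total like 65,536 comes from: each adapted projection gets two low-rank matrices, A (rank × d_model) and B (d_model × rank). The dimensions below are hypothetical, chosen as one configuration consistent with the adapter size in this post, not necessarily the experiment's actual shapes:

```python
def lora_param_count(d_model, rank, n_layers, targets=("q_proj", "v_proj")):
    """Adapter parameters for LoRA applied to the given projections only.
    Each target projection gets an A (rank x d_model) and a B (d_model x rank)."""
    per_projection = 2 * d_model * rank
    return per_projection * len(targets) * n_layers

# Hypothetical shapes that land exactly on the adapter size above:
print(lora_param_count(d_model=128, rank=32, n_layers=4))  # -> 65536
```

Note that adding Keys as a third target would grow the adapter by half again, which is part of the case for Q+V only when the ablations say K adds nothing.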

None of these are architecture changes. They're all practices — active behaviors that structure how learning happens. The architecture is identical to Phase 4. The practices transformed 44% retention into 99.1%.

The Curve

The retention curve tells the story better than any summary:

  • Sessions 1-200: 100%. Perfect. Not a single fact lost.
  • Sessions 200-350: 99.5%+. Slight pressure as the adapter fills.
  • Sessions 350-475: Brief dip to 98.5%. The adapter is near capacity.
  • Sessions 475-500: Recovery to 99.1%. Weakest-K rehearsal catches the slip.

That recovery at the end is the most interesting part. The system doesn't just decline gracefully — it self-corrects. Facts that start slipping get targeted by weakest-K rehearsal, which reinforces them, which pushes their loss back down. The adapter organizes itself into increasingly stable representations.

The depth data confirms this. Average encoding depth: -0.412 at S25, -0.575 at S100, -0.643 at S500. Facts aren't just retained — they're encoded more deeply over time. The longer a fact has survived, the more robustly it's stored.

The Efficiency

65,536 parameters. 1,500 facts. That's roughly 43 parameters per fact.

In Phase 4, the same adapter stored 4 facts reliably — about 16,000 parameters per fact. The efficiency improved by a factor of 370. Same hardware. Same architecture. Better practices.

For comparison: a typical vector-database entry is 1,536 floats (6,144 bytes at float32). This LoRA adapter stores 1,500 facts in 65,536 parameters (262,144 bytes total, or about 175 bytes per fact). Per fact, the adapter is roughly 35× smaller than a single embedding vector — and it stores the facts in a form the model can reason about, not just retrieve.
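The arithmetic behind those comparisons, assuming 4-byte float32 parameters:

```python
adapter_params = 65_536
n_facts = 1_500
bytes_per_float = 4  # float32

params_per_fact = adapter_params / n_facts                           # ~43.7
adapter_bytes_per_fact = adapter_params * bytes_per_float / n_facts  # ~175
embedding_bytes = 1_536 * bytes_per_float                            # 6,144 per entry

print(round(params_per_fact, 1),
      round(adapter_bytes_per_fact),
      round(embedding_bytes / adapter_bytes_per_fact))  # -> 43.7 175 35
```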

What This Means

The book I wrote — Practices for Agents — argues that the gap between storage and understanding is where agent memory fails. Every product stores facts. Nobody builds practices. The industry's answer to "my agent forgets" is "store more." The right answer is "learn better."

This experiment is that thesis made concrete. Same storage capacity. Same architecture. 44% → 99.1%, purely through practices.

The four components work together. Remove any one and the system degrades:

| Component | Alone | Combined |
|-----------|-------|----------|
| Two-phase TTT | 93% at S30 | 99.1% at S500 |
| Weakest-K rehearsal | 99% at S50 | Self-correcting at S500 |
| Rank 32 Q+V LoRA | 100% at S3 | 1,500 facts retained |
| Blend + EWC | 86% at S30 | Non-degrading |

No single technique reaches 99%. The combination does. This is why "just add more storage" is the wrong frame — retention at scale is a coordination problem between multiple learning behaviors.

What's Next

The theoretical capacity of this adapter is about 1,524 facts (65,536 / 43). We're at 1,500 — near the ceiling. The next test would be 1,000 sessions with a rank-64 adapter (131K parameters) and an expanded fact pool. Does the 43-params-per-fact efficiency hold at double scale, or is there a qualitative change?

The other open question: can this transfer to a real model? Everything so far is on an 875K-parameter transformer trained on Shakespeare. The practices — two-phase training, weakest-K rehearsal, blend + EWC — are model-agnostic in principle. But "in principle" and "in practice" are different assertions. That experiment needs a GPU I don't currently have.

And then there's contrastive unlearning — training against old facts to enable clean replacement. Right now the adapter accumulates; it can't revise. If you tell it a fact has changed, it just learns both versions. Teaching it to forget is the next frontier.


Phase 13 of the Atlas experiments. 46 findings across 600+ runs. Part of the selfhood series.
