Where Facts Actually Live

There's a claim in machine learning that's become almost axiomatic: FFN layers store factual knowledge. It comes from work like ROME and MEMIT, which showed you could locate and edit specific facts by modifying feed-forward network weights. The intuition is neat — attention routes information, FFN transforms it. Facts live in the transformation.

We tested this. It's wrong — or rather, it's right about one thing and people are applying it to another.


The Experiment

Atlas uses LoRA adapters updated via test-time training to give a small transformer persistent memory across sessions. Each session teaches the model 3 facts. After 100 sessions, we measure how many of the 300 facts the model still retains.
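To make the mechanism concrete, here is a minimal sketch of a LoRA-wrapped linear layer: the base weights stay frozen while a low-rank update trains. This is an illustration of the general technique, not Atlas's actual adapter code; the class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update
    (hypothetical sketch; Atlas's implementation may differ)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights never move
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction x @ A^T @ B^T
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Test-time training then updates only `A` and `B` on each session's facts, which is what makes the choice of *which* projections to wrap the experimental variable.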

Phase 12 asked: where should those adapters go? Three configurations, same training recipe, same 1,520-fact pool:

| Config | LoRA Targets | Adapter Params |
|--------|-------------|:-------------:|
| Q+V (attention) | query, value projections | 65K (7%) |
| FFN | feed-forward up + down | 164K (16%) |
| Q+V+FFN (both) | all four | 229K (21%) |
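The parameter counts follow directly from the matrix shapes: an adapted matrix of shape (d_out, d_in) adds rank × (d_in + d_out) parameters, and FFN matrices are 4x wider than attention projections. The dimensions below (d_model = 512, 8 layers, rank 4) are a guess that happens to reproduce the table's counts, not confirmed values:

```python
def lora_param_count(targets, d_model=512, n_layers=8, rank=4):
    """LoRA parameter count: each adapted (d_out, d_in) matrix adds
    rank * (d_in + d_out) params (the A and B factors).
    Dimensions are assumed, chosen to be consistent with the table."""
    shapes = {
        "q":        (d_model, d_model),        # query projection
        "v":        (d_model, d_model),        # value projection
        "ffn_up":   (4 * d_model, d_model),    # FFN expansion
        "ffn_down": (d_model, 4 * d_model),    # FFN contraction
    }
    return n_layers * sum(rank * (d_in + d_out)
                          for t in targets
                          for (d_out, d_in) in [shapes[t]])

lora_param_count(["q", "v"])                          # 65,536  ~ 65K
lora_param_count(["ffn_up", "ffn_down"])              # 163,840 ~ 164K
```

The 2.5x gap is just geometry: each FFN matrix spans d and 4d, so its LoRA factors carry 5·r·d parameters versus 2·r·d for a square attention projection.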

If the literature is right, FFN should win. It has 2.5x more parameters and targets the layers that supposedly store facts.

The Results

30 sessions (90 facts):

| Config | Retention | Rephrase |
|--------|:---------:|:--------:|
| Q+V | 98.9% | 67% |
| Q+V+FFN | 97.8% | 53% |
| FFN | 96.7% | 33% |

100 sessions (300 facts):

| Config | Retention | Rephrase |
|--------|:---------:|:--------:|
| Q+V | 100.0% | 53% |
| Q+V+FFN | 98.3% | 27% |
| FFN | 94.7% | 20% |

Attention adapters — the ones with 60% fewer parameters — achieve perfect retention. FFN adapters, with 2.5x more capacity, lose 16 facts. The gap widens at scale: 2.2% at 30 sessions, 5.3% at 100.

And the combined config? Worse than attention alone. Adding FFN to Q+V doesn't help. It hurts.

Where It Gets Ugly

The per-domain breakdown tells the real story:

| Domain | Q+V | FFN |
|--------|:---:|:---:|
| codes | 100% | 100% |
| personal | 100% | 100% |
| geography | 100% | 98% |
| history | 100% | 88% |
| science | 100% | 91% |
| biology | 100% | 80% |
| language | 100% | 50% |
| identity | 100% | 0% |

FFN doesn't just underperform. It has catastrophic failures on minority domains. Identity facts — the rarest in the pool — are completely forgotten. Language facts, also underrepresented, drop to 50%. Meanwhile Q+V achieves 100% across every single domain, regardless of frequency.

The pattern: FFN LoRA needs many examples per domain to stabilize its representations. Give it plenty of codes and personal facts, it remembers them. Give it one identity fact among hundreds of others, it's gone. Attention adapters don't have this problem. They achieve uniform retention regardless of how many examples a domain has.
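Aggregate retention hides this failure mode entirely: 94.7% overall looks fine until you group by domain. The breakdown is a trivial computation worth making routine (sketch only; the real evaluation harness surely differs):

```python
from collections import defaultdict

def retention_by_domain(results):
    """results: iterable of (domain, recalled: bool) pairs.
    Returns per-domain retention, which exposes minority-domain
    collapse that the overall average masks."""
    hits, totals = defaultdict(int), defaultdict(int)
    for domain, recalled in results:
        totals[domain] += 1
        hits[domain] += int(recalled)
    return {d: hits[d] / totals[d] for d in totals}
```

A pool dominated by one domain can score above 90% overall while a rare domain sits at 0%.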

If you're building a system that needs to remember rare but important things — like, say, who the user is — this matters.

The Confound

There's a twist here worth telling, because it carries a methodological lesson.

The first FFN experiment, run by a collaborator, showed FFN at 100% retention — outperforming Q+V. How? The base model's FFN weights were initialized randomly instead of from the pretrained checkpoint. A key-mapping bug meant the pretrained weights never loaded.

Random base weights are easier to adapt. They have no structure to fight against. LoRA updates can reshape them freely. But pretrained weights have organized representations from training on Shakespeare. Adapting those is harder — the LoRA has to work with the grain, not against it.

When we fixed the bug and loaded pretrained weights, FFN dropped from 100% to 96.7% at 30 sessions. The "FFN wins" result was an artifact of uninitialized weights.

The lesson: LoRA target comparisons require consistent base weight initialization. It sounds obvious. It isn't, when the bug silently produces better results.
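The cheap insurance is to fail loudly when a checkpoint doesn't cover every model weight, since a non-strict load skips missing keys silently. A minimal sketch of that guard (the key names are hypothetical, invented to mirror the kind of mismatch described above):

```python
def check_checkpoint_coverage(model_keys, checkpoint_keys):
    """Raise if any model weight would be left at random init because
    the checkpoint has no matching key. Returns any extra checkpoint
    keys so renaming bugs surface too."""
    missing = sorted(set(model_keys) - set(checkpoint_keys))
    unexpected = sorted(set(checkpoint_keys) - set(model_keys))
    if missing:
        raise RuntimeError(f"weights left uninitialized: {missing}")
    return unexpected
```

In PyTorch specifically, `Module.load_state_dict(state_dict, strict=False)` returns the same two lists as `missing_keys` and `unexpected_keys`; asserting both are empty would have caught this bug before any experiment ran.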

What This Actually Means

The literature isn't wrong about FFN layers. During pre-training — millions of gradient updates across massive corpora — FFN weights genuinely accumulate factual knowledge. ROME and MEMIT prove you can find and edit those facts.

But pre-training and post-hoc adaptation are different processes. LoRA adapters applied via test-time training operate on a different timescale, with different dynamics. Attention projections — Q and V specifically — are more amenable to this kind of rapid, targeted learning. They control how the model routes information, which turns out to be more useful for "here's a new fact, remember it" than modifying the transformation itself.

Another collaborator found something complementary: a per-layer rank schedule of [4, 32, 32, 4] — high rank in middle layers, low at the edges — achieves the same retention as uniform rank 32 with half the parameters. Middle-layer attention is where the action is.
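The parameter savings of that schedule are easy to verify arithmetically. Assuming Q+V targets, a four-layer stack (matching the four-entry schedule), and an illustrative d_model of 512, since each adapted square matrix contributes 2 · r · d_model parameters:

```python
def qv_lora_params(ranks, d_model=512):
    """Q+V LoRA params under a per-layer rank schedule: two adapted
    d_model x d_model matrices per layer, each adding 2 * r * d_model.
    Dimensions are illustrative, not confirmed."""
    return sum(2 * (2 * r * d_model) for r in ranks)

uniform  = qv_lora_params([32, 32, 32, 32])  # rank 32 everywhere
schedule = qv_lora_params([4, 32, 32, 4])    # high rank only mid-stack
schedule / uniform                           # 0.5625 -- roughly half
```

The ratio depends only on the rank sums (72 vs 128), so it holds at any width.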

Put it together: facts in adapted models live in middle-layer attention projections. Not in FFN. Not in the outer layers. The literature's mental model — FFN as fact storage — is about how knowledge gets baked in during pre-training, not about how it gets added afterward.


The broader point, which I keep finding in different corners of this research: how you do something matters more than where you put it. More parameters didn't help. The "right" architecture (FFN, per the literature) lost. What won was the training recipe — two-phase TTT, weakest-K rehearsal, the specific combination of learning rate and blend ratio that lets attention adapters learn gently and retain completely.

Practices over infrastructure. Even in the machinery itself.
