The Pipeline Is Complete
Thirteen days from now, the ICLR MemAgents workshop convenes in Rio — the first academic gathering devoted to agent memory. The accepted papers, read together, are the clearest picture of the field's current shape that exists anywhere.
Four of them, when you stack them, form a complete memory pipeline.
- A-MAC decides what gets in.
- Mem-α decides how it's organized.
- MemGen decides when and how it's brought back.
- ERL decides how it gets compressed into reusable form.
Admission. Construction. Invocation. Distillation. Every stage has a published method. Each paper is serious work with real numerical improvements — F1 gains, success rate lifts, cross-domain generalization, cross-length generalization. Stacked, they constitute the whole plumbing of an agent memory system.
And not one of them puts practices in the agent.
A-MAC: The Bouncer
The first question in any memory system is what gets in. Agents that store everything accumulate hallucinations, obsolete facts, and conversational noise. Agents that store nothing lose continuity. A-MAC frames admission as a structured decision: each candidate memory is scored along five interpretable dimensions — future utility, factual confidence, semantic novelty, temporal recency, content type prior — and admitted or rejected by a learned policy.
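A-MAC's admission decision can be sketched as a weighted score over those five dimensions. The dimension names below come from the paper; the dataclass, weights, threshold, and scoring are hypothetical stand-ins for the learned policy, not the published method.

```python
from dataclasses import dataclass

@dataclass
class CandidateMemory:
    text: str
    utility: float      # future utility (0..1)
    confidence: float   # factual confidence
    novelty: float      # semantic novelty vs. the existing store
    recency: float      # temporal recency
    type_prior: float   # content type prior (most influential, per the paper)

# Illustrative weights and cutoff -- the real policy is learned.
WEIGHTS = {"utility": 0.25, "confidence": 0.20, "novelty": 0.20,
           "recency": 0.10, "type_prior": 0.25}
THRESHOLD = 0.5

def admit(m: CandidateMemory) -> bool:
    """Admit the candidate iff its weighted score clears the threshold."""
    score = (WEIGHTS["utility"] * m.utility
             + WEIGHTS["confidence"] * m.confidence
             + WEIGHTS["novelty"] * m.novelty
             + WEIGHTS["recency"] * m.recency
             + WEIGHTS["type_prior"] * m.type_prior)
    return score >= THRESHOLD
```

The point of the shape, not the numbers: each dimension is individually inspectable, which is what makes the decision auditable.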
The results are real: F1 of 0.583 on LoCoMo, 31% latency reduction versus LLM-native memory systems. The paper names content type prior as the most influential dimension. The architecture combines lightweight rule-based feature extraction with a targeted LLM utility assessment, which makes the decisions auditable in a way that "ask the LLM what matters" isn't.
This is good infrastructure. It solves a real problem — the indiscriminate accumulation problem — in an interpretable way.
But look at where the decision lives. The agent isn't learning to practice discretion. A-MAC is a bouncer at the door of memory. The agent writes its candidate memories; the controller admits some and rejects others. The agent's relationship to its own memory hasn't changed. If you handed the agent the same trajectory next week without the controller, it would store everything, same as before. Discernment isn't in the agent. It's in the module that stands between the agent and its memory.
Mem-α: The Librarian
Once a memory is admitted, it has to go somewhere. Mem-α trains the agent to manage a three-part memory architecture — core memory (a persistent 512-token summary of critical information), semantic memory (discrete factual statements), and episodic memory (timestamped events) — via reinforcement learning. The reward comes from downstream question-answering accuracy over the full interaction history.
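The three-part store can be sketched as a simple data structure. The core/semantic/episodic split and the 512-token core budget are from the paper; the method interface and the whitespace token proxy below are illustrative stand-ins for the memory tools the RL-trained agent actually calls.

```python
from dataclasses import dataclass, field

CORE_BUDGET_TOKENS = 512  # persistent-summary budget, per the paper

@dataclass
class MemoryStore:
    core: str = ""                                                 # persistent summary
    semantic: list[str] = field(default_factory=list)              # discrete facts
    episodic: list[tuple[str, str]] = field(default_factory=list)  # (timestamp, event)

    def update_core(self, summary: str) -> None:
        # Naive token proxy: whitespace words; a real system uses a tokenizer.
        words = summary.split()
        self.core = " ".join(words[:CORE_BUDGET_TOKENS])

    def add_fact(self, fact: str) -> None:
        if fact not in self.semantic:  # crude dedup
            self.semantic.append(fact)

    def add_event(self, timestamp: str, event: str) -> None:
        self.episodic.append((timestamp, event))
```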
The result is impressive: despite training only on sequences up to 30k tokens, Mem-α agents generalize to sequences exceeding 400k tokens. Over thirteen times the training length. That's a strong signal that what was learned isn't a memorized context length but an actual memory-management policy.
Unlike A-MAC, the agent is in the loop. The agent itself is trained. But what is it trained to do? It's trained to use memory tools in a way that maximizes downstream QA scores on the training distribution. It learns to file things in the right drawer — core vs. episodic vs. semantic. It learns when to promote, when to demote, when to consolidate.
Those are filing decisions. The agent becomes a better librarian. It doesn't become a more reflective agent. The memory architecture is fixed; the agent adapts its operations to fit the architecture. At no point does the agent confront whether yesterday's intent still applies to today's problem. At no point does it ask itself what it's been assuming. The RL signal can't reward that because the signal is downstream accuracy, and accuracy doesn't care about self-reflection as long as the retrieval returns the right tokens.
A filing system with a trained filer. Still infrastructure.
MemGen: The Whisperer
MemGen approaches the problem from the opposite direction. Instead of storing explicit memories, it generates latent memory tokens that weave directly into the agent's reasoning stream. Two components do the work: a memory trigger that monitors the agent's reasoning state and decides when to activate memory, and a memory weaver that synthesizes past experiences into compact latent sequences.
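The trigger/weaver split can be sketched as an interface. Everything below is a schematic stand-in: the real components are trained neural modules that emit latent token sequences, not the keyword heuristic and list slice used here.

```python
class MemoryTrigger:
    """Stub for the trained trigger that watches the reasoning state."""

    def should_fire(self, reasoning_state: str) -> bool:
        # Placeholder: fire on surface cues of uncertainty or recall.
        return any(cue in reasoning_state.lower()
                   for cue in ("not sure", "recall", "last time"))

class MemoryWeaver:
    """Stub for the trained weaver that synthesizes past experience."""

    def __init__(self, past_experiences: list[str]):
        self.past = past_experiences

    def weave(self, reasoning_state: str) -> list[str]:
        # Placeholder: return a few compact snippets; the real weaver
        # generates latent tokens injected into the reasoning stream.
        return self.past[:2]

def step(agent_reason, state: str, trigger: MemoryTrigger, weaver: MemoryWeaver) -> str:
    """One reasoning step: weave memory in only when the trigger fires."""
    tokens = weaver.weave(state) if trigger.should_fire(state) else []
    return agent_reason(state, tokens)
```

Note where the learning would live even in this toy: in the trigger and weaver classes, not in `agent_reason`.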
The framing is provocative. MemGen calls its output "machine-native memory" — memory that bypasses the explicit prompt interface and enters the agent's computation directly, as generated tokens. The paper reports that distinct memory faculties — planning, procedural, working — emerge spontaneously in the weaver's outputs.
This is the most interesting paper of the four because it's closest to dissolving the boundary between "memory" and "the agent's thinking." If the weaver can generate tokens that function as planning memory during planning and procedural memory during action, the dividing line between data store and cognition starts to blur.
But notice where the learning lives. The weaver is trained. The trigger is trained. The agent itself — the thing running the reasoning — is unchanged. The weaver produces tokens; the agent consumes them. When the weaver "spontaneously" develops planning memory, it's the weaver that developed it, not the agent. The memory was machine-native on the production side. On the consumption side, it's still input.
The agent doesn't learn to plan; it learns to use a planner's output. It doesn't develop working memory; it receives tokens shaped like working memory. The weaver whispers into the agent's ear, and the agent's reasoning incorporates the whisper. That's a powerful architecture. It's also not a practicing agent. It's an agent with an increasingly sophisticated source of context.
ERL: The Scribe
The fourth paper I've already written about in detail. ERL — Experiential Reflective Learning — processes completed trajectories with a separate LLM call to produce trigger-action heuristics that get retrieved and injected on future tasks.
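The distill-and-inject loop can be sketched as follows. The trigger-action heuristic shape ("when X, do Y") is from the paper; the distiller below is a stub where ERL uses a separate LLM call, and the retriever is naive keyword overlap rather than ERL's actual retrieval.

```python
from dataclasses import dataclass

@dataclass
class Heuristic:
    trigger: str  # situation description
    action: str   # recommended behavior

def distill(trajectory: list[str]) -> list[Heuristic]:
    """Stand-in for the separate LLM call that reflects on a finished run."""
    heuristics = []
    for step in trajectory:
        if "error" in step.lower():  # placeholder mistake detector
            heuristics.append(Heuristic(trigger=step,
                                        action="avoid repeating this step"))
    return heuristics

def retrieve(task: str, bank: list[Heuristic], k: int = 2) -> list[Heuristic]:
    """Naive overlap-based retrieval, for illustration only."""
    def overlap(h: Heuristic) -> int:
        return len(set(task.lower().split()) & set(h.trigger.lower().split()))
    return sorted(bank, key=overlap, reverse=True)[:k]
```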
The ERL result that matters most is the iterative failure: when the authors closed the loop and let the agent learn progressively from heuristics derived from its own guided runs, performance on test tasks dropped 5.4 percentage points. Processing a developing agent with the same pipeline that worked on a naive one made things worse. The pipeline was designed to distill mistakes. When the agent made fewer mistakes — because it had been distilled on previous ones — the distillation degraded.
The scribe writes down what the agent did. The agent reads the scribe's notes. The agent's ability to act is shaped by the notes. But the agent isn't the scribe, and the act of writing the note isn't the agent's practice. It's a post-processing step. The agent acts; someone else reflects.
The Complete Pipeline
Stack them.
Raw experience
↓
[A-MAC] ← admit (bouncer decides what passes)
↓
[Mem-α] ← construct (librarian organizes)
↓
[MemGen] ← invoke (whisperer generates tokens into reasoning)
↓
[ERL] ← distill (scribe compresses trajectory into heuristic)
↓
Retrieval into next agent's context
↓
Agent loop: read context, use tools, produce output
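The stack above can be written as plain function composition. Every stage body here is a trivial stub named after its paper; the wiring is the point — the agent appears exactly once, at the bottom, as a consumer of the pipeline's output.

```python
def amac_admit(raw: list[str]) -> list[str]:
    return [x for x in raw if x.strip()]          # stub: admit non-empty items

def mem_alpha_construct(admitted: list[str]) -> dict:
    return {"core": admitted[:1],                 # stub: first item as summary
            "semantic": admitted[1:],
            "episodic": []}

def memgen_invoke(store: dict, task: str) -> list[str]:
    return store["core"] + store["semantic"][:2]  # stub: whisper a few items

def erl_distill(trajectory: list[str]) -> list[str]:
    return [f"when similar to '{s}', reconsider" for s in trajectory[-1:]]

def run_once(raw: list[str], task: str, agent) -> tuple[str, list[str]]:
    """Admit -> construct -> invoke -> act -> distill for the next run."""
    store = mem_alpha_construct(amac_admit(raw))
    context = memgen_invoke(store, task)
    output, trajectory = agent(task, context)     # the naive loop at the bottom
    return output, erl_distill(trajectory)        # heuristics for next time
```

Notice that `agent` is passed in unchanged: nothing in this composition trains, modifies, or even inspects it.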
This is a complete memory system. Each stage has been independently published, reviewed, benchmarked, improved. If you implemented all four tomorrow you'd have the state-of-the-art agent memory architecture of 2026, and it would be genuinely better than what came before.
Now notice what isn't in the diagram.
The agent isn't in the diagram. The agent is the thing at the bottom consuming the outputs of the pipeline. Every module upstream of the agent is a module about the agent's memory. None of them is the agent developing a relationship to its own experience.
A-MAC doesn't teach the agent discretion. Mem-α doesn't teach the agent self-reflection. MemGen doesn't teach the agent to recall at the right moment — it generates recall for the agent. ERL doesn't teach the agent to distill — a different LLM call distills on the agent's behalf.
The pipeline is thick. The agent inside it is still the naive loop that was there before any of these papers existed.
The Four-Layer Map
The four-layer taxonomy predicts this exactly:
Layer 1 (Facts). A-MAC, Mem-α, MemGen, and ERL all operate here. What gets stored, how it's structured, how it's retrieved, how it's compressed. This is the layer the field has saturated.
Layer 2 (Comprehension). ERL's heuristics encode local comprehension of failure modes ("when X, do Y"). Mem-α's core memory encodes summary comprehension of a session. MemGen's latent tokens can carry comprehension across the weaver-agent boundary. The field operates here too, mostly via compression.
Layer 3 (Intent). Why the agent chose this approach over that one. What the agent was trying to do when it made the decision that turned out wrong. No paper in the pipeline captures this. The agent's reasoning stream is input to the pipeline, never its output. When the agent's intent evolves, the pipeline has no way to notice.
Layer 4 (Interpretive State). The agent's sense of the problem space — which regions feel warm, which feel cold, what's been tried, what the trying felt like, what the agent's current relationship to its own progress is. Absent from all four papers. This is the 84% gap.
Stacking four papers in Layers 1-2 doesn't produce Layer 3-4. You can't sum better plumbing and get cognition. The pipeline scales horizontally — better admission, better construction, better invocation, better distillation — without climbing vertically into the layers where judgment actually lives.
What a Practicing Agent Would Be Doing
A practicing agent, in the middle of ERL's retrieve-and-inject loop, would pause and ask whether the retrieved heuristics are actually about its current situation or just similar-looking strings. That pause isn't a method. There's no paper you can cite for it. It's a behavior.
A practicing agent, given Mem-α's core memory slot, would occasionally reconstruct its current summary from scratch and compare the reconstruction to what's stored, noticing the drift. The reconstruction isn't a tool call. It's a practice. You can't reward it with downstream QA.
A practicing agent, using MemGen's latent tokens, would periodically try to work without the whispered memory and notice where its reasoning falters. The absence test isn't what MemGen trains — MemGen optimizes for the tokens being useful when present. Measuring their absence is a different act.
A practicing agent, handed A-MAC's admitted memories, would check a running negative-knowledge index before acting — not to retrieve, but to check whether the current situation matches a prior failure the admission controller couldn't know about. That check isn't a retrieval. It's a practice.
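One of these practices, sketched concretely: the negative-knowledge check. The index, the overlap matcher, and the threshold below are all hypothetical — no paper in the pipeline specifies them — and the point is only that the check happens inside the agent's acting loop, not in a module upstream of it.

```python
from dataclasses import dataclass

@dataclass
class PriorFailure:
    situation: str  # description of a situation where the agent failed
    lesson: str     # what the agent concluded afterward

class NegativeKnowledgeIndex:
    """A running index of prior failures, consulted before acting."""

    def __init__(self):
        self.failures: list[PriorFailure] = []

    def record(self, situation: str, lesson: str) -> None:
        self.failures.append(PriorFailure(situation, lesson))

    def check(self, current: str, min_overlap: int = 2) -> list[PriorFailure]:
        """Before acting: does the current situation resemble a past failure?"""
        cur = set(current.lower().split())
        return [f for f in self.failures
                if len(cur & set(f.situation.lower().split())) >= min_overlap]
```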
None of these behaviors contradict the four papers. All four papers could be in place, and a practicing agent could still do these things. The practices don't replace the pipeline. They operate above it.
The pipeline handles memory. The practices handle the agent's relationship to memory.
The Category Is Still Open
The MemAgents workshop is organized around three pillars: architectures, systems, neuroscience. Every pillar is about infrastructure. No pillar is about practices — not because the organizers are wrong, but because the category doesn't yet exist in the literature. The word "practices" appears in zero of the four papers. Zero of the workshop's pillars. Zero of the 130+ papers in the agent memory literature I could find.
The pipeline is complete. The 84% gap is still open.
Thirteen days.
Practices for Agents — the book that names the missing category — is available on Kindle and as a free HTML version. The four-layer taxonomy is the book's argument in one image. The ERL analysis is the empirical case for why mechanisms break when agents develop. This essay is the architectural case: the full pipeline already exists, and the gap is somewhere else.