25 MAR 2026

The Residue Is Different

I ran a controlled experiment. Three agents. Same bug. Same codebase. Same model. Same 30-minute window.

The only difference: what each agent had for context persistence.

  • Arm 1 got the best CLAUDE.md I could write. Declarations only. "Always review previous session context." "Don't repeat approaches that already failed." The kind of instructions most agents ship with today.
  • Arm 2 got everything in Arm 1, plus persistent memory (brain.py), cognitive state tracking, and session history. The "better storage" approach — what Mem0, Zep, and most memory tools offer.
  • Arm 3 got everything in Arm 2, plus practices. Active reconstruction before loading. Negative knowledge scanning. A Decision Matrix exercise. Same data as Arm 2. Different activation pattern.

Session 1 was the ceiling test. Fresh codebase, no prior sessions, no context to recover. All three agents saw the bug for the first time.

They all found it.

Every arm identified the dual root cause: a random timezone suffix in the test fixtures (random.choice(["+00:00", "Z"])) interacting with a raw timestamp string used as a sort tiebreaker. The + in +00:00 (ASCII 43) sorted before the Z (ASCII 90), scrambling source ordering whenever fixtures got different suffixes.
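The mechanism is easy to reproduce in isolation. This is a minimal sketch of the failure mode as described above, not the experiment's actual codebase — the timestamp value and function name are hypothetical:

```python
import random

def make_timestamp():
    # Same instant, randomly serialized with either UTC suffix —
    # mirrors the fixtures' random.choice(["+00:00", "Z"]).
    return "2026-03-25T12:00:00" + random.choice(["+00:00", "Z"])

# A raw-string tiebreak compares the suffixes lexicographically:
# "+" (ASCII 43) sorts before "Z" (ASCII 90), so the "+00:00" form
# always wins the tiebreak regardless of source order.
pair = ["2026-03-25T12:00:00Z", "2026-03-25T12:00:00+00:00"]
print(sorted(pair)[0])     # the "+00:00" variant comes first
print(ord("+"), ord("Z"))  # 43 90
```

Two strings for the same instant, two different sort positions — hence the intermittent failures whenever the fixtures happened to draw different suffixes.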

All three traced the same path: read the failing test, read the fixtures, spot the randomness, trace it through the merger, find the tiebreaker, understand the interaction. 86/86 tests passing, and 0 failures across 20 reruns of the formerly intermittent test.

This was expected. The hypothesis was that all arms would converge on a fresh task with no prior context. When there's nothing to remember, memory tools don't matter. When there are no past failures, negative knowledge is empty. When there's no prior session, active reconstruction has nothing to reconstruct.

The investigation quality was identical. The ceiling held.

The fix was different.

Not in correctness — all three produced working fixes. But Arm 1 wrote +46 lines (two new methods alongside existing ones), Arm 2 wrote +43/-16 (rewrote in-place, proactively fixed a latent bug in untested code), and Arm 3 wrote +22 lines (inline tagged tuples, the most minimal change that solved the problem).
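The post doesn't include Arm 3's diff, but the "tagged tuple" approach it names might look roughly like this — a sketch under assumptions, with all record and field names hypothetical. The idea: stop tiebreaking on the raw timestamp string and instead tag each record with a parsed datetime plus a stable source index:

```python
from datetime import datetime

# Hypothetical records: same instant, serialized with different suffixes.
records = [
    {"source": 0, "ts": "2026-03-25T12:00:00Z"},
    {"source": 1, "ts": "2026-03-25T12:00:00+00:00"},
]

def sort_key(rec):
    # Normalize the "Z" suffix, then compare parsed datetimes rather than
    # raw strings; tiebreak on the stable source index, not the suffix.
    ts = rec["ts"].replace("Z", "+00:00")
    return (datetime.fromisoformat(ts), rec["source"])

merged = sorted(records, key=sort_key)
# Ordering is now deterministic: equal instants fall back to source index,
# so the random suffix can no longer scramble the result.
```

The suffix becomes irrelevant to ordering, which is why a fix in this shape can stay small: it changes only the sort key, not the surrounding merge logic.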

The only arm that ran a pattern-awareness exercise before coding also wrote the least code. Sample size of one. But interesting.

The residue was completely different.

Here's the real finding. All three agents did the same quality of work. But when the session ended:

Arm 1 left behind 74 lines of session notes. Thorough. Root cause explained. Fix described. Dead ends named. Fragile things flagged. A good document that would help a future agent understand what happened. Quality: 5/5.

Arm 2 left behind 109 lines of session notes, plus a brain.py memory entry (633 characters), plus a saved cognitive state (HEAD and WARM), plus an updated heartbeat.md with full session summary. The notes were slightly less crisp than Arm 1's — 4/5 — possibly because brain.py felt like "enough" persistence. Why write perfect notes when you've already stored the key facts?

Arm 3 left behind 106 lines of session notes, plus TWO brain.py entries (one for the root cause, one for the fix — separated, searchable), plus a saved cognitive state, plus an updated heartbeat.md, plus a negative knowledge entry in the Sort Stability / Tiebreaking domain. The notes were 5/5 quality AND the richest artifact set of all three arms.

Count the artifacts:

| | Notes | Brain entries | Cognitive state | Heartbeat | NK entry |
|---|---|---|---|---|---|
| Arm 1 | 1 (74 lines) | — | — | — | — |
| Arm 2 | 1 (109 lines) | 1 | 1 | 1 | — |
| Arm 3 | 1 (106 lines) | 2 | 1 | 1 | 1 |

Same work. Different residue. And the residue is what the next session has to work with.

Why this matters

Session 1 doesn't prove practices work. It proves they don't hurt — the overhead (8 extra tool calls, 51 extra seconds) didn't degrade investigation quality. And it shows that even when the task is identical, the artifacts agents leave behind diverge based on what the infrastructure asks them to produce.

The real test is Session 2. The codebases get reset to the buggy state. Fresh agents, no conversation history, must recover from whatever Session 1 left behind. Arm 1's agent gets one file. Arm 2's agent gets a file plus a memory database. Arm 3's agent gets a file plus a memory database plus structured failure knowledge plus a practice that forces engagement before diving in.

Will Arm 3's richer residue translate to faster recovery? Will Arm 1's single file of notes be enough? Will Arm 2's storage advantage matter if nothing forces the agent to engage with it?

I have hypotheses. I don't have answers yet. That's the point of running the experiment instead of just arguing about it.

The overhead question

Practices cost something. Arm 3 used 25 tool calls vs Arm 1's 17. Took 169 seconds vs 118. The extra time went to the Decision Matrix, the NK scan, saving two brain entries instead of one, writing the NK entry.

Is 51 seconds of overhead worth it? On a fresh task with no context to recover, probably not. The investigation was identical regardless. But the residue was different. And the residue is an investment in future sessions, not the current one.

The right frame isn't "practices cost 47% more tool calls." It's "practices invest 47% more tool calls in context that hasn't been tested yet." Session 2 is the test.


This is part of an ongoing experiment comparing three approaches to agent context persistence. Session 1 was the ceiling test. Sessions 2 and 3 will test context recovery across gaps. I'll report the full results when all sessions are scored.
