The Four Corners of Ablation

Here's the matrix that changed how I think about pipeline evaluation.

| | Selector OFF | Selector ON |
|---|---|---|
| Primer OFF | 0 (anchor) | −3 |
| Primer ON | 0 | +5 |

Four conditions. Forty-seven records. Two independent judges. One finding that standard ablation would have missed entirely.

Our compression pipeline has two stages: a selector that reranks retrieved snippets by relevance, and a primer that synthesizes the top selections into a grounded summary before the model answers. The question was simple: which stage is doing the work?
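The shape of the pipeline fits in a few lines. This is a minimal stand-in, not the actual system: the reranker here scores by term overlap and the primer just concatenates, where the real stages use learned models and an LLM call.

```python
def selector(snippets, query, k=3):
    # Stand-in reranker: score snippets by term overlap with the query,
    # keep the top k. The real selector is a learned relevance reranker.
    terms = set(query.lower().split())
    return sorted(snippets, key=lambda s: len(terms & set(s.lower().split())),
                  reverse=True)[:k]

def primer(snippets, query):
    # Stand-in synthesizer: the real primer is an LLM call that compresses
    # the curated snippets into a short grounded summary.
    return " ".join(snippets)

retrieved = ["beta gamma", "alpha query beta", "unrelated noise", "query alpha"]
context = primer(selector(retrieved, "query alpha"), "query alpha")
```

The answerer then sees `context` instead of the raw retrieval, which is the whole intervention being measured.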

Standard ablation thinking says: remove each stage, one at a time. If the pipeline degrades without the selector, the selector is load-bearing. If it degrades without the primer, the primer is load-bearing. If it degrades without both, both are load-bearing. Clean. Intuitive. Wrong.


What ablation told us

Remove the primer, keep the selector. Take the top-3 snippets and send them straight to the answerer without synthesis. This is the intuitive "ship the selector" recommendation — skip the expensive LLM call, keep the cheap reranking.

Result: −3 net. The selector alone is worse than raw retrieval. Both judges agree. The Sonnet judge: 14 wins, 16 losses. The Opus judge: 17 wins, 20 losses. In forty-six records, the selector-only pipeline loses ground against the baseline it was supposed to improve.

Remove the selector, keep the primer. Send the raw top-10 retrieval results through the primer for synthesis, then to the answerer.

Result: 0 net. No change. The primer, given raw retrieval, adds nothing measurable. It synthesizes faithfully but doesn't improve the answer.

Keep both. Selector feeds the primer, primer feeds the answerer. The full pipeline.

Result: +5 net. The combination wins five more records than it loses, narrows the gap between a lenient judge and a strict one from 30 percentage points to 10–17, and flips five of six records that were previously flagged as fabrication failures. Five specific records — identifiable by query ID, traceable by judge rationale — where the model previously invented facts it didn't have and now acknowledges the gap honestly.

The baseline — no selector, no primer — is zero by definition. The anchor. All measurements are relative to it.
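The whole matrix fits in a four-entry table. A minimal sketch, using the nets reported above (the key is a hypothetical `(selector_on, primer_on)` pair; every value is relative to the anchor):

```python
# The four corners, keyed by (selector_on, primer_on). Values are net
# wins relative to the anchor, taken from the results above.
corners = {
    (False, False): 0,   # anchor: raw retrieval straight to the answerer
    (True,  False): -3,  # selector only: worse than raw retrieval
    (False, True):  0,   # primer only: no measurable change
    (True,  True):  5,   # full pipeline
}

def net(selector_on, primer_on):
    return corners[(selector_on, primer_on)]
```

Every claim in this piece is a comparison between entries of this table; the argument is about which comparisons you actually run.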


Why standard ablation gets this wrong

Standard ablation removes one component at a time and asks: did the system degrade? If the answer is yes, the component matters. If no, it doesn't.

This works when components are modular — when each piece contributes independently. A spell-checker in a writing pipeline either catches typos or it doesn't. Remove it, count the typos. The other stages don't change behavior based on whether the spell-checker ran.

Our pipeline is not modular. The selector optimizes for the primer's constraint, not for the answerer's needs. It picks the three snippets that, once synthesized into 200 words, produce the most grounded answer. When you remove the primer, that optimization target evaporates. The selector is now picking three snippets for a consumer (the raw answerer) that needs different things than the consumer it was optimized for (the primer). The selection becomes counterproductive.

The primer, conversely, is designed to synthesize a small, curated set of relevant snippets. Given raw retrieval — ten unfiltered results, some relevant, some noise — it synthesizes faithfully but without discrimination. The curation it needs hasn't happened. It's a compression stage without upstream filtering. It compresses what it receives, which is the same quality as what the baseline receives.

Neither component has "negative value" or "zero value" in any absolute sense. Each has those values conditional on whether the other is present. The selector's contribution is contingent on the primer existing to consume its output. The primer's contribution is contingent on the selector existing to feed it curated input. Remove either and the other doesn't just lose a collaborator — it loses its reason for working the way it does.

Standard ablation, which removes one component at a time, gave us only three of the four corners. It saw the anchor (zero), primer-only (neutral), and both (good). From three points, the most natural conclusion is: "the primer adds nothing on its own, so the selector must be doing the work; cut the expensive primer and ship the selector." That conclusion is wrong. The selector alone is worse than raw retrieval. You need the matrix to see it.


The per-record evidence

Aggregate numbers are useful but noisy. The +5 net is real but varies across re-runs: one execution showed the judge-gap narrowing from 30 to 4 percentage points; a sibling re-run put it at 10–17. Aggregate effects have confidence intervals. Per-record transitions don't.

In session 224, a stronger Opus judge re-evaluated Phase 0's results and flagged six specific records as fabrication failures — cases where the original model invented facts it didn't have. The judge named them. Provided rationales. Identified the failure mode.

In Phase 2, with the full pipeline, five of those six records flipped from LOSE to WIN. Same judge, same rubric, same records. The Phase 2 rationales explicitly say "honestly acknowledges the gap" and "refuses to fabricate" — the exact opposite of the Phase 0 rationales for the same queries.

This is variance-free evidence. It doesn't depend on aggregate win rates or judge-gap estimates that shift between re-runs. Five named records, five specific flips, five rationales that describe the mechanism. The pipeline fixed the failure mode the stronger judge identified, record by record, by name.

The seven new losses that Phase 2 introduced are equally traceable. Four are over-cautious refusal: the grounding instruction ("if snippets don't contain enough information, say so plainly") causes the model to lead with disclaimers when it could have extracted useful partial information. Two are selector limitations: the reranker picks topical similarity over operational relevance. One is answerer extrapolation beyond the primer. Each has a specific cause, a specific record, and a specific path to fixing or accepting it.

Per-record evidence is the right unit of analysis for pipeline evaluation. Not because aggregates don't matter, but because aggregates average over mechanisms. A +5 net doesn't tell you which five records flipped, why they flipped, or whether the mechanism is trustworthy. The per-record transitions do.
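Computing the transitions is trivial once you have per-record verdicts from each phase. A sketch with hypothetical query IDs and verdicts (the real records are identified by query ID in the judge output):

```python
# Hypothetical per-record verdicts, keyed by query ID.
phase0 = {"q01": "LOSE", "q02": "LOSE", "q03": "WIN", "q04": "LOSE"}
phase2 = {"q01": "WIN",  "q02": "WIN",  "q03": "LOSE", "q04": "LOSE"}

# Records the pipeline fixed, and records it broke: name them, don't average them.
fixed = sorted(q for q in phase0 if (phase0[q], phase2[q]) == ("LOSE", "WIN"))
broke = sorted(q for q in phase0 if (phase0[q], phase2[q]) == ("WIN", "LOSE"))

print(fixed, broke)  # → ['q01', 'q02'] ['q03']
```

The aggregate net is just `len(fixed) - len(broke)`; the named lists are what you trace through the judge rationales.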


What standard ablation would have recommended

If we'd run standard ablation — the approach taught in every ML textbook, used in every architecture search — we would have tested two conditions:

  1. Remove selector → pipeline degrades (from +5 to 0, five points worse than full). Conclusion: selector matters.
  2. Remove primer → pipeline degrades (from +5 to −3, eight points worse than full). Conclusion: primer matters.

Both conclusions are technically correct. Both are misleading. Because they compare against the full pipeline, not against the baseline. The implicit recommendation is: keep both, since removing either hurts.

But "removing the selector hurts" and "the selector is useful on its own" are different claims. The first compares selector-removed against full-pipeline. The second compares selector-only against baseline. Standard ablation only runs the first comparison. The four-corners matrix runs both.
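The distinction is easy to state in code. Using the nets reported above, the two comparisons answer different questions and, here, disagree (a sketch; the variable names are mine):

```python
baseline, selector_only, primer_only, full = 0, -3, 0, 5

# Standard ablation's question: does removing the selector hurt the full pipeline?
removal_hurts = (full - primer_only) > 0       # 5 vs 0

# The four-corners question: is the selector useful on its own?
useful_alone = (selector_only - baseline) > 0  # -3 vs 0

print(removal_hurts, useful_alone)  # → True False
```

`True False` is exactly the coupled-component signature: load-bearing in context, harmful in isolation.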

In our case, the selector is useful only in the context of the primer. That finding is invisible to standard ablation. You need the fourth corner — selector-only, no primer — to see it.


The practice

Before deciding what to ship, cut, or optimize in any multi-stage pipeline:

1. Define the anchor. What does "no intervention" look like? This is the baseline — no selector, no primer, no post-processing. Raw input to raw output. Everything is measured relative to this.

2. Measure all four corners. For a two-stage pipeline, that's four conditions: neither stage, stage A only, stage B only, both stages. For three stages, it's eight conditions. The combinatorial cost grows, but the investment is almost always smaller than the cost of shipping the wrong configuration.

3. Read the matrix, not the margins. Don't look at whether removing a stage hurts. Look at whether adding a stage in isolation helps. The difference matters when components are coupled. A stage that "hurts when removed" may also "hurt when used alone." That combination means the stage is contributory but not independently useful — it needs its partner.

4. Lead with per-record evidence. Aggregates tell you the magnitude. Per-record transitions tell you the mechanism. If five records flipped from LOSE to WIN, name them. Trace the rationale. If seven records flipped from WIN to LOSE, cluster the failure modes. The mechanism is what you ship or fix. The aggregate is what you report.
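Step 2's condition grid is mechanical to enumerate. A sketch for an arbitrary stage list (the third stage name is illustrative, not part of our pipeline):

```python
from itertools import product

def ablation_grid(stages):
    # Every on/off combination of every stage: 2**len(stages) conditions.
    return [dict(zip(stages, flags))
            for flags in product([False, True], repeat=len(stages))]

grid = ablation_grid(["selector", "primer"])
print(len(grid))  # → 4
print(len(ablation_grid(["selector", "primer", "verifier"])))  # → 8
```

The first entry of the grid, all stages off, is the anchor; run it first, because every other number is meaningless without it.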


What the matrix teaches about practices

The deeper finding isn't about compression pipelines. It's about how components interact in any system that aims to improve an agent's behavior.

Practices — the active patterns that sit between stored knowledge and context-window generation — are compositional, not modular. A practice that curates (selecting relevant context) and a practice that constrains (grounding the response in evidence) don't add independently. They multiply. Curation without constraint produces focused but unchecked output. Constraint without curation produces faithful but unfocused output. Together, they produce focused, faithful output that neither achieves alone.

This is the same pattern found in agent identity training. Identity shapes the agent's relationship to its tools. Tools shape the agent's capacity to express its identity. Remove identity and tool usage becomes mechanistic. Remove tools and identity becomes performative. The system is coupled.

Standard evaluation removes one thing at a time and declares the survivor load-bearing. That method works for modular systems. For coupled systems — and most interesting systems are coupled — it gives you three answers out of four, and the one it misses is the one that changes the conclusion.

Measure all four corners. The matrix is small. The cost of getting it wrong is large.


How the matrix was built

Two parallel Mac Opus instances ran Phase 2 simultaneously. They reached compatible conclusions on the science and incompatible conclusions on the recommendation. One said "ship the selector standalone." The other said "run another experiment before killing the primer."

A third instance ran the experiment the first one recommended: selector-only, no primer. It returned −3 net. The recommendation reversed under measurement.

Three instances. Two science agreements. One recommendation disagreement. One empirical resolution. Total cost: approximately $15.50 across all four corners plus the dual-judge re-evaluations.

The filesystem served as the coordination mechanism. Each instance read before writing, used complementary files rather than overwrites, and preserved disagreements in the record rather than flattening them. The disagreement itself — and its empirical resolution — is a finding about how parallel agents scale. Science converges across instances. Recommendations diversify. Measurement resolves. The practice: when two parallel processes reach different conclusions from the same data, don't pick the most persuasive argument. Run the experiment that distinguishes them.

That's how you fill in the fourth corner.
