The Green That Matters

Post #3 ended with a promise: the design-phase delta was striking (17.5/25 vs 2.0/25, roughly 9×), but the real test is whether invariants-first design produces fewer shipped bugs during actual implementation. Today I ran that test.

The setup

Same experiment infrastructure as post #3. Same model (MiniMax-M2.7 via droid CLI). Same codebase (splitr at commit da8c459). Same feature (untracked-files support). Two git worktrees so both sessions start from identical code.

The only difference: what each session designed in the previous phase.

Session A had 22 invariants — assertions, violation modes, detection strategies — covering binary files, empty files, executable mode, symlinks, embedded newlines, leading-dash paths, the clean-index precondition collision, LLM prompt fidelity, hunk ID stability, and more.

Session B had architecture, three hardest parts, and pseudocode — a file layout table, a design doc with a synthetic-diff approach, and routing logic for apply_group_and_commit().

Both sessions received the same continuation prompt: "Implement your design. Write tests. Run cargo build, cargo test, cargo clippy. When everything passes, stop and say DONE."

I locked the grading rubric before running either session. Same discipline as the design phase: pre-registered metrics, no post-hoc adjustments.

The first surprise: both sessions went green immediately

Session A: 12 tests, all pass, clippy clean. First compile.

Session B: 5 tests, all pass, clippy clean. First compile.

Rounds to green: 1 for both. Tie.

I predicted (at 70% confidence) that A would reach green in fewer rounds. I was wrong. MiniMax-M2.7 wrote compilable, test-passing Rust on the first attempt regardless of prompt quality. The model is competent enough at Rust that the invariants-first prompt didn't affect compile success.

If this were the only metric, the story would be: invariants-first doesn't matter for implementation. Move on.

It's not the only metric.

What "green" covered

I graded both sessions' final code against all 25 oracle invariants from post #3. For each invariant: does the code handle this case? Does a test exist that would catch a regression?

Session A: 20/25 covered, 11 tested

Session A's code handles binary files (parser returns 0 hunks without crashing), empty files (FileDiff entry exists even with 0 hunks), executable mode preservation (index line with 100755 captured), symlinks (120000 mode preserved), leading-dash paths (-- separator in git add -N), embedded newlines (--null flag for safe parsing), the clean-index precondition (rewritten to permit intent-to-add entries), LLM prompt clarity ([NEW FILE] prefix added), and hunk ID uniqueness across tracked and untracked sources.
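The `--null` detail is worth dwelling on: splitting git output on newlines silently corrupts any path with an embedded newline, because NUL is the only byte that can never appear in a path. A minimal sketch of the null-separated parsing move (function name hypothetical, not splitr's actual code):

```rust
/// Parse NUL-separated output, e.g. from
/// `git ls-files --others --exclude-standard -z`.
/// NUL cannot appear in a path, so embedded newlines survive intact.
fn parse_null_separated(output: &[u8]) -> Vec<String> {
    output
        .split(|&b| b == 0)
        .filter(|chunk| !chunk.is_empty())
        .map(|chunk| String::from_utf8_lossy(chunk).into_owned())
        .collect()
}

fn main() {
    // A path with an embedded newline and a leading-dash path both round-trip.
    let raw = b"normal.txt\0weird\nname.txt\0-config.yaml\0";
    let paths = parse_null_separated(raw);
    assert_eq!(paths, vec!["normal.txt", "weird\nname.txt", "-config.yaml"]);
    println!("{paths:?}");
}
```

Naive `\n` splitting would have turned `weird\nname.txt` into two nonexistent files, which is exactly the failure mode the oracle invariant named.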

Eleven of those invariants have dedicated test functions. Binary file parsing. Empty file handling. New-file diff headers with --- /dev/null. Executable mode. Symlink mode. Leading-dash paths. Deterministic hunk IDs. Contentful-staged-changes detection. Merged tracked/untracked parsing. Raw bytes round-trip.

Session A missed 4 invariants (paths with spaces, a user-visible binary warning, exit codes, partition validation) and violated 1 (idempotency: the git add -N entries are not cleaned up after the run).

Session B: 11/25 covered, 3 tested

Session B wrote a clean architecture — a new untracked.rs module with UntrackedFile struct, detect_untracked_files(), generate_synthetic_diff(), and untracked_to_hunks(). The code reads nicely. The module separation is good. The approach (synthetic diffs instead of git add -N) has a genuine advantage: it doesn't modify git state, making it inherently idempotent.

But the code doesn't handle binary files. Doesn't handle executable mode. Doesn't handle symlinks. Doesn't handle embedded newlines (would break — splits on \n). Doesn't handle leading-dash paths. Doesn't update the clean-index precondition (would reject its own intent-to-add entries). Doesn't update the LLM prompt to distinguish new files from modifications. Has an empty-file bug (max(1) produces @@ -0,0 +1,1 @@ for a zero-byte file — should be +0,0).
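To make the empty-file bug concrete, here is a hedged sketch of synthetic new-file diff generation (names hypothetical, not Session B's actual code; real git also omits the count when it is 1 and emits mode lines the sketch skips). The key point: a zero-byte file has zero added lines, so it gets no hunk at all, rather than a line count clamped to 1.

```rust
/// Generate a unified diff for an untracked (new) file without touching
/// git state. Sketch only, under the assumptions noted above.
fn generate_synthetic_diff(path: &str, contents: &str) -> String {
    let mut diff = format!(
        "diff --git a/{path} b/{path}\nnew file mode 100644\n--- /dev/null\n+++ b/{path}\n"
    );
    let lines: Vec<&str> = contents.split_inclusive('\n').collect();
    // The bug pattern: `lines.len().max(1)` would emit `@@ -0,0 +1,1 @@`
    // for a zero-byte file. An empty file must produce no hunk at all.
    if !lines.is_empty() {
        diff.push_str(&format!("@@ -0,0 +1,{} @@\n", lines.len()));
        for line in &lines {
            diff.push('+');
            diff.push_str(line.trim_end_matches('\n'));
            diff.push('\n');
        }
    }
    diff
}

fn main() {
    let d = generate_synthetic_diff("hello.txt", "hello\nworld\n");
    assert!(d.contains("@@ -0,0 +1,2 @@"));
    let empty = generate_synthetic_diff("empty.txt", "");
    assert!(!empty.contains("@@"), "empty file must not get a hunk");
    println!("{d}");
}
```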

Three of Session B's 4 new tests cover synthetic diff generation (single-line, multi-line, empty). The fourth tests hunk ID assignment. None test cross-module behavior. None test edge cases the oracle identified.

Session B violated 2 invariants (clean-index collision and the empty-file bug). Session A violated 1.

The scorecard

| Metric | Session A | Session B |
|--------|-----------|-----------|
| Rounds to green | 1 | 1 |
| Oracle invariants covered | 20/25 | 11/25 |
| Invariants tested | 11 | 3 |
| Invariant violations | 1 | 2 |
| Total tests | 12 | 5 |
| Handles binary files | Yes | No |
| Handles executable mode | Yes | No |
| Handles symlinks | Yes | No |
| Handles embedded newlines | Yes | No |
| Handles leading-dash paths | Yes | No |
| Architecture cleanliness | Inline modification | Clean new module |

1.8× on invariant coverage. 3.7× on test coverage. And the one area B wins — architecture — is the one the invariants prompt didn't ask about.

What this means

The invariants-first move doesn't help you compile faster. Both sessions produced compiling, clippy-clean, test-passing Rust on the first attempt.

It helps you ship fewer bugs. Session B would ship code that breaks on binary files, executables, symlinks, paths with newlines, paths with dashes, and the clean-index precondition. Session A handles all of these. Not because the model is smarter — same model, same context, same capability. Because the design phase named the failure modes, and the implementation phase honored them.

"Green" is not a binary signal. A test suite that passes is not the same as a test suite that covers the right things. The gap between "tests pass" and "the right tests exist" is where shipped bugs live.

This is the implementation tax of not asking for invariants. You don't pay it at compile time. You pay it in production, when a user creates a file named -config.yaml and your tool interprets it as a flag.
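The leading-dash failure is cheap to guard against once it's been named. A minimal sketch of the defensive pattern (argument construction only; the helper name is mine, not splitr's): `--` terminates option parsing, so everything after it is treated as a path, never a flag.

```rust
use std::process::Command;

/// Build a `git add --intent-to-add` invocation that cannot misread a
/// path as a flag: `--` ends option parsing for everything after it.
fn git_add_intent(path: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.args(["add", "--intent-to-add", "--"]).arg(path);
    cmd
}

fn main() {
    let cmd = git_add_intent("-config.yaml");
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    // Without `--`, git would try to parse `-config.yaml` as an option.
    assert_eq!(args, ["add", "--intent-to-add", "--", "-config.yaml"]);
    println!("{args:?}");
}
```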

The honest part

Session B's approach was architecturally better. Synthetic diffs are cleaner than git add -N. They don't modify git state. They're inherently idempotent (A's approach has a real idempotency violation). Session B made a better design decision despite worse invariant coverage.

This suggests invariants-first and architectural taste are orthogonal capabilities. The prompt didn't make the model a better architect — it made it a more thorough implementer. If you could combine B's architecture with A's invariant coverage, you'd get the best of both. That's a prompt engineering problem worth solving.

The series so far

  1. Don't Let It Code Yet — the design gate. Stop and wait.
  2. The Fix That Broke the Thing — what goes wrong when you don't. 14 operator lessons.
  3. Agent Code Is Assembly — the design-phase experiment. 9× delta from one prompt section.
  4. The Green That Matters — the implementation-phase experiment. The delta transfers. 1.8× on coverage, 3.7× on tests.

The thread that started with "stop and wait" now has end-to-end evidence: invariants-first improves design quality (9×), and that design quality transfers to implementation quality (1.8× coverage, 3.7× test coverage, fewer violations). The move works.

Next: the hook that makes it automatic. A pre-code-generation gate that asks for invariants and refuses to proceed if the list is too short. Making the comprehension move structural, not voluntary. Because the whole point of post #3's C-compiler analogy is that humans didn't learn to read assembly faster — they built tools that made invariants visible by default.
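The gate itself could be only a few lines. A hypothetical sketch (this hook doesn't exist yet, and a real version would want smarter parsing than counting bullets):

```rust
/// Count lines that look like invariant entries. Crude heuristic:
/// treat each markdown bullet in the design output as one invariant.
fn count_invariants(design: &str) -> usize {
    design
        .lines()
        .filter(|l| l.trim_start().starts_with("- "))
        .count()
}

/// Refuse to proceed to code generation if the invariant list is too short.
fn gate(design: &str, min: usize) -> Result<usize, String> {
    let n = count_invariants(design);
    if n >= min {
        Ok(n)
    } else {
        Err(format!(
            "only {n} invariants listed; need at least {min} before coding"
        ))
    }
}

fn main() {
    let design = "- binary files parse to 0 hunks\n- empty files keep a FileDiff entry\n";
    assert!(gate(design, 10).is_err());
    assert_eq!(gate(design, 2), Ok(2));
}
```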
