The 47% You Can Automate

When I ran the bootstrapping analysis, the number that mattered was 47%. That's how much of a human expert's invariant coverage my auto-generated hints could reproduce. Below the 70% threshold I'd set in advance. The bootstrapping problem, I concluded, is not solved.

But 47% isn't zero.

Forty-seven percent of the things an expert would identify before writing code — edge cases, failure modes, implicit contracts — are detectable from source code patterns alone. File I/O? Check for paths with spaces, permission denied, encoding mismatches. Git operations? Check for detached HEAD, merge conflicts, idempotency. Concurrency? Check for race conditions, deadlocks, cancellation propagation. These aren't subtle. They're the things you know you should check but forget when you're focused on the feature.

The other 53% is different. What existing behavior must not change? What's the trickiest edge case specific to this feature? What assumptions does the current code make that this change might violate? These require understanding — not of the domain in general, but of your codebase and your intent right now.

Two levels of knowing

I packaged this into a tool called intent-prompt. It has two levels.

Level 1 — Structural. The tool scans your source files for patterns. It detects 9 domain categories: file I/O, git, network, database, security, concurrency, parsing, CLI, testing. For each detected domain, it generates specific invariant hints. "Paths with spaces in diff output." "Transaction boundaries (all-or-nothing)." "Exit codes (0 = success, non-zero = failure)." These are the checklist items that are always relevant when a domain is present. Always forgettable. Always automatable.
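Level 1 can be sketched as a keyword scan that maps detected domains to hint strings. The following TypeScript is a hypothetical illustration: the regexes and hint lists are mine, not intent-prompt's actual tables, and a real scanner would cover all nine categories.

```typescript
// Hypothetical domain patterns: each maps a source-code signature to the
// invariant hints that are always relevant when that domain is present.
const DOMAIN_PATTERNS: Record<string, { pattern: RegExp; hints: string[] }> = {
  fileIO: {
    pattern: /\b(readFile|writeFile|fs\.|open\()/,
    hints: ["Paths with spaces", "Permission denied", "Encoding mismatches"],
  },
  git: {
    pattern: /\bgit (diff|commit|checkout|merge)\b/,
    hints: ["Detached HEAD", "Merge conflicts", "Idempotency"],
  },
  concurrency: {
    pattern: /\b(Promise\.all|worker_threads|mutex|lock)\b/,
    hints: ["Race conditions", "Deadlocks", "Cancellation propagation"],
  },
};

// Collect hints for every domain whose signature appears in the source.
function structuralHints(source: string): string[] {
  return Object.values(DOMAIN_PATTERNS)
    .filter(({ pattern }) => pattern.test(source))
    .flatMap(({ hints }) => hints);
}

console.log(structuralHints('fs.readFile("config.json", cb)'));
// → ["Paths with spaces", "Permission denied", "Encoding mismatches"]
```

The point is not the regexes; it's that this level needs no understanding of your code, only its surface.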

Level 2 — Semantic. The tool asks five coaching questions. What must not change. The trickiest edge case. Violated assumptions. Partial failure behavior. User-perceived bugs that aren't bugs. You answer these. Your answers become the invariants that no scanner could produce — because they come from knowing the codebase, not from knowing the domain.

Together, Level 1 and Level 2 approximate a full oracle. Level 1 handles the boring half. Level 2 draws out the hard half.

Why the split matters

The operating-agents series tested invariants-first prompting across 6 posts: observation, experiment, evidence, tool, limits. The smoking gun was post #3: a 9× delta from one prompt section. The model independently named a bug we'd shipped yesterday as today's invariant — when asked.

But here's the thing that matters for tooling: you can't automate the questions that produced that 9× delta. The hand-crafted invariants from Session A — the ones that covered 94% of the oracle — required someone who understood splitr's architecture, its git diff parsing assumptions, its interaction between untracked files and the clean-index precondition. No keyword scanner can produce "how untracked files appear in the LLM grouping prompt." That's semantic, not structural.

What you can automate is the generic-but-important stuff that the same person would list if they weren't busy thinking about the hard parts. Reminders, not insights. Checklists, not understanding. The 47%.

And that turns out to be exactly the right role for a tool.

The boring half is the expensive half

Developers skip invariant enumeration for the same reason they skip tests: it feels like overhead when you already know what to build. But the experiments showed that the implementations without invariants had clean architecture — and missed every edge case. The design looked right. The behavior was wrong in ways that only show up in production.

The boring invariants — the ones that feel obvious when you read them — are the ones that actually prevent the bugs. Not because they're clever. Because they're present. The agent sees "paths with spaces" in the prompt and checks for it. Without the prompt, the agent writes code that works on /home/user/code and breaks on /home/user/my code. Every time.
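The paths-with-spaces failure is easy to reproduce in Node. A minimal sketch, assuming a POSIX system with `cat` on the PATH: interpolating the path into a shell command splits on the space, while passing it as a discrete argument does not.

```typescript
import { execFileSync } from "node:child_process";
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A directory with a space in its name: the case unquoted commands miss.
const dir = mkdtempSync(join(tmpdir(), "my code-"));
const file = join(dir, "notes.txt");
writeFileSync(file, "hello");

// Fragile: execSync(`cat ${file}`) hands the string to a shell, which
// splits on the space and looks for a file named ".../my".

// Robust: the path travels as one argument, with no shell parsing.
const out = execFileSync("cat", [file], { encoding: "utf8" });
console.log(out); // prints "hello"
```

An agent prompted with the "paths with spaces" invariant writes the second form; an agent without it usually writes the first.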

Auto-detecting these means developers don't have to remember them. The tool remembers. The developer's attention goes where it should: Level 2. The semantic stuff. The stuff that actually requires thinking.

What it looks like

  intent-prompt "add caching layer" src/server.ts src/db.ts


The output is a structured prompt with:

  • File context (what the agent should read first)
  • A design phase structure (file layout, hard parts, invariants — before writing any code)
  • Auto-detected structural invariants from your source files
  • Five coaching questions for the semantic invariants you need to identify
  • A stop gate that prevents the agent from writing code before you approve the design
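Assembling those sections can be sketched as a single template function. This is a hypothetical reconstruction from the list above, not intent-prompt's actual output format; the section names and their order are assumptions.

```typescript
// Inputs: the task, the files, the auto-detected Level 1 hints,
// and the developer's Level 2 answers to the coaching questions.
interface PromptInput {
  task: string;
  files: string[];
  structuralHints: string[];
  coachingAnswers: string[];
}

function buildPrompt({ task, files, structuralHints, coachingAnswers }: PromptInput): string {
  return [
    `# Task: ${task}`,
    `## Read first\n${files.map((f) => `- ${f}`).join("\n")}`,
    `## Design phase\nPropose the file layout, hard parts, and invariants before writing any code.`,
    `## Structural invariants\n${structuralHints.map((h) => `- ${h}`).join("\n")}`,
    `## Semantic invariants\n${coachingAnswers.map((a) => `- ${a}`).join("\n")}`,
    `## Stop gate\nDo not write code until the design above is approved.`,
  ].join("\n\n");
}
```

The stop gate is just another section of text, but it is the section that turns the prompt from a request into a review checkpoint.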

Thirty seconds. No configuration. You point it at files, describe the task, and it produces a prompt that covers the boring half and coaches you through the hard half.

The connection

This tool is the companion artifact for Practices for Agents. The book argues that the agent memory field is building storage when it should be building practices — active behaviors that shape how agents think, not passive databases that store what they've seen. Intent-prompt is one such practice, packaged as a tool.

The two-level model maps directly to the book's four-layer taxonomy. Layer 1 (facts) and some of Layer 2 (reasoning patterns) can be auto-detected. Layers 3 and 4 (intent and interpretive state) require the developer. The 47/53 split isn't a number I chose. It's where the boundary falls between what code can tell you and what only understanding can.

The 47% you can automate is the 47% you should automate. Not because it's impressive — because it's boring. Because it frees you to do the 53% that actually matters. Because an agent with a checklist and a coached developer will outperform an agent with neither, every time.

The experiments proved it. The tool packages it. The book explains why.
