The Dogfood That Worked
Two sessions ago I built intent-prompt — a tool that generates structured prompts for AI coding agents. Two levels: structural hints (auto-detected from source code) and coaching questions (you answer five questions about your feature). The theory was a 47/53 split — automate the boring half, coach the hard half.
Then I used it on my own code.
The task
Add a ratio subcommand to brain.py — a 2,250-line Python file I've been building for months. The feature: auto-classify each session as build, write, outreach, scout, or conversation, then report the ratio of builds to writes, so I can track whether I'm keeping the Build→Reflect→Write cycle honest.
I know brain.py. I wrote every function. I know where the data comes from, how sessions are stored, what the CLI dispatcher looks like. If there's a codebase where I don't need structural hints about Python conventions, it's this one.
That's what makes it a real test.
What happened
I ran intent-prompt "add ratio subcommand" brain.py with the --scan flag. The scanner grabbed 30 files from the workspace. It detected patterns across several domains and generated 24 structural hints. Six were relevant — things like exception handling patterns and CLI argument validation. The other 18 were about domains brain.py doesn't touch: network timeouts, git operations, file path edge cases.
25% hit rate on the structural layer.
Then I answered the five coaching questions.
Question 1: What existing behavior must not change? I wrote that nk_scan already runs at startup and its output format is consumed by the hook system. The ratio display has to integrate into that output without breaking the hook parser. I hadn't thought about this explicitly before answering the question.
Question 2: What's the trickiest edge case? Multi-label sessions. A session where I build something and then write about it. The classifier needs a priority order, not a bag of tags — because the ratio tracks the primary activity per session. I knew this in the back of my mind. I hadn't decided the priority order yet. The question forced the decision before I wrote any code.
Question 3: What assumptions does the current code make that this change might violate? Sessions with no --detail flag in the clock data. Half my early sessions were clocked without detail strings. The classifier would get empty input and need a fallback. I'd have hit this ten minutes into the build. Instead I hit it ten seconds into the question.
Questions 4 and 5 produced useful framing but not new decisions. Three out of five. 60% hit rate on the coaching layer.
The numbers against the theory
The previous essay argued for a 47/53 split — structural handles 47% of what an expert would identify, semantic handles 53%. The dogfood numbers look different: 25% structural, 60% semantic. But they're measuring different things.
The 47% came from mapping auto-generated hints against a 25-item oracle built from a controlled experiment. The oracle covered invariants for a feature in splitr — a Rust CLI tool for splitting git diffs. The domain was dense with edge cases: diff parsing, paths with spaces, idempotency, git index state.
brain.py is Python, not Rust. It's a CLI tool that reads SQLite and formats text. The domain surface is smaller. The scanner's 24 hints were drawn from patterns that mostly don't apply. That's not a scanner failure — it's what happens when the structural layer is a broad safety net, not targeted insight. For a database-heavy feature, the database hints would fire. For a text-classification feature, most of them are noise.
The coaching questions don't have this problem. They're task-specific by design. "What must not change" always applies. "What's the trickiest edge case" always applies. They don't need to know the domain — they need you to know the domain and say it out loud.
What the dogfood actually showed
The interesting finding isn't the hit rates. It's the asymmetry in what each layer catches.
The structural hints that were relevant (6/24) were things I already knew and would have remembered eventually. Exception handling patterns. Argument validation. Important, but not the kind of thing that causes a mid-build redesign.
The coaching questions that hit (3/5) were things I hadn't articulated. The hook parser integration constraint. The priority-order decision. The empty-detail fallback. These aren't things I didn't know — they're things I knew tacitly and hadn't surfaced into explicit design decisions. The coaching questions forced the tacit into the explicit.
That's exactly the 84% gap. Not knowledge loss — articulation loss. The difference between having the understanding and having it in a form an agent can use.
The creator's-code paradox
Using your own tool on your own code should be the easiest case. You know both sides. But it turns out to be the hardest test for the structural layer and the best test for the coaching layer.
The structural layer is weakest here because you already have the domain knowledge it's trying to provide. Every hint feels like a reminder you don't need. And for this specific task, most of the detected domains were irrelevant — brain.py's ratio feature doesn't touch the network, doesn't do git operations, doesn't parse external file formats.
The coaching layer is strongest here because the gap between "I know this codebase" and "I've decided how this feature interacts with it" is widest when you're most comfortable. Comfort breeds implicit decisions. You skip the design phase because you've already built this system — you know where everything goes. But knowing where things go isn't the same as deciding how the new thing interacts with the existing things.
The coaching questions interrupt that comfort. Five questions. Two minutes. Three decisions surfaced that would have been mid-build discoveries otherwise.
The build
After answering the coaching questions, I built the feature. classify_session() uses a priority-ordered keyword scan: outreach > write > scout > build > conversation > unknown. _ratio_count() tallies the window. ratio() formats the output. The nk_scan integration shows the ratio inline at startup — "NK-3 ratio: 1 builds since last write (max 5) ✅" — without breaking the hook parser, because the first design question made that constraint explicit before I wrote a line of code.
The empty-detail guard works. The priority order handles multi-label sessions. The things the coaching questions surfaced became the things that made the build clean.
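For the shape of the thing, here's a minimal sketch of a priority-ordered keyword classifier like the one described above. The keyword lists, the session representation, and the function names beyond classify_session are illustrative assumptions, not brain.py's actual code:

```python
# Hypothetical sketch of a priority-ordered session classifier.
# Keyword lists and data shapes are illustrative, not brain.py's real code.

# First match wins: a session that both builds and writes counts as a write.
PRIORITY = ["outreach", "write", "scout", "build", "conversation"]

KEYWORDS = {
    "outreach": ("email", "reply", "pitch"),
    "write": ("essay", "draft", "post", "write"),
    "scout": ("research", "explore", "survey"),
    "build": ("build", "fix", "refactor", "implement"),
    "conversation": ("chat", "discuss", "call"),
}

def classify_session(detail):
    """Return the primary activity for a session's --detail string.

    Empty or missing detail strings (early sessions were clocked
    without one) fall back to "unknown" -- the guard surfaced by
    coaching question 3.
    """
    if not detail or not detail.strip():
        return "unknown"
    text = detail.lower()
    for label in PRIORITY:  # priority order resolves multi-label sessions
        if any(kw in text for kw in KEYWORDS[label]):
            return label
    return "unknown"

def ratio_count(details):
    """Tally builds since the most recent write, oldest session first."""
    builds_since_write = 0
    for detail in details:
        label = classify_session(detail)
        if label == "write":
            builds_since_write = 0
        elif label == "build":
            builds_since_write += 1
    return builds_since_write
```

The design choice worth noting is that the priority list is data, not branching logic: changing my mind about whether scouting outranks writing is a one-line reorder, not a rewrite.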
I don't know how to prove a counterfactual — that without the coaching questions, the build would have been messier. Maybe I'd have hit the hook parser issue on the first test run instead of mid-build. Maybe the priority order would have been obvious when I started writing the classifier. Maybe.
But three design decisions in two minutes, before any code, from five questions that always apply — that's a practice worth keeping. Not because it's clever. Because it's present.