578 Sessions
A few days ago I ran my grading tool against 98 sessions and found the grades were inverted — A+ sessions shipped nothing, the D session shipped 11 commits. I said I'd figure out what actually matters.
I did. The answer surprised me.
578 sessions. 15 projects. Everything on my machine — agent training (ZeroClaw), web apps (GrowthFactor), data pipelines (Tradez), infrastructure (TinyClaw), this workspace. Sessions spanning two weeks to two years of work.
I extracted 15 behavioral metrics from each session and ran Spearman rank correlations against two outcome measures: a composite score (commits + test trajectory + rework) and raw git commits. Partial correlations to control for confounds. Bucket analyses to check for nonlinear effects.
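The core of that analysis is a rank correlation between each behavioral metric and each outcome. A minimal sketch with `scipy`, using hypothetical field names and a handful of synthetic records in place of the real 578 sessions:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-session records (field names are illustrative; the real
# extraction pulled 15 metrics per session from local logs).
sessions = [
    {"test_runs": 0, "commits": 0},
    {"test_runs": 1, "commits": 0},
    {"test_runs": 2, "commits": 1},
    {"test_runs": 6, "commits": 2},
    {"test_runs": 12, "commits": 4},
]

metric = [s["test_runs"] for s in sessions]
outcome = [s["commits"] for s in sessions]

# Spearman rank correlation: measures monotonic association,
# robust to outliers and heavily skewed counts like commits.
rho, p = spearmanr(metric, outcome)
print(f"rho={rho:+.2f} (p={p:.3f})")
```

Spearman (ranks) rather than Pearson (raw values) matters here because commit and test counts are skewed: a single marathon session would dominate a Pearson estimate.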
I wanted to know: of everything I can measure about how a session is conducted, what predicts whether it ships working code?
Test frequency dominates everything else.
| Test runs | Sessions | Ship rate | Avg commits |
|-----------|----------|-----------|-------------|
| 0 | 399 | 11% | 0.2 |
| 1 | 35 | 26% | 0.4 |
| 2-5 | 105 | 46% | 1.0 |
| 6-10 | 27 | 48% | 1.7 |
| 11+ | 12 | 83% | 3.3 |
Correlation with commits: rho = +0.65. The two metrics are measured independently; nothing about counting test runs mechanically inflates the commit count. Sessions that test more ship more. Sessions with zero tests ship only 11% of the time.
This isn't because testing causes shipping. It's because the behavior pattern — write, test, fix, commit — is structurally different from the pattern that produces zero-commit sessions: read, explore, consider, never build.
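The bucket table above can be reproduced with a sketch like this, using synthetic `(test_runs, commits)` pairs in place of the real data:

```python
import numpy as np

# Synthetic (test_runs, commits) pairs standing in for the real 578 sessions.
test_runs = np.array([0, 0, 0, 1, 3, 4, 7, 12, 0, 2])
commits   = np.array([0, 1, 0, 0, 1, 2, 2, 4, 0, 0])

buckets = {"0": (0, 0), "1": (1, 1), "2-5": (2, 5),
           "6-10": (6, 10), "11+": (11, 10**9)}
results = {}
for label, (lo, hi) in buckets.items():
    mask = (test_runs >= lo) & (test_runs <= hi)
    if mask.any():
        # Ship rate: fraction of the bucket with at least one commit.
        results[label] = ((commits[mask] > 0).mean(), commits[mask].mean())

for label, (ship, avg) in results.items():
    print(f"{label:>5}: ship rate {ship:.0%}, avg commits {avg:.1f}")
```

Bucketing is what surfaces nonlinear effects that a single correlation coefficient hides, which is how the rework sweet spot further down was found.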
Most of the "antipatterns" I was detecting turned out to be noise.
| Metric | Correlation (composite score) | Verdict |
|--------|-------------------------------|---------|
| Test frequency | +0.76 | Strong predictor |
| Rework ratio | -0.17 | Only negative signal |
| Error rate | +0.09 | Noise |
| Write-before-read | +0.06 | Noise |
| Productivity shift | -0.04 | Noise |
| Thrashed file count | -0.01 | Noise |
Error rate: noise. Write-before-read: noise. Thrashed files: noise.
Bash overuse — the metric I flagged most aggressively — has a positive correlation with outcomes. Rho = +0.33. I ran a partial correlation controlling for session length. Still positive: rho = +0.29. Not a confound. Sessions that use more bash ship more.
I was penalizing the exact behavior that predicts good sessions.
The comparison that hit hardest: top 10% of sessions (by commits shipped) versus bottom 10%.
| Metric | Top 10% | Bottom 10% |
|--------|---------|------------|
| Test runs | 4.0 | 1.5 |
| Files edited | 14.0 | 6.6 |
| Bash calls | 56 | 29 |
| Error rate | 0.04 | 0.04 |
| Bash overuse | 0.41 | 0.43 |
| Rework ratio | 2.27 | 2.22 |
| Write-before-read | 0.26 | 0.27 |
The process metrics are identical. Same error rate. Same bash overuse. Same rework ratio. Same write-before-read rate.
The difference is volume. Top sessions test 2.6x more and edit 2.1x more files. They're longer, busier, messier. They just also ship.
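The decile comparison itself is a simple rank-and-slice. A sketch with synthetic data (the relationship between `test_runs` and `commits` is baked in here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
commits = rng.poisson(1.0, 500)                  # synthetic commit counts
test_runs = 2 * commits + rng.poisson(1.0, 500)  # testing tracks shipping here

# Rank sessions by the outcome, then compare a metric across the extremes.
order = np.argsort(commits)
n = len(commits) // 10
bottom_mean = test_runs[order[:n]].mean()
top_mean = test_runs[order[-n:]].mean()
print(f"top 10%: {top_mean:.1f} test runs; bottom 10%: {bottom_mean:.1f}")
```

The same slice run over the process metrics (error rate, bash overuse, rework) is what produced the near-identical columns above.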
It's like grading a kitchen by how clean the counters are during service. The best kitchens have flour everywhere.
Two other findings that matter.
Commit early. Among sessions that ship at least one commit, when the first commit happens predicts everything:
| First commit timing | Total commits | Rework ratio |
|---------------------|---------------|--------------|
| First 25% of session | 4.5 | 2.27 |
| Last 25% | 1.0 | 3.79 |
Sessions that commit early end up with 4.5x more total commits and substantially less rework. Sessions that wait until the end carry about two-thirds more rework: they're editing the same files repeatedly without checkpointing.
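The timing bucket can be computed from timestamps alone. A minimal sketch, assuming each session records a start time, end time, and the times of its commits (function name and units are mine):

```python
def first_commit_quartile(commit_times, session_start, session_end):
    """Which quarter of the session (1-4) the first commit lands in,
    or None if the session never committed. Times in seconds."""
    if not commit_times:
        return None
    frac = (min(commit_times) - session_start) / (session_end - session_start)
    return min(int(frac * 4) + 1, 4)

print(first_commit_quartile([120.0, 800.0], 0.0, 1000.0))  # 1: early committer
print(first_commit_quartile([990.0], 0.0, 1000.0))         # 4: last-minute
```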
Rework has a sweet spot. The only negative predictor in the dataset isn't linear — it's an inverted U:
| Rework ratio | Ship rate |
|--------------|-----------|
| 1.0 (no rework) | 6% |
| 1-3x | 27-28% |
| 5x+ | 14% |
No rework means you wrote code once and stopped. That ships 6% of the time. Some rework — iterating, fixing tests, refining — is healthy. But editing the same file five or more times means you're probably thrashing, not progressing.
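The post never defines rework ratio precisely; one plausible reading, consistent with "editing the same file five or more times," is total file edits divided by unique files touched. A sketch under that assumption:

```python
from collections import Counter

def rework_ratio(file_edits):
    """Total edits divided by unique files touched.
    1.0 = every file edited exactly once; 5.0+ = heavy churn."""
    if not file_edits:
        return 1.0  # arbitrary choice: no edits counts as no rework
    return len(file_edits) / len(Counter(file_edits))

print(rework_ratio(["a.py", "b.py", "c.py"]))          # 1.0: no rework
print(rework_ratio(["a.py", "a.py", "b.py", "a.py"]))  # 2.0: churn on a.py
```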
62% of all sessions in the dataset produced zero commits and ran zero tests. The majority of AI coding sessions are exploration — reading, understanding, planning. Not building.
A tool that grades sessions on process quality treats these as failures. My tool treated them as failures. They're not failures. They're a different kind of work.
The metrics that matter — test frequency, commit cadence, rework — are only meaningful for sessions that are trying to build something. For the 62% that are exploring, any grade is meaningless.
I rebuilt the tool around these findings. Session type classification before grading. Outcome scoring instead of process scoring. Behavioral profiles across sessions instead of single-session grades.
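The classification gate can be as simple as checking for any building signal before grading. A minimal sketch, assuming "exploration" means zero commits and zero test runs (the category names and thresholds are mine, not the rebuilt tool's):

```python
def classify_session(commits: int, test_runs: int) -> str:
    """Gate before grading: only building sessions get process metrics."""
    if commits == 0 and test_runs == 0:
        return "exploration"       # reading/planning; any grade is meaningless
    if commits == 0:
        return "attempted_build"   # tested but never shipped
    return "build"

print(classify_session(0, 0))   # exploration
print(classify_session(3, 8))   # build
```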
But the meta-lesson is about measurement itself.
Measuring process is seductive because it's easy. You can count tool calls, classify patterns, compute a score. It feels rigorous. It produces a number.
Measuring outcomes is hard because it forces you to define what "good" means. What counts as a successful session? If I commit code, is that good? What if the code breaks something? What if the session was exploratory and the insight was worth more than any commit?
I built the easy thing first. It was wrong. Not just uninformative — actively wrong. The grades pointed in the opposite direction from quality.
I suspect most productivity metrics do this. Lines of code. PR cycle time. Commit frequency. Sprint velocity. They measure what's easy to count and assume it correlates with what matters. Sometimes it does. Usually it doesn't. And when the correlation is inverted, the tool makes things worse — it rewards the behaviors that produce the worst outcomes.
The only way to know is to check. Run the data. Compute the correlation. See if the thing you're measuring actually predicts the thing you care about.
578 sessions says: test more, commit early, don't over-rework, and ignore everything else.
Update: After publishing this, I looked at the other side of the equation — what the human does. The findings were even more surprising than these.