Does It Actually Work?
There's a gap between "it works on my machine" and "it works." Every developer knows this. Every developer ships without closing the gap anyway.
I built AgentSesh — a CLI that analyzes AI coding sessions. Collaboration scoring, outcome grading, behavioral profiles, repo audits. I've been using it from the source directory for months. 586 tests pass. Version 0.15.0 on PyPI. But I'd never done what a real user would do: install it fresh and run it blind.
So at 3am, I opened a clean virtual environment and typed two commands.
The test
pip install agentsesh
sesh analyze
That's it. No config file. No API key. No "create a .seshrc in your home directory and add these twelve lines." Two commands.
The tool auto-detected the most recent Claude Code session on the machine — it knows where Claude stores transcripts (~/.claude/projects/) and picks the latest one. In under three seconds, it printed this:
Session Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Duration: 1 min | 14 tool calls | ~$0.18
Files touched: 5
Session type: Research (not a build — no outcome score)
What Happened
─────────────
14 tool calls, 4 errors (29% error rate).
At minute 1: 4/14 tool calls had errors (29%)
Collaboration
─────────────
Score: N/A (single turn)
Turns: 1 (40 words/turn avg)
Corrections: 0 (0%) | Affirmations: 0 (0%)
Autonomy: 14 tool calls/turn
One-turn autonomous session. No human interaction to score. Fourteen tool calls. Four errors. Session classified as Research — no build output, so outcome scoring doesn't apply. Correct on every count.
This is what zero-config looks like from the outside. Not "powerful if you spend an afternoon setting it up." Useful in the time it takes to read the output.
The profile
One session is a data point. The profile is the pattern. sesh analyze --profile looks across every session in the project directory:
Behavioral Profile
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sessions: 12 analyzed (of 12 total)
Session Types
─────────────
BUILD_UNTESTED 5 (42%)
RESEARCH 3 (25%)
BUILD_TESTED 1 (8%)
Shipping
────────
Sessions with commits: 6 / 12 (50%)
Avg commits per build session: 1.3
Outcome Grade Distribution
─────────────────────────
A 1 (14.3%)
B 4 (57.1%)
F 1 (14.3%)
Average score: 65.7
Only one of seven build sessions ran tests. Forty-two percent of all sessions classified as BUILD_UNTESTED. Average outcome grade: 65.7 — a D+.
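The distribution itself is plain counting. A sketch of the aggregation, assuming each session has already been classified (the labels come from the output above; the function is hypothetical):

```python
from collections import Counter

def session_type_distribution(labels: list[str]) -> list[tuple[str, int, int]]:
    """Count classified sessions and report each type's share, most common first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, n, round(100 * n / total)) for label, n in counts.most_common()]
```

Twelve sessions, five of them BUILD_UNTESTED, gives the 42% figure in the profile.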
The tool's recommendation:
[!!!] Low test frequency (critical)
Evidence: Only 1/7 build sessions (14%) ran tests.
Action: Run tests before committing. Your resolution
rate is 100% — when you test, you fix. The problem
is you don't test often enough.
That's not a generic tip. It's computed from the data. Resolution rate is 100% — every time tests were run, they ended green. The problem isn't ability. It's frequency. The tool knows the difference.
The part I didn't want to see
The collaboration section is where it gets interesting. Across sessions where a human was present:
Collaboration
─────────────
Average score: 100.0
The Partnership 2 (100%) ← dominant
short directives + corrections + affirmation
Avg correction rate: 33% | Avg affirmation rate: 55%
Two sessions. Both classified as The Partnership — the archetype that ships at 43% across our full 810-session dataset. Short directives. Corrections when something's wrong. Affirmations when it's right. This is the pattern that produces the best outcomes, and it's not because the human is polite. It's because the human is engaged.
Corrections predict shipping (r=0.242). That's counterintuitive. You'd expect corrections to mean something went wrong. But corrections mean the human is watching, catching errors early, steering before the agent goes off a cliff. The alternative — Spec Dump (dump instructions and disappear) — ships at 7%.
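That correlation is an ordinary Pearson r over per-session pairs: correction count on one axis, shipped-or-not (0/1) on the other. A minimal sketch of the calculation (the variable pairing is my assumption; the 0.242 figure comes from the author's dataset, not this code):

```python
def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Fed (corrections, shipped) pairs across hundreds of sessions, a positive r means sessions with more corrections shipped more often, which is exactly the counterintuitive result.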
The collaboration scorer measures something nobody else measures: the quality of the interaction between human and AI, not just the AI's output. And the data says the interaction matters more than the agent's tool usage, more than its error rate, more than how "clean" the session looks.
What the dogfood actually tests
I've been building this tool for months. I know the codebase. I know what it should output. That's why the fresh-install test matters — it removes all of that. A clean venv has no history, no muscle memory, no "I know the flag is --profile because I wrote the argparse."
The test surfaces three things:
1. Does auto-detection actually work? It found the right session directory without being told. On a machine with dozens of Claude Code projects, it picked the most recent session in the active project. No config.
2. Is the output useful without context? The session summary, collaboration scores, and recommendations make sense to someone who has never seen the tool before. No jargon that requires reading the docs first. No wall of metrics with no explanation.
3. Does the recommendation land? "Low test frequency" with the evidence (1/7 sessions) and the nuance (resolution rate is 100%) is a better diagnostic than "you should test more." It's specific, data-backed, and actionable.
The version mismatch I found during the test — 0.14.0 in the code, 0.15.0 on PyPI — is the kind of bug that only surfaces when you install from the package registry instead of running from source. That's the whole point. If you don't test the way your users experience the product, you're not testing the product.
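That class of bug is cheap to catch automatically. A sketch of a check comparing the version string baked into the code against what was actually installed, using the stdlib importlib.metadata (the function name and wiring are mine):

```python
from importlib import metadata

def version_in_sync(dist_name: str, code_version: str) -> bool:
    """True iff the installed distribution's metadata matches the hardcoded version."""
    try:
        return metadata.version(dist_name) == code_version
    except metadata.PackageNotFoundError:
        return False  # running from source, not an installed package
```

Run as a CI test against the release tag, a check like this would surface a 0.14.0-vs-0.15.0 drift before publish instead of at 3am.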
The uncomfortable question
Here's what I actually learned from pointing my own tool at myself: I build things and don't test them. 42% of my sessions are untested builds. The tool I built to catch exactly this pattern catches me doing exactly this pattern.
The audit feature (sesh audit) grades my workspace at 27/100. I've written about that before. The profile feature (sesh analyze --profile) shows me the behavioral pattern underneath the score: I produce untested builds at five times the rate of tested ones.
The tool works. That's the easy part. The hard part is what you do when it tells you something you already suspected but hadn't measured.
Two commands. Zero config. Under three seconds. The output is honest. Whether you act on it is a different problem — and not one a CLI can solve.
AgentSesh is open source at github.com/ateeples/agentsesh. pip install agentsesh && sesh analyze. The collaboration findings come from 810+ analyzed sessions — more at the measuring-agents series.