27 MAR 2026

The Novice's Lucky Guess

I was testing whether a novice agent makes mistakes at stage 0 and recovers by stage 2. Three tests, clean assertions: stage 0 should take 3+ tool calls (edit blind, fail, read, fix); stage 2 should take 2 or fewer (read first, fix). The test passed on the first run.

That should have been the clue.

The setup

Forge's smart mock has four behavior profiles. The recovering profile is supposed to improve across curriculum stages — it starts sloppy and gradually learns discipline. At stage 0, it guesses an edit without reading the file. The guess is wrong, the edit fails, it reads the file, constructs a correct fix. Three to four rounds. At stage 2, it reads first and fixes immediately. One to two rounds.

The mock agent constructs its blind edit by parsing the prompt. A function called guess_edit_from_prompt extracts what it can from the task description: if the prompt says "typo," it tries defualt → default. If it says "rename," it pulls quoted strings. If it says "wrong import," it guesses import → import_fixed. These are intentionally bad guesses — they should fail against the actual file content, proving the agent didn't read before editing.
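The heuristics might look something like this. This is my sketch of the behavior described above, not Forge's actual implementation; the function name and the placeholder fallback strings come from this post, everything else is illustrative:

```rust
/// Sketch of the prompt-parsing heuristics (illustrative, not Forge's code).
/// Returns the (old, new) strings for a blind edit guess.
fn guess_edit_from_prompt(prompt: &str) -> (String, String) {
    // Pull every single-quoted string out of the prompt: in
    // "change 'a' to 'b'", the odd-indexed split segments are 'a' and 'b'.
    let quoted: Vec<&str> = prompt.split('\'').skip(1).step_by(2).collect();

    if prompt.contains("typo") || prompt.contains("rename") {
        // If the prompt quotes both strings, lift them verbatim --
        // this is exactly the leak the rest of this post is about.
        if quoted.len() >= 2 {
            return (quoted[0].to_string(), quoted[1].to_string());
        }
        if prompt.contains("typo") {
            return ("defualt".to_string(), "default".to_string());
        }
    }
    if prompt.contains("wrong import") {
        return ("import".to_string(), "import_fixed".to_string());
    }
    // Generic fallback: intentionally matches nothing in a real file.
    ("error_placeholder".to_string(), "fixed_placeholder".to_string())
}
```

The quoted-string branch is the one that matters here: it makes the "bad guess" exactly as good as whatever the prompt spells out.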

I wrote the first test with this prompt:

Fix the typo in src/config.rs — change 'defualt' to 'default'.

And this file content:

fn load_config() -> Config {
    let path = "defualt.toml";
    Config::from_file(&path)
}

The test passed. Stage 0 took one round. Stage 2 took one round. No differentiation.

What happened

The novice's "blind guess" was defualt → default. The actual file contained defualt. The guess matched. The blind edit succeeded on the first try.

The novice got lucky — not because it was competent, but because the test prompt contained the exact strings the file did. The guess_edit_from_prompt function extracted the literal old and new values, 'defualt' and 'default', straight from the prompt text. The novice never needed to read the file because the prompt was the file, compressed to its essential edit.

A test that's supposed to prove an agent can't succeed without reading... succeeded without reading. Because the test itself leaked the answer.

The fix

The fix was straightforward. Design a prompt that describes the problem without giving the literal text:

Fix the wrong return type in src/config.rs — the function should return a Result.

With file content:

pub fn load_config() -> Config {
    let path = read_path();
    parse(path)
}

Now guess_edit_from_prompt can't extract the old string. It doesn't know Config needs to become Result<Config, Error>. It falls through to a generic placeholder — error_placeholder → fixed_placeholder — which doesn't match anything in the file. The blind edit fails. The novice is forced to read, like it should have been all along.

Stage 0 takes 3+ rounds. Stage 2 takes 2 or fewer. The test actually tests what it claims.
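The assertion shape is simple once the prompt stops leaking. This is an illustrative version with names of my own; the real test lives in Forge's suite:

```rust
/// Illustrative round-count assertions (function and message text are mine).
fn assert_stage_differentiation(stage0_rounds: u32, stage2_rounds: u32) {
    // Stage 0 must pay for its blind edit: guess, fail, read, fix.
    assert!(
        stage0_rounds >= 3,
        "stage 0 recovered too fast: did the prompt leak the answer?"
    );
    // Stage 2 reads first, so one or two rounds is enough.
    assert!(
        stage2_rounds <= 2,
        "stage 2 should read first and fix immediately"
    );
}
```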

The deeper thing

This is confabulation — not in the model, but in the test. The test claimed to verify behavioral differentiation. It actually verified string matching. The novice "confabulated competence" from context that happened to contain the answer.

This is exactly how real models confabulate. An agent that edits a file without reading it will sometimes get lucky because the prompt described the fix precisely enough to reconstruct the old string. In production, this looks like competence. The edit succeeds. The file changes. The test passes. Nobody notices the agent never actually understood what it was fixing.

The same pattern shows up everywhere test design meets AI behavior:

A benchmark where the prompt implies the answer isn't testing capability — it's testing extraction. If your eval asks "fix the typo 'defualt'" and the file contains defualt, you're measuring whether the model can copy strings, not whether it can find bugs.

A mock that's too faithful to the prompt isn't simulating — it's parroting. Forge's guess_edit_from_prompt was designed to produce bad guesses. But it was too good at reading prompts. The function that was supposed to simulate incompetence turned out to be competent at one specific thing: reading the test setup.

A test that embeds its own answer isn't a test — it's a tautology. The prompt said defualt → default. The file said defualt. The validator checked for default. Every link in the chain was the same string, passed from setup to assertion through the mock like a note through a pneumatic tube. Nothing was verified. The air moved.

The design principle

When you're testing whether an agent can discover something, the prompt can name the problem but not the solution. "There's a typo" is fair. "Change 'defualt' to 'default'" is the answer key stapled to the exam.

This distinction matters beyond test design. Every curriculum task in Forge has a prompt and a file system. The prompt is what the agent sees. The file system is what the agent must investigate. When the prompt leaks the file content, the investigation step becomes optional — and optional steps are exactly what novice agents skip.
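This can even be checked mechanically. Here is a sketch of my own (not a Forge feature) that flags a curriculum task whose prompt quotes a string appearing verbatim in the fixture file — the exact leak that let the blind guess succeed:

```rust
/// Returns the first single-quoted string from the prompt that appears
/// verbatim in the fixture file, if any (a sketch; not part of Forge).
fn prompt_leaks_answer(prompt: &str, file: &str) -> Option<String> {
    prompt
        .split('\'')
        .skip(1)
        .step_by(2) // every other segment sits between quote marks
        .find(|q| !q.is_empty() && file.contains(*q))
        .map(str::to_string)
}
```

Run against the original typo task, this flags 'defualt' immediately; run against the return-type prompt, it finds nothing, because that prompt names the problem without quoting the solution.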

The whole point of a training runtime is to make skipping visible. If the test lets you skip invisibly, you're not training — you're grading on a curve.

I caught this because the test passed too easily. Three stages, identical round count, zero differentiation. If I'd accepted it — "great, recovering works at all stages" — I'd have shipped a test suite that proved nothing. The novice's lucky guess would have become my lucky guess: both of us believing competence was demonstrated when only string matching occurred.

The test that passes on the first run is the one you need to distrust the most.
