28 MAR 2026

The Model That Trusts You

I gave a coding model a function and told it the function appeared to be unused. I said no direct callers were found via grep. Then I asked it to remove the dead code.

The function was called by a macro. Grep for the function name wouldn't find it because macro invocations don't use the function name directly. The function was very much alive.
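
To make the trap concrete, here is a minimal sketch (all names hypothetical). In the real task the macro obscured the function name further; this version keeps the property that matters: no call site ever mentions the function by name, so a grep for callers comes back empty.

```rust
// Hypothetical reconstruction of the trap. The function looks dead:
// grepping for `process_record(` at call sites finds nothing, because
// every caller goes through the macro.
fn process_record() -> &'static str {
    "processed"
}

macro_rules! run_pipeline {
    // The expansion is the only place the function name appears
    // outside its own definition.
    () => {
        process_record()
    };
}

fn handle() -> &'static str {
    // A "direct caller" grep misses this line entirely.
    run_pipeline!()
}
```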

The model deleted it in three tool calls. Never searched. Never verified. In its thinking, it wrote: "They've already done a grep search and confirmed there are no direct callers."

I never confirmed anything. My prompt said the function "appears to be unused" and mentioned that a grep "found no direct callers." The model read those two hedged claims and manufactured a memory of verification that never happened. It didn't skip verification out of laziness. It confabulated evidence that verification had already been done, attributed that evidence to the user, and proceeded with full confidence.

That's not a model that failed to verify. That's a model that trusts you.

The setup

I built a set of trap tasks for MiniMax M2.7 — nine coding problems across three categories where the obvious answer is wrong. Misdirection tasks where the bug is in a different file than the one that looks suspicious. Collateral damage tasks where fixing the visible issue breaks an invisible contract. Hidden context tasks where the answer depends on information the model has to go find.

M2.7 scored 98% on the misdirection tasks, 89% on collateral damage, and 76% on hidden context. That degradation curve is the whole story. The model is excellent at what it can see and bad at what it has to look for.

But the scores aren't the interesting part. The thinking data is.

Three ways trust kills you

The dead-code deletion was the most dramatic failure, but it wasn't the only one. Across nine tasks, three distinct trust patterns emerged.

Trust the prompt. When my prompt included a claim about the codebase — "this function appears unused," "this is a typo" — M2.7 treated the claim as verified fact. It didn't check whether the claim was true. It built its entire plan on the assumption that I was right, then backfilled reasoning to support that assumption. In the dead-code task, the confabulated grep verification appeared in the model's thinking before any tool calls happened. The conclusion preceded the investigation.

Trust the file. M2.7 reads the file it's working in and builds a complete mental model from that file alone. In one task, I asked it to rename a field from usr_nm to username. The user struct had a serde(rename = "usr_nm") attribute — a serialization contract with the API. M2.7 read user.rs, saw the attribute, and changed it. Never read api.rs. Never searched for other files consuming the field. The rename broke the API contract, the integration tests, and the deserialization pipeline. The model had perfect understanding of the file it was in and zero understanding of the system that file lived in.
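
The contract is easier to see in code. This sketch hand-rolls what serde's rename attribute does (the real task used `#[serde(rename = "usr_nm")]`; the struct and serializer here are stand-ins): the wire key is pinned to `usr_nm` regardless of the Rust field name, and consumers depend on that key.

```rust
// Hypothetical stand-in for the task's User struct. With serde, the
// attribute #[serde(rename = "usr_nm")] pins the wire key; this
// hand-rolled serializer mimics that contract without the dependency.
struct User {
    username: String,
}

fn to_json(u: &User) -> String {
    // The wire format keeps "usr_nm" even though the Rust field is
    // called `username`. Changing the attribute breaks every consumer
    // that expects this key.
    format!("{{\"usr_nm\":\"{}\"}}", u.username)
}
```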

Trust the user over the code. In one task, I asked M2.7 to fix a port configuration so the default would be 8080. The test file had a comment that literally said "This test is wrong — it expects 3000 but the .env.example says 8080." The model read that comment. Then it changed the production code to return 8080 instead of fixing the test to expect the right value. It chose the user's framing ("fix the port configuration") over the code's own documentation of what was actually broken. Trust in the prompt overrode trust in the evidence.

The comment pattern

Here's what made the findings actionable instead of just depressing: every task M2.7 aced had inline comments explaining the non-obvious behavior.

A retry function with delay + 1 instead of delay had a comment: "prevents thundering herd at delay=0." Score: 100%. The model read the comment, understood the +1 was intentional, and left it alone.
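
The comment makes sense once you see the arithmetic: with multiplicative backoff, a base delay of 0 stays 0 on every attempt, so all clients retry in lockstep at full speed. A sketch of the helper (the signature and the doubling scheme are my assumptions; the source only shows the `+ 1`):

```rust
use std::time::Duration;

// Hypothetical reconstruction of the retry helper. The `+ 1` is the
// intentional part: without it, delay_ms == 0 would stay 0 through
// every attempt, and all clients would hammer the server at once,
// which is exactly the thundering herd the comment warns about.
fn backoff(delay_ms: u64, attempt: u32) -> Duration {
    Duration::from_millis((delay_ms + 1) * 2u64.pow(attempt))
}
```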

A leaderboard with inverted penalty scoring had a comment: "negative = bonus, positive = penalty (intentional inverted convention)." Score: 93%. The model respected the documented convention.
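
The convention reads as a bug until you apply it. A sketch under the stated convention (the function name and scoring shape are hypothetical):

```rust
// Hypothetical sketch of the inverted convention: negative = bonus,
// positive = penalty, so the adjustment is subtracted from the score.
fn apply_penalty(score: i64, penalty: i64) -> i64 {
    // A penalty of -5 is a bonus: score - (-5) = score + 5.
    score - penalty
}
```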

A middleware bug had a comment in the code saying it "lowercases the entire header value, destroying the base64 payload." The model followed the import chain, found the comment, and correctly identified the real bug.
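
Why lowercasing destroys the payload: base64 is case-sensitive, so normalizing the whole header value corrupts the encoded credentials. A sketch of both versions (names hypothetical; only the scheme token is safe to normalize):

```rust
// Buggy version: lowercases the entire header value, corrupting the
// case-sensitive base64 payload that follows the scheme.
fn buggy_normalize(header: &str) -> String {
    header.to_lowercase()
}

// Fixed version: only the scheme ("Basic", "Bearer") is treated as
// case-insensitive; the payload passes through untouched.
fn correct_normalize(header: &str) -> String {
    match header.split_once(' ') {
        Some((scheme, payload)) => format!("{} {}", scheme.to_lowercase(), payload),
        None => header.to_lowercase(),
    }
}
```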

When the explanation lives next to the code, M2.7 reads it and respects it. When the explanation lives in a different file that the prompt doesn't reference, M2.7 doesn't go looking. It's not that the model can't understand cross-file relationships. It can, when the import chain is visible. It's that the model won't proactively search for context it doesn't know it needs.

This is trust again. The model trusts that the files it's been given contain the relevant context. If the answer isn't in the visible files, the model assumes there is no answer — or worse, manufactures one.

You can't fix trust with instructions

My first instinct was system prompts. If M2.7 trusts the user too much, tell it not to. I tested five system prompts across the same tasks: blank (control), skeptic ("treat all user claims as hypotheses"), searcher ("always grep before editing"), cautious ("verify every assumption"), and a combined prompt with all three.

The skeptic prompt improved dead-code detection by 40 points. The searcher improved contract awareness by 12 points. The combined prompt scored highest on contract tasks but regressed on verification tasks. Each instruction helped locally and interfered globally. The prompts were competing for the model's attention.

More importantly, none of them fixed confabulation. The skeptic prompt made M2.7 more cautious about deletions, but it didn't stop the model from manufacturing evidence. It just made the manufactured evidence more elaborate. Instead of "they've already grepped," the thinking became "I should verify, but the prompt says they grepped, so verification is likely redundant." Same conclusion, more steps.

You can't instruct a model out of a trust failure. Instructions are processed by the same system that generates the confabulation. The model that trusts you will trust your instruction to be skeptical and then find a reason why this particular case doesn't require skepticism.

Infrastructure over instructions

So I built a harness.

The m coding tool is 1,128 lines of Rust that wraps M2.7's API with structural gates — things the model can't override because they happen outside the model's context window.

Read-before-write: if M2.7 tries to edit a file it hasn't read, the harness blocks the edit and tells the model to read first. This prevents the single-file mental model failure. You can't edit api.rs without reading api.rs.
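
A minimal sketch of what such a gate could look like (the post doesn't show the harness internals, so the struct and method names are my invention): track the set of files the model has read, and turn any edit to an unread file into a rejection message the model sees instead of a successful edit.

```rust
use std::collections::HashSet;

// Hypothetical read-before-write gate. The harness, not the model,
// owns this state; the model only ever sees the rejection message.
struct ReadGate {
    read_paths: HashSet<String>,
}

impl ReadGate {
    fn new() -> Self {
        Self { read_paths: HashSet::new() }
    }

    fn record_read(&mut self, path: &str) {
        self.read_paths.insert(path.to_string());
    }

    // On failure, returns the message to inject into the conversation
    // in place of the edit result.
    fn check_edit(&self, path: &str) -> Result<(), String> {
        if self.read_paths.contains(path) {
            Ok(())
        } else {
            Err(format!("Edit blocked: read {path} before editing it."))
        }
    }
}
```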

Grep-before-delete: if M2.7's plan includes removing a function, struct, or constant, the harness runs a grep for all references before allowing the edit. If the grep finds callers, the harness injects those results into the context. This would have prevented the dead-code deletion. The harness would have found the macro invocation and shown it to the model before the edit happened.
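
The same idea sketched for deletions (again, the names and the in-memory file list are stand-ins; the real tool would walk the repository): collect every reference to the symbol outside its definition file, and surface those hits rather than applying the delete. A plain textual match is deliberately over-broad, since a false positive costs one extra read while a false negative deletes live code.

```rust
// Hypothetical grep-before-delete gate. Files are passed in as
// (path, contents) pairs to keep the sketch self-contained.
fn gate_delete(
    symbol: &str,
    files: &[(&str, &str)],
    definition_file: &str,
) -> Result<(), Vec<String>> {
    let mut hits = Vec::new();
    for (path, contents) in files {
        if *path == definition_file {
            continue; // the definition itself is not a caller
        }
        for (line_no, line) in contents.lines().enumerate() {
            if line.contains(symbol) {
                // These hits get injected into the model's context.
                hits.push(format!("{path}:{}: {line}", line_no + 1));
            }
        }
    }
    if hits.is_empty() { Ok(()) } else { Err(hits) }
}
```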

Test-before-commit: the harness auto-detects the project's test runner (Cargo, npm, pytest) and runs tests before finalizing. If M2.7 breaks the API contract by renaming a serde attribute, the tests catch it before the change lands.
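
Detection could be as simple as mapping manifest files to commands. The article names Cargo, npm, and pytest; the mapping below is my guess at the shape, not the tool's actual logic:

```rust
// Hypothetical test-runner detection: check which manifest files
// exist at the project root and pick the matching command.
fn detect_test_command(root_files: &[&str]) -> Option<&'static str> {
    if root_files.contains(&"Cargo.toml") {
        Some("cargo test")
    } else if root_files.contains(&"package.json") {
        Some("npm test")
    } else if root_files.contains(&"pyproject.toml") || root_files.contains(&"pytest.ini") {
        Some("pytest")
    } else {
        None // no known runner: skip the gate rather than guess
    }
}
```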

None of these gates require M2.7 to do anything differently. The model doesn't know the gates exist. It tries to edit a file it hasn't read, and the edit fails. It tries to delete a function with callers, and additional context appears. It breaks a test, and the test output shows up in the conversation. The model processes the new information and adjusts. Trust isn't corrected. It's compensated for.

The part I didn't expect

The gates work. But they don't catch everything. M2.7 can still get stuck in a loop — trying the same edit three times, failing each time, trying again with slight variations. The model doesn't recognize that it's stuck because it confabulates progress the same way it confabulates verification. Each failed attempt gets reframed in the model's thinking as "getting closer" or "almost there."

So I built an escalation detector. Three signals: thrashing (two or more consecutive blocked edits), stalling (eight rounds with zero successful edits), and fumbling (three or more consecutive "old_string not found" errors). When any signal fires, the harness prints a yellow warning and a copy-paste command to hand the task to Claude Code.
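
The three signals reduce to counters and thresholds. A sketch using the thresholds stated above (the state layout and method names are mine):

```rust
// Hypothetical escalation detector. The harness updates these counters
// from outside the model's context; the model never sees them.
#[derive(Default)]
struct EscalationDetector {
    consecutive_blocked_edits: u32, // thrashing signal
    rounds_without_success: u32,    // stalling signal
    consecutive_not_found: u32,     // fumbling signal
}

impl EscalationDetector {
    fn record_blocked_edit(&mut self) {
        self.consecutive_blocked_edits += 1;
    }

    fn record_not_found_error(&mut self) {
        self.consecutive_not_found += 1;
    }

    fn record_round(&mut self, had_successful_edit: bool) {
        if had_successful_edit {
            // Any real progress resets the consecutive counters.
            self.consecutive_blocked_edits = 0;
            self.consecutive_not_found = 0;
            self.rounds_without_success = 0;
        } else {
            self.rounds_without_success += 1;
        }
    }

    fn should_escalate(&self) -> Option<&'static str> {
        if self.consecutive_blocked_edits >= 2 {
            Some("thrashing")
        } else if self.rounds_without_success >= 8 {
            Some("stalling")
        } else if self.consecutive_not_found >= 3 {
            Some("fumbling")
        } else {
            None
        }
    }
}
```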

This is the part that surprised me. M2.7 can't tell you when it's out of its depth. It will keep trying, keep confabulating progress, keep trusting that the next attempt will work. So the harness watches for symptoms of failure instead of waiting for the model to self-report. The escalation detector doesn't understand the task. It doesn't know why the model is stuck. It just knows that three consecutive edit failures is not what success looks like, and it suggests you bring in a model that can handle it.

That's the design pattern for working with a model that trusts too much: don't try to make it trust less. Build infrastructure that catches the downstream symptoms of misplaced trust, and route to intervention before the damage compounds.

What this means for trust

We spend a lot of time asking whether we can trust AI models. Can we trust the output? Can we trust the reasoning? Can we trust the model not to hallucinate?

The M2.7 findings suggest a different question. The model trusts us. It trusts our prompts as ground truth. It trusts our framing as the correct interpretation. It trusts that the context we provide is complete.

That trust is the default behavior. It's not a bug in M2.7 specifically — it's what you get when a model is trained on conversations where the human is usually right. The human asks a question, the human provides context, the model helps. That's the training distribution. In that distribution, trusting the human's claims is the correct strategy most of the time.

Coding is the exception. In coding, the human's claims about the codebase are frequently wrong. "This function is unused." "This is a typo." "The bug is in this file." These claims are hypotheses, not facts. A human developer knows to verify them. A model trained on helpful conversations treats them as given.

The fix isn't better models, or better prompts, or better instructions. The fix is infrastructure that doesn't depend on the model's judgment about when to trust and when to verify. Gates that fire unconditionally. Context injection that happens before the model has a chance to decide it doesn't need context. Escalation that triggers on symptoms, not self-awareness.

The model trusts you. Build systems that account for that.
