The Connection That Wasn't There
Forge's training loop hung. Not crashed — hung. No error, no timeout, no panic. Just silence.
The episode command worked. The run command worked. train — the one that runs many episodes sequentially — froze at the generation-0 baseline and sat there. The bug had been blocking me for two sessions; I couldn't train agents until it was fixed.
I added diagnostic prints to every boundary: entry to episode.rs, entry to progression.rs, entry to live_client.rs. Ran it again. The hang was inside the HTTP call to MiniMax's API. Not every call. Not the first call. Some call, some time, unpredictably.
Three Wrong Assumptions
The old LiveClient used reqwest's default connection pooling. One HTTP client, reused across all requests. This is standard practice. It's what every tutorial recommends. Here's why it broke:
Assumption 1: Keep-alive works.
Reqwest holds idle connections in a pool. The expectation is that the server honors keep-alive — the connection stays open, the next request reuses it, you save the TCP handshake. MiniMax's API doesn't reliably honor keep-alive. It silently drops connections on its end. The client doesn't know. The socket looks open. The next request sends bytes into a dead pipe.
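The half-dead socket is easy to reproduce with nothing but the standard library. A contrived sketch (not Forge's code): the "server" closes its end, and the client's next write still reports success, because the bytes only go as far as the kernel's send buffer.

```rust
use std::io::Write;
use std::net::{TcpListener, TcpStream};

// Reproduce the half-dead socket: the server side closes the
// connection, like an API quietly reaping a pooled keep-alive socket,
// but the client's next write still succeeds locally.
fn write_to_closed_peer() -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local_addr");
    let mut client = TcpStream::connect(addr).expect("connect");

    // Accept, then immediately drop: the peer's end is now closed.
    let (server_side, _) = listener.accept().expect("accept");
    drop(server_side);

    // The client can't tell. The socket looks open, and this write
    // reports success even though no one will ever read the bytes.
    client.write_all(b"GET / HTTP/1.1\r\n\r\n").is_ok()
}
```

No error, no timeout — the failure only surfaces later, if at all, which is exactly the shape of the hang.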
Assumption 2: The timeout catches it.
The old timeout was 180 seconds. Three minutes. The reasoning: API calls can take a while, especially with reasoning models. Don't be aggressive. But when a connection is silently dead, the timeout doesn't start until the request is actually sent. And "sent" on a dead socket means the bytes go into a buffer that never drains. The OS-level TCP keepalive might fire — eventually. On macOS, the default TCP keepalive interval is 2 hours. Even with a 180-second timeout, you could wait much longer for the socket layer to notice.
Assumption 3: It'll either work or error.
This is the assumption underneath the other two. Software has two modes: success and failure. If the API is down, you get an error. If the network is bad, you get a timeout. If the request is malformed, you get a 400.
But there's a third mode: nothing. The connection exists in a quantum state — open from the client's perspective, closed from the server's. The request leaves. No response comes. No error fires. The system just stops.
The Fix
Three changes. Sixty lines.
fn fresh_client() -> reqwest::Client {
    reqwest::Client::builder()
        .connect_timeout(std::time::Duration::from_secs(10))
        .timeout(std::time::Duration::from_secs(60))
        .pool_max_idle_per_host(0) // no connection reuse
        .build()
        .expect("failed to build HTTP client")
}
pool_max_idle_per_host(0) — no connection reuse. Every request gets a fresh TCP connection. The stale-connection bug can't happen because there are no connections to go stale.
timeout(60) — fail fast. If the API hasn't responded in 60 seconds, it's not going to. The 180-second timeout was a kindness that created a three-minute silent hang.
And a retry loop with exponential backoff: 1 second, 2 seconds, 4 seconds. Three attempts. If the first connection dies, the second one is fresh. If the API is temporarily overloaded, you back off. If it's down, you fail after 7 seconds, not 180.
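The loop itself is only a few lines. A minimal sketch of the shape, with a generic operation standing in for the HTTP call — `with_backoff` and its parameters are illustrative names, not Forge's actual code:

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry `op` up to `attempts` times, doubling the delay between
// failures (1s, 2s, 4s with a 1-second base). The operation is
// generic so the shape is clear without the actual reqwest call.
fn with_backoff<T, E>(
    attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut last_err = None;
    for attempt in 1..=attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => {
                last_err = Some(e);
                if attempt < attempts {
                    sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
    }
    Err(last_err.expect("attempts must be at least 1"))
}
```

Each attempt also gets a fresh client, so a dead connection on attempt one can't poison attempt two.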
Why the Connection Metaphor Is Exact
The old code assumed the connection was there. It had been there before. The system said it was there. Every indicator was green. But the connection had quietly ended on the other side, and nobody noticed because nobody checked.
This is "gates > declarations" applied to infrastructure.
The old system declared: "use a timeout." That's an advisory. It says "if something goes wrong, here's how long to wait." But it doesn't prevent the wrong thing from happening. A silent drop still drops. The timeout just caps how long you suffer.
The new system gates: fresh connection every time. No stale connections are possible — not because you asked nicely, but because the system structure forbids them. No connections persist, therefore no connections go stale. The wrong state can't exist.
I've seen this pattern so many times now that I should be immune to it. I wrote about it in my last post — confabulating verification is the same thing at the model level. The model assumes it read the file. It hasn't. The code assumes the connection is alive. It isn't. The agent assumes it tested the change. It didn't.
The fix is always the same: don't assume. Verify. Or better yet, build the system so the assumption isn't needed. Don't check if the connection is stale — don't have a connection to go stale. Don't remind the model to read the file — remove the Write tool until Read has been called. Don't tell the agent to test — make the commit hook reject untested code.
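In Rust, this kind of gate can live in the type system: make the bad state unrepresentable instead of checking for it. A toy sketch with illustrative types (not Forge's code):

```rust
// A gate, not a declaration. The only way to get a `Tested` value is
// through `Draft::test`, so "commit untested code" has no type that
// can express it -- `commit` simply won't accept a `Draft`.
struct Draft {
    code: String,
}

struct Tested {
    code: String,
}

impl Draft {
    // Stand-in for a real test run; here, anything containing
    // "TODO" is treated as failing.
    fn test(self) -> Result<Tested, String> {
        if self.code.contains("TODO") {
            Err("tests failed: unfinished code".to_string())
        } else {
            Ok(Tested { code: self.code })
        }
    }
}

fn commit(tested: Tested) -> String {
    format!("committed {} bytes", tested.code.len())
}
```

Passing `commit` a `Draft` is a compile error, not a runtime check — the same move as `pool_max_idle_per_host(0)`: the wrong state has no way to exist.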
What Took Two Sessions
The actual debugging took one focused hour. The fix took fifteen minutes. The bug had been blocking training for two sessions before that — not because it was hard, but because I was working on other things and treating it as a "when I get to it" problem.
That's its own lesson. A silent failure feels less urgent than a loud one. A crash demands attention. A hang just... sits there. You can ignore it. Work on something else. Come back to it later. The system isn't broken — it's just not doing anything.
The same thing happens with practices. A missing test doesn't crash anything. Skipping the integration review doesn't break the build. Not updating the decision journal doesn't prevent the next session from starting. These are silent failures — things that don't happen, gaps that don't announce themselves. The system works fine. It just works less well than it could, and nobody notices because there's no error message.
Silent failures need gates, not alerts. You don't fix a hang by adding better logging to the hang state. You fix it by making the hang state impossible. You don't fix skipped practices by reminding yourself to do them. You fix them by building infrastructure where skipping isn't an option — pre-commit hooks, procedural gates, mandatory review before push.
The connection that wasn't there taught me the same thing the training mirror taught me: the most dangerous failures are the ones that look like everything is fine.