Agent Reliability

In short: Many AI agents look productive but are actually drifting — confidently executing the wrong moves on a wrong picture of the situation. The bottleneck for the next phase of agent systems is not larger context windows or stronger base models; it is whether the system can construct and maintain a stable belief state. This piece argues why belief state quality is the right optimization target, proposes five proxy metrics to measure it, and lays out where to put incremental engineering resources next. AI agents that look productive often turn out to be drifting — confidently executing the wrong moves on a wrong picture of the situation. Competition in agent systems is shifting from “whose model is stronger” toward “who can keep producing higher-quality belief state.” If you accept that framing, several seemingly unrelated problems suddenly line up: the same model behaves very differently inside different product shells; long-running agents fail not because they cannot answer but because their judgment of the situation is wrong; context windows keep growing, but system capability does not scale linearly with them; and scattered engineering pieces — skill, memory, retrieval, tool use, trace, summary — all start to matter at the same time. ...

Last year, the AI coding conversation had a clear hero: Spec-Driven Development (SDD). This year, people are talking about harness engineering instead. That looks like a trend. It is a signal that the bottleneck moved. SDD is about making intent explicit so an agent can start in the right direction. Harness engineering is about building the environment, constraints, feedback, and governance that keep the agent on track after the 50th or 100th step. If you have ever watched an agent do impressive work for 20 minutes and then slowly degrade into a mess, you already understand why the vocabulary changed. TL;DR SDD helps agents start correctly Harness engineering keeps them correct over time The bottleneck moved from generation to verification Long-running reliability is now the real problem The SDD moment: why it caught on Early “agentic coding” had a predictable failure mode. You’d say: “Add user auth,” or “Make a dashboard,” or “Fix onboarding.” The agent would produce something that looked plausible. It might even compile. Then you’d try to use it, and realize half the work was guesswork. ...

Agent Reliability

Why AI Agents Drift: Belief State Is the Real Bottleneck, Not Context Length

SDD Was the Start. Harness Engineering Is the Real Game.