Anthropic published a piece on harness design for long-running applications that crystallises something I've been circling for months. The thesis: the scaffolding around the model matters more than the model itself. Not a little more. Structurally more.
Their architecture borrows from GANs. A planner expands a brief prompt into a full spec. A generator builds features iteratively. An evaluator, running Playwright against the live application, grades the output against a rubric. The generator never sees its own score directly. The evaluator never generates code. Separation of concerns, applied to cognition.
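The loop is easy to sketch. Here's a minimal skeleton, assuming hypothetical `plan`, `generate`, and `evaluate` callables (the names and the `Verdict` shape are mine, not Anthropic's): the orchestrator routes rubric failures back to the generator as work items, so neither role ever crosses into the other's.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    failures: list[str] = field(default_factory=list)  # failed rubric items, phrased as tasks

def run_harness(
    brief: str,
    plan: Callable[[str], str],                 # planner: brief -> full spec
    generate: Callable[[str, list[str]], str],  # generator: spec + open tasks -> build
    evaluate: Callable[[str, str], Verdict],    # evaluator: build + spec -> verdict
    max_rounds: int = 5,
) -> str:
    spec = plan(brief)
    feedback: list[str] = []
    build = ""
    for _ in range(max_rounds):
        build = generate(spec, feedback)
        verdict = evaluate(build, spec)  # e.g. drives Playwright against the live app
        if verdict.passed:
            break
        # Role separation: the generator receives failures as work items,
        # never the score itself, and the evaluator never writes code.
        feedback = verdict.failures
    return build
```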
The evaluator is the interesting part. Out of the box, Claude's QA capability is described as "poor." When asked to assess its own work, it praises the result confidently, even when the quality is obviously mediocre. A NeurIPS 2024 paper puts numbers on this: GPT-4 recognises its own output with 73.5% accuracy, and self-recognition correlates linearly with self-preference. The stronger the model, the more pronounced the bias, even when its own output is the one in error. The better the model gets, the harder it is to make it honest about its own mistakes.
So you separate the roles. Tuning a standalone evaluator to be skeptical is, as Anthropic puts it, "far more tractable than making a generator critical of its own output." I've been applying this to my own workflows. A Topaz image enhancement pipeline now runs an image analysis agent as a quality gate before distributing files. Blog deployments get a post-deploy evaluator that fetches the live page and verifies OG tags, image rendering, internal links. The coordination overhead is real, but the alternative is trusting the thing that made the mistake to notice the mistake.
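For the blog case, the quality gate is just a script. A minimal sketch of a post-deploy evaluator, where `check_deploy` and its specific checks are my own stand-ins rather than anything from the article: fetch the live page, confirm the OG tags exist, and confirm every image and internal link resolves.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def check_deploy(url: str) -> list[str]:
    """Fetch a live page; flag missing OG tags, broken images, dead internal links."""
    problems: list[str] = []
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # OG tags that social previews depend on
    for prop in ("og:title", "og:description", "og:image"):
        if soup.find("meta", property=prop) is None:
            problems.append(f"missing {prop}")

    # Every image and same-host link should return a non-error status
    host = urlparse(url).netloc
    targets = [urljoin(url, img["src"]) for img in soup.find_all("img", src=True)]
    targets += [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True)
        if urlparse(urljoin(url, a["href"])).netloc == host
    ]
    for target in targets:
        if requests.head(target, timeout=10, allow_redirects=True).status_code >= 400:
            problems.append(f"broken: {target}")
    return problems

# Usage: fail the pipeline if the evaluator finds anything.
# assert not check_deploy("https://example.com/post/"), "post-deploy checks failed"
```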
Context management is the other half. Long sessions degrade. Not because you run out of tokens, but because reasoning quality rots as history accumulates. Factory.ai calls this context rot, distinguishing it from context exhaustion. Anthropic's solution is context resets: clear the window entirely, hand off state through structured files, let the next agent start fresh. Compaction, summarising in place, preserves more history but introduces compression artifacts in the reasoning itself. Resets are blunter but cleaner.
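The handoff file can be as simple as structured JSON. A sketch under assumed names (`handoff.json` and its fields are illustrative, not a documented format): the outgoing agent writes state, the window is cleared, and the fresh agent reads state instead of inheriting rotted history.

```python
import json
from pathlib import Path

STATE = Path("handoff.json")  # hypothetical handoff file

def write_handoff(done: list[str], open_tasks: list[str], decisions: dict[str, str]) -> None:
    """Persist everything the next agent needs, so the window can be cleared."""
    STATE.write_text(json.dumps({
        "completed": done,         # what's finished, so work isn't repeated
        "open_tasks": open_tasks,  # what's next, in priority order
        "decisions": decisions,    # choices already made, with rationale
    }, indent=2))

def read_handoff() -> dict:
    """First action of a fresh session: load state instead of replaying history."""
    return json.loads(STATE.read_text())
```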
The most honest admission in the article is about cost. A solo agent built a retro game in twenty minutes for nine dollars. Broken mechanics, poor UI. The full harness took six hours and two hundred dollars, but produced something that actually worked. That's the real tradeoff: not whether harnesses are better, but whether the quality delta justifies 20x the spend. For a throwaway prototype, probably not. For anything facing users, obviously yes.
Every component in a harness encodes an assumption about what the model can't do alone. When Anthropic moved from Sonnet to Opus, they removed their sprint decomposition system entirely because Opus sustained coherence over longer sessions. The subagent patterns that seemed essential six months ago might be overhead tomorrow. The discipline isn't building the perfect harness. It's knowing which pieces to remove.
Sources:

- Harness Design for Long-Running Apps — Anthropic Engineering
- LLM Evaluators Recognize and Favor Their Own Generations — NeurIPS 2024
- The Context Window Problem — Factory.ai