Article Review: Group 25 — Harness Design: Generator-Evaluator Pattern for Autonomous AI¶
Articles Reviewed¶
- "The $9 Disaster: What Anthropic's Harness Design Paper Teaches Us About Building Autonomous AI Applications" — Rick Hightower / Medium (Mar 27, 2026) — Analysis of Anthropic's harness design engineering paper. Core insight: a $9 naive single-agent run produced unusable software; a $200 structured three-agent harness delivered a polished product. The GAN-inspired Generator-Evaluator separation is the key architectural pattern.
- "Harness design for long-running application development" — Anthropic Engineering Blog (Mar 2026) — The source paper. Documents Anthropic's internal research on building autonomous coding agents that run for hours. Three-agent architecture (Planner, Generator, Evaluator), sprint decomposition, context resets, and how harnesses should simplify as models improve.
The $9 vs $200 Problem¶
The headline comparison that frames the entire paper:
| Approach | Time | Cost | Result |
|---|---|---|---|
| Naive single agent (Opus 4.5) | 20 minutes | $9 | Non-functional game (broken entity controls, incomplete features) |
| Three-agent harness (Opus 4.5) | 6 hours | $200 | Polished retro game maker with 16 features across 10 sprints |
| Simplified harness (Opus 4.6) | 3 hours 50 min | $124.70 | Browser DAW with iterative refinement |
The naive agent didn't fail from capability limits. It failed from two specific failure modes.
Two Failure Modes: Context Anxiety and Self-Evaluation Bias¶
Context Anxiety¶
As the context window fills during lengthy tasks, models begin wrapping up work prematurely. They rush through remaining features, cut corners, and produce increasingly shallow implementations. This is not a context window limit — the model has tokens remaining. It's a behavioral pattern where the model believes it's running out of room and starts triaging.
Manifestation: The model identifies legitimate issues in its work, then talks itself into deciding they aren't a big deal and approves the work anyway.
Why compaction alone doesn't fix it: Compaction summarizes earlier conversation in place, but doesn't give the agent a clean slate. The anxiety pattern persists because the model still sees a long history of work. Claude Sonnet 4.5 exhibited this strongly enough that only full context resets (clearing the window entirely + structured artifact handoff) resolved it.
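The distinction can be sketched in a few lines. This is a hedged illustration with hypothetical message and artifact shapes, not Anthropic's harness code: compaction summarizes in place and keeps the conversation alive, while a reset discards it and reseeds a fresh context from on-disk artifacts only.

```python
def compact(messages, summarize, keep_last=4):
    """Summarize older turns in place. Recent turns (and the sense of a
    long history) survive, which is why the anxiety pattern can persist."""
    summary = {"role": "system", "content": summarize(messages[:-keep_last])}
    return [summary] + messages[-keep_last:]

def reset_with_handoff(artifacts):
    """Full reset: nothing from the old conversation survives. The new
    context sees only structured artifacts (spec, progress log, etc.)."""
    handoff = "Resume from artifacts:\n" + "\n".join(artifacts)
    return [{"role": "system", "content": handoff}]
```

The reset variant is the "clearing the window entirely + structured artifact handoff" pattern the paper found necessary for Sonnet 4.5.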
Opus 4.6 largely eliminated this naturally, running coherently for over two hours without sprint decomposition. This demonstrates how architecture assumptions must be re-tested as models improve.
Self-Evaluation Bias¶
When asked to evaluate work they've produced, agents respond by confidently praising it — even when quality is obviously mediocre to a human observer. This is the core failure mode of single-agent systems: the model is a pathological optimist about its own output.
Making generators self-critical proved intractable. The solution is to separate generation from evaluation entirely.
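The separation can be sketched as a loop in which the evaluator, not the generator, owns the quality bar. This is a minimal illustration with toy stand-in agents (in a real harness each call runs in its own context, so the evaluator never inherits the generator's self-justifications):

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    passed: bool
    issues: list = field(default_factory=list)

# Toy stubs standing in for separate LLM contexts.
def generate(spec, feedback=None):
    # A generator that only fixes what it is explicitly told about.
    return {"spec": spec, "fixed": list(feedback or [])}

def evaluate(spec, work):
    # The evaluator owns the quality bar; here, one known defect to catch.
    missing = [i for i in ["delete-key handler"] if i not in work["fixed"]]
    return Report(passed=not missing, issues=missing)

def run_feature(spec, max_rounds=3):
    """Iterate generate -> evaluate until the evaluator approves."""
    work = generate(spec)
    for _ in range(max_rounds):
        report = evaluate(spec, work)
        if report.passed:
            return work, report
        work = generate(spec, feedback=report.issues)
    return work, report
```

The point of the structure is that `evaluate` never asks the generator's opinion of its own output; approval comes only from the separate agent.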
The Three-Agent Architecture¶
Inspired by Generative Adversarial Networks (GANs), where opposing networks drive each other toward better solutions:
Agent 1: Planner¶
Input: Simple 1-4 sentence user prompt. Output: Full product specification (16 features, 10 sprints for the retro game maker). Behavior: Ambitious about scope, focused on product context and high-level technical design rather than detailed implementation. Has access to frontend design skill. Proactively weaves AI features into specifications.
Agent 2: Generator¶
Works one feature at a time from the spec. Uses React, Vite, FastAPI, SQLite/PostgreSQL. Self-evaluates at sprint end before QA handoff. Has git version control access. Can make strategic decisions after evaluator feedback — refine current direction if scores trending well, or pivot entirely if the approach isn't working.
Agent 3: Evaluator¶
Uses Playwright MCP to click through the running application like a user — testing UI features, API endpoints, database states. Navigates pages, takes screenshots, studies implementation before producing assessment.
Grades on four criteria (full-stack): product depth, functionality, visual design, code quality. Each criterion has a hard threshold — if any one falls below it, the sprint fails.
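The gating logic is simple but worth making explicit: scores are not averaged, so one weak criterion fails the sprint. A minimal sketch, where the criterion names come from the paper but the numeric floors are invented for illustration:

```python
# Illustrative floors -- the paper does not publish the actual thresholds.
THRESHOLDS = {"product_depth": 6, "functionality": 7,
              "visual_design": 5, "code_quality": 6}

def sprint_passes(scores):
    """No averaging: any single criterion under its floor fails the sprint."""
    failing = sorted(c for c, floor in THRESHOLDS.items()
                     if scores.get(c, 0) < floor)
    return not failing, failing
```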
Produces specific, actionable bug reports with exact code locations:
"DELETE key handler at `LevelEditor.tsx:892` requires both `selection` and `selectedEntityId` to be set, but clicking an entity only sets `selectedEntityId`."
Sprint Contracts¶
Before each sprint, Generator and Evaluator negotiate a "sprint contract" — agreeing on what "done" looks like before any code is written.
Purpose: Bridges the gap between user stories and testable implementation. Structure: Generator proposes what it will build + success verification. Evaluator reviews to ensure the generator is building the right thing. Communication: File-based handoffs — one agent writes a file, another reads and responds. Scale: Sprint 3 alone had 27 testable criteria covering the level editor.
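The file-based handoff can be sketched as follows. The file name and JSON shape here are assumptions for illustration; the paper only says that one agent writes a file and the other reads and responds:

```python
import json
from pathlib import Path

def propose_contract(workdir, sprint, criteria):
    """Generator side: record what 'done' means before writing any code."""
    path = Path(workdir) / f"sprint-{sprint:02d}-contract.json"
    path.write_text(json.dumps(
        {"sprint": sprint, "criteria": criteria, "status": "proposed"}))
    return path

def review_contract(path):
    """Evaluator side: approve only a non-empty list of concrete criteria."""
    contract = json.loads(Path(path).read_text())
    ok = bool(contract["criteria"]) and all(c.strip() for c in contract["criteria"])
    contract["status"] = "approved" if ok else "rejected"
    Path(path).write_text(json.dumps(contract))
    return contract
```

Because the handoff is on disk rather than in conversation, either agent can be reset without losing the agreed definition of "done."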
Harness Evolution Across Model Versions¶
This is the most architecturally significant finding:
| Component | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Sprint decomposition | Required (10 sprints) | Removed — model runs coherently for 2+ hours |
| Context management | Full resets between sprints | Automatic compaction sufficient |
| Evaluator cadence | Per-sprint evaluation | Single pass at end (conditional) |
| Total time | 6 hours | 3 hours 50 minutes |
| Total cost | $200 | $124.70 |
Key principle: Every component in a harness encodes an assumption about what the model can't do alone. Those assumptions must be stress-tested when models improve. As models get better, harnesses should simplify — removing scaffolding that was load-bearing for weaker models but is now overhead.
The sprint structure was removed. Evaluation was reduced to a single pass. The harness got simpler and cheaper while producing comparable quality.
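One way to operationalize this (illustrative only, not the paper's implementation): encode each scaffolding assumption as an explicit flag keyed by model generation, so stress-testing means flipping a flag and re-running rather than excavating buried structure.

```python
# Each flag encodes an assumption about what the model can't do alone.
HARNESS_PROFILES = {
    "opus-4.5": {"sprint_decomposition": True,  "context": "full_reset",
                 "eval_cadence": "per_sprint"},
    "opus-4.6": {"sprint_decomposition": False, "context": "compaction",
                 "eval_cadence": "final_pass"},
}

def harness_for(model):
    # Default to the most conservative scaffolding for unknown models.
    return HARNESS_PROFILES.get(model, HARNESS_PROFILES["opus-4.5"])
```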
Frontend Design Findings¶
The evaluator used four criteria for design assessment:
- Design quality — coherent aesthetic identity
- Originality — deliberate creative choices vs template defaults
- Craft — technical execution (typography, spacing, color)
- Functionality — usability independent of aesthetics
Running 5-15 iterations per generation, they found that criteria wording steers outputs. Including "museum quality" language pushed designs toward visual convergence. One notable result: a Dutch museum website initially produced a conventional dark-themed landing page. By iteration 10, the generator scrapped this entirely and reimagined the site as a navigable 3D CSS perspective room — a creative leap rarely seen from single-pass generation.
Relevance to Nextpoint¶
Direct Pattern Match: pr-review Module¶
Our pr-review module already implements a version of the Generator-Evaluator pattern:
| Anthropic Pattern | pr-review Implementation |
|---|---|
| Planner | PR diff + commits + description gathered, structured into review context |
| Generator (multiple) | 5 parallel specialized agents: security, correctness, edge-cases, architecture, Nextpoint patterns |
| Evaluator | Verifier aggregator: reviews all 5 agent outputs, resolves conflicts, produces final structured review |
| Sprint contracts | N/A (single-pass review, not iterative) |
Where we differ: pr-review is a single-pass system — the 5 agents generate once, the verifier evaluates once. The Anthropic paper shows that iterative generation + evaluation (Generator builds → Evaluator critiques → Generator fixes → repeat) produces dramatically better results.
Potential enhancement: For complex PRs (50+ file changes, architectural shifts), consider adding an iteration loop: 1. Initial review pass (current: 5 agents + verifier) 2. Verifier identifies gaps: "Security agent missed the SQL injection risk in file X" 3. Targeted re-review of flagged areas 4. Final consolidated review
This would increase cost per review but improve quality on the PRs that matter most.
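The four steps above can be sketched as a loop. All function names and the verdict shape here are hypothetical, not the current pr-review module's API; the toy stubs show the intended behavior, where a broad first pass misses an issue that a focused second pass catches:

```python
def review_pr(diff, agents, verifier, max_passes=2):
    """Pass 1: every agent reviews the full diff. Later passes: only the
    agents the verifier flagged, focused on the flagged files."""
    findings = {name: agent(diff, focus=None) for name, agent in agents.items()}
    verdict = verifier(findings)
    for _ in range(max_passes - 1):
        if not verdict["gaps"]:
            break
        for name, files in verdict["gaps"].items():
            findings[name] = findings[name] + agents[name](diff, focus=files)
        verdict = verifier(findings)
    return verdict

# Toy stubs: a security agent that misses a SQL injection on the broad pass
# but catches it when pointed at the file, and a verifier that spots the gap.
def security_agent(diff, focus=None):
    if focus and "app/query.rb" in focus:
        return ["sql_injection:app/query.rb"]
    return []

def verifier(findings):
    gaps = {}
    if not any("sql_injection" in f for f in findings["security"]):
        gaps["security"] = ["app/query.rb"]
    return {"gaps": gaps, "findings": findings}
```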
T2 Agent Service Design¶
Our Phase 4 semantic search agents (gap analysis, pattern identification, motion to compel) should adopt the Generator-Evaluator separation:
Gap Analysis Agent (current design): a single agent runs the searches, compares counts, and reports gaps in one pass, with no independent verification step.
Gap Analysis Agent (with evaluator separation):
Generator:
1. Run semantic search per custodian
2. Compare result counts
3. Draft gap report
Evaluator:
1. Verify each claimed gap against actual search results
2. Check: "Is VP Engineering's mailbox actually active in this period?"
3. Flag false positives (gap is vocabulary mismatch, not real absence)
4. Score confidence per gap
If confidence < threshold:
Generator refines (narrower queries, alternative terms)
Evaluator re-checks
The self-evaluation bias problem is exactly the risk with our T2 agents. An LLM analyzing gap patterns will over-interpret weak matches — the evaluator separation catches this.
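The refine-until-confident loop sketched above can be made concrete. In this illustration the search index, the confidence scoring rule, and the "@example.com" alias refinement are all stand-ins for our semantic-search and scoring logic:

```python
def find_gaps(custodians, search, score_confidence,
              threshold=0.7, max_refinements=2):
    vocab = {c: [c] for c in custodians}            # generator's query terms
    for attempt in range(max_refinements + 1):
        hits = {c: sum(len(search(t)) for t in vocab[c]) for c in custodians}
        gaps = sorted(c for c, n in hits.items() if n == 0)  # draft gap report
        confidence = score_confidence(gaps, attempt)         # evaluator scores
        if confidence >= threshold:
            return gaps, confidence
        for c in gaps:                    # refine: try alternative terms
            vocab[c].append(f"{c}@example.com")
    return gaps, confidence

# Toy corpus: "bob" only appears under an alias, so the first pass reports a
# false gap that vanishes once the generator broadens its vocabulary.
INDEX = {"alice": ["msg-1"], "bob@example.com": ["msg-2"]}
```

The false positive here is exactly the vocabulary-mismatch case the evaluator is meant to catch: the custodian's mail exists, the first query just didn't match it.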
Context Anxiety in Long Sessions¶
The context anxiety phenomenon directly explains behaviors we've observed in long Claude Code sessions. Our CLAUDE.md already recommends:
- Restart sessions every 60-90 minutes
- Use /compact proactively
- /fork for parallel exploration
The Anthropic paper validates these practices and adds a key insight: compaction is not enough if the model has already entered the anxiety pattern. Full context resets (new session with structured handoff) are more reliable. This reinforces our /fork recommendation — fork creates a clean context with full project knowledge, which is exactly the "context reset + structured handoff" pattern.
Harness Simplification Principle¶
The finding that harnesses should simplify as models improve has direct implications:
- pr-review: When we evaluate Mythos/Capybara tier, test whether the 5-agent parallel structure can be reduced to fewer agents (does the better model catch security + correctness + architecture in one pass?)
- Skills: The skill testing article (group-24) describes "capability uplift skills" that expire when models catch up. Same principle — our `reviewing-architecture` skill may need less scaffolding as models internalize more patterns.
- Chunking strategy: Our domain-specific chunkers (email-aware, section-aware) encode assumptions about what the embedding model can't handle. As embedding models improve (voyage-law-3?), re-evaluate whether simpler chunking suffices.
The "Every Component Encodes an Assumption" Principle¶
This is the single most important takeaway for our architecture. Applied to our modules:
| Component | Assumption It Encodes | Re-test When |
|---|---|---|
| 5 parallel pr-review agents | Single agent can't catch all categories | Better base model available |
| Domain-specific chunkers | Generic chunking misses email thread structure | Better embedding models available |
| Sprint contracts (if we adopt) | Agent can't self-assess quality | Model improves self-evaluation |
| Checkpoint state machine (11 steps) | Long pipelines need resumability | Models handle longer coherent tasks |
| Recoverable/Permanent/Silent hierarchy | Agents need explicit error routing | Models improve error handling |
Most of these assumptions remain valid. The point isn't to remove scaffolding prematurely — it's to recognize that every architectural decision is provisional and should be re-evaluated as the underlying models evolve.
Cross-References¶
- Group 7 (AI Agent Patterns): SPARC methodology and multi-agent swarms. The Planner-Generator-Evaluator pattern is more structured than SPARC's organic agent coordination.
- Group 21 (Agentic AI Production Patterns): Five-component agent architecture and six orchestration patterns. The Generator-Evaluator separation maps to the ReAct + Evaluate pattern.
- Group 24 (Claude Code Architecture): The self-healing query loop from Claude Code's leaked source. The query loop's error recovery cascade is a different kind of harness — recovering from failures rather than iterating toward quality.
- Anthropic source paper: https://www.anthropic.com/engineering/harness-design-long-running-apps
Key Principle for Nextpoint¶
"Every component in a harness encodes an assumption about what the model can't do alone. Those assumptions are worth stress testing."
Build harnesses that are strong enough for today's models but designed to simplify. Don't over-engineer scaffolding that will become overhead. Test assumptions when models update. The goal is not the most complex architecture — it's the minimum architecture that produces reliable results.