
Article Review: Group 25 — Harness Design: Generator-Evaluator Pattern for Autonomous AI

Articles Reviewed

  1. "The $9 Disaster: What Anthropic's Harness Design Paper Teaches Us About Building Autonomous AI Applications" — Rick Hightower / Medium (Mar 27, 2026) — Analysis of Anthropic's harness design engineering paper. Core insight: a $9 naive single-agent run produced unusable software; a $200 structured three-agent harness delivered a polished product. The GAN-inspired Generator-Evaluator separation is the key architectural pattern.

  2. "Harness design for long-running application development" — Anthropic Engineering Blog (Mar 2026) — The source paper. Documents Anthropic's internal research on building autonomous coding agents that run for hours. Three-agent architecture (Planner, Generator, Evaluator), sprint decomposition, context resets, and how harnesses should simplify as models improve.


The $9 vs $200 Problem

The headline comparison that frames the entire paper:

| Approach | Time | Cost | Result |
| --- | --- | --- | --- |
| Naive single agent (Opus 4.5) | 20 minutes | $9 | Non-functional game (broken entity controls, incomplete features) |
| Three-agent harness (Opus 4.5) | 6 hours | $200 | Polished retro game maker with 16 features across 10 sprints |
| Simplified harness (Opus 4.6) | 3 hours 50 min | $124.70 | Browser DAW with iterative refinement |

The naive agent didn't fail from capability limits. It failed from two specific failure modes.

Two Failure Modes: Context Anxiety and Self-Evaluation Bias

Context Anxiety

As the context window fills during lengthy tasks, models begin wrapping up work prematurely. They rush through remaining features, cut corners, and produce increasingly shallow implementations. This is not a context window limit — the model has tokens remaining. It's a behavioral pattern where the model believes it's running out of room and starts triaging.

Manifestation: The model identifies legitimate issues in its work, then talks itself into deciding they aren't a big deal and approves the work anyway.

Why compaction alone doesn't fix it: Compaction summarizes earlier conversation in place, but doesn't give the agent a clean slate. The anxiety pattern persists because the model still sees a long history of work. Claude Sonnet 4.5 exhibited this strongly enough that only full context resets (clearing the window entirely + structured artifact handoff) resolved it.

Opus 4.6 largely eliminated this behavior on its own, running coherently for over two hours without sprint decomposition. This is a concrete example of why architectural assumptions must be re-tested as models improve.
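The reset-plus-handoff pattern is simple to express in code. Below is a minimal sketch assuming a hypothetical JSON handoff artifact; the file path, schema, and helper names are illustrative, not details from the paper.

```python
# Minimal sketch of a full context reset with a structured artifact handoff.
# The artifact path, schema, and helper names are illustrative assumptions.
import json
from pathlib import Path

HANDOFF = Path("artifacts/sprint_handoff.json")

def finish_sprint(sprint_no: int, summary: str, open_issues: list[str]) -> None:
    """Generator writes a compact, structured summary before its context is cleared."""
    HANDOFF.parent.mkdir(parents=True, exist_ok=True)
    HANDOFF.write_text(json.dumps({
        "sprint": sprint_no,
        "summary": summary,          # what was built and where it lives
        "open_issues": open_issues,  # carried forward instead of raw chat history
    }, indent=2))

def start_sprint(sprint_no: int) -> str:
    """The next sprint starts with an empty context window plus the handoff artifact."""
    handoff = json.loads(HANDOFF.read_text())
    return (
        f"You are starting sprint {sprint_no}. "
        f"Previous sprint summary: {handoff['summary']} "
        f"Open issues to address: {handoff['open_issues']}"
    )
```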

Self-Evaluation Bias

When asked to evaluate work they've produced, agents respond by confidently praising it — even when quality is obviously mediocre to a human observer. This is the core failure mode of single-agent systems: the model is a pathological optimist about its own output.

Making generators self-critical proved intractable. The solution is to separate generation from evaluation entirely.

The Three-Agent Architecture

Inspired by Generative Adversarial Networks (GANs), where opposing networks drive each other toward better solutions:

Agent 1: Planner

  • Input: Simple 1-4 sentence user prompt.
  • Output: Full product specification (16 features and 10 sprints for the retro game maker).
  • Behavior: Ambitious about scope; focused on product context and high-level technical design rather than detailed implementation. Has access to a frontend design skill and proactively weaves AI features into specifications.

Agent 2: Generator

Works one feature at a time from the spec. Uses React, Vite, FastAPI, SQLite/PostgreSQL. Self-evaluates at sprint end before QA handoff. Has git version control access. Can make strategic decisions after evaluator feedback — refine current direction if scores trending well, or pivot entirely if the approach isn't working.

Agent 3: Evaluator

Uses Playwright MCP to click through the running application like a user — testing UI features, API endpoints, database states. Navigates pages, takes screenshots, studies implementation before producing assessment.

Grades on four criteria (full-stack): product depth, functionality, visual design, code quality. Each criterion has a hard threshold — if any one falls below it, the sprint fails.

Produces specific, actionable bug reports with exact code locations:

"DELETE key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity only sets selectedEntityId."

Sprint Contracts

Before each sprint, Generator and Evaluator negotiate a "sprint contract" — agreeing on what "done" looks like before any code is written.

  • Purpose: Bridges the gap between user stories and testable implementation.
  • Structure: Generator proposes what it will build plus how success will be verified; Evaluator reviews the proposal to ensure the generator is building the right thing.
  • Communication: File-based handoffs — one agent writes a file, the other reads and responds.
  • Scale: Sprint 3 alone had 27 testable criteria covering the level editor.
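As a rough illustration of the file-based handoff, a sprint-contract negotiation might look like the following; the file names, JSON fields, and function signatures are assumptions rather than details from the paper.

```python
# Rough illustration of a file-based sprint-contract handoff. File names,
# JSON fields, and function signatures are assumptions, not details from the paper.
import json
from pathlib import Path

PROPOSAL = Path("handoff/sprint_contract_proposal.json")
REVIEW = Path("handoff/sprint_contract_review.json")

def propose_contract(features: list[str], criteria: list[str]) -> None:
    """Generator: state what it will build and how success will be verified."""
    PROPOSAL.parent.mkdir(parents=True, exist_ok=True)
    PROPOSAL.write_text(json.dumps({
        "features": features,
        "testable_criteria": criteria,   # e.g. 27 criteria for the level editor sprint
    }, indent=2))

def review_contract(requested_changes: list[str]) -> None:
    """Evaluator: read the proposal and respond in a second file before any code is written."""
    contract = json.loads(PROPOSAL.read_text())
    REVIEW.write_text(json.dumps({
        "criteria_count": len(contract["testable_criteria"]),
        "approved": not requested_changes,
        "requested_changes": requested_changes,  # produced by the Evaluator agent's analysis
    }, indent=2))
```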

Harness Evolution Across Model Versions

This is the most architecturally significant finding:

| Component | Opus 4.5 | Opus 4.6 |
| --- | --- | --- |
| Sprint decomposition | Required (10 sprints) | Removed — model runs coherently for 2+ hours |
| Context management | Full resets between sprints | Automatic compaction sufficient |
| Evaluator cadence | Per-sprint evaluation | Single pass at end (conditional) |
| Total time | 6 hours | 3 hours 50 minutes |
| Total cost | $200 | $124.70 |

Key principle: Every component in a harness encodes an assumption about what the model can't do alone. Those assumptions must be stress-tested when models improve. As models get better, harnesses should simplify — removing scaffolding that was load-bearing for weaker models but is now overhead.

The sprint structure was removed. Evaluation was reduced to a single pass. The harness got simpler and cheaper while producing comparable quality.
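One way to keep that simplification explicit is to treat each piece of scaffolding as a configuration switch rather than a hard-wired stage. The sketch below restates the table above as configuration; the field names are illustrative, not from the paper.

```python
# Sketch of expressing the harness-evolution table as configuration, so
# scaffolding can be switched off as the model improves. Field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessConfig:
    sprint_decomposition: bool   # break work into sprints with contracts
    full_context_resets: bool    # clear context between sprints vs. rely on compaction
    evaluator_cadence: str       # "per_sprint" or "final_pass"

OPUS_4_5 = HarnessConfig(sprint_decomposition=True, full_context_resets=True, evaluator_cadence="per_sprint")
OPUS_4_6 = HarnessConfig(sprint_decomposition=False, full_context_resets=False, evaluator_cadence="final_pass")
```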

Frontend Design Findings

The evaluator used four criteria for design assessment:

  1. Design quality — coherent aesthetic identity
  2. Originality — deliberate creative choices vs template defaults
  3. Craft — technical execution (typography, spacing, color)
  4. Functionality — usability independent of aesthetics

Running 5-15 iterations per generation, they found that criteria wording steers outputs. Including "museum quality" language pushed designs toward visual convergence. One notable result: a Dutch museum website initially produced a conventional dark-themed landing page. By iteration 10, the generator scrapped this entirely and reimagined the site as a navigable 3D CSS perspective room — a creative leap rarely seen from single-pass generation.


Relevance to Nextpoint

Direct Pattern Match: pr-review Module

Our pr-review module already implements a version of the Generator-Evaluator pattern:

| Anthropic Pattern | pr-review Implementation |
| --- | --- |
| Planner | PR diff + commits + description gathered, structured into review context |
| Generator (multiple) | 5 parallel specialized agents: security, correctness, edge-cases, architecture, Nextpoint patterns |
| Evaluator | Verifier aggregator: reviews all 5 agent outputs, resolves conflicts, produces final structured review |
| Sprint contracts | N/A (single-pass review, not iterative) |

Where we differ: pr-review is a single-pass system — the 5 agents generate once, the verifier evaluates once. The Anthropic paper shows that iterative generation + evaluation (Generator builds → Evaluator critiques → Generator fixes → repeat) produces dramatically better results.

Potential enhancement: For complex PRs (50+ file changes, architectural shifts), consider adding an iteration loop:

  1. Initial review pass (current: 5 agents + verifier)
  2. Verifier identifies gaps: "Security agent missed the SQL injection risk in file X"
  3. Targeted re-review of flagged areas
  4. Final consolidated review

This would increase cost per review but improve quality on the PRs that matter most.
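A rough sketch of what that loop could look like follows; the run_agent and verify helpers are stubs standing in for the real pr-review module, not its actual API.

```python
# Hypothetical sketch of the proposed iteration loop for complex PRs. The
# run_agent and verify helpers are stubs, not the real pr-review module API.
REVIEW_AGENTS = ["security", "correctness", "edge-cases", "architecture", "nextpoint-patterns"]

def run_agent(agent: str, diff, focus=None): ...   # one specialized review agent (stub)
def verify(findings: dict): ...                    # aggregator: resolve conflicts, flag gaps (stub)

def review_complex_pr(pr, max_rounds: int = 2):
    findings = {agent: run_agent(agent, pr.diff) for agent in REVIEW_AGENTS}  # initial pass
    review = verify(findings)
    for _ in range(max_rounds):
        if not review.gaps:                        # e.g. "security agent missed SQL injection in file X"
            break
        for gap in review.gaps:
            findings[gap.agent] = run_agent(gap.agent, pr.diff, focus=gap.area)  # targeted re-review
        review = verify(findings)
    return review                                  # final consolidated review
```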

T2 Agent Service Design

Our Phase 4 semantic search agents (gap analysis, pattern identification, motion to compel) should adopt the Generator-Evaluator separation:

Gap Analysis Agent (current design):

1. Run semantic search per custodian
2. Compare result counts
3. LLM synthesizes gap report

Gap Analysis Agent (with evaluator separation):

Generator:
  1. Run semantic search per custodian
  2. Compare result counts
  3. Draft gap report

Evaluator:
  1. Verify each claimed gap against actual search results
  2. Check: "Is VP Engineering's mailbox actually active in this period?"
  3. Flag false positives (gap is vocabulary mismatch, not real absence)
  4. Score confidence per gap

If confidence < threshold:
  Generator refines (narrower queries, alternative terms)
  Evaluator re-checks

The self-evaluation bias problem is exactly the risk with our T2 agents. An LLM analyzing gap patterns will over-interpret weak matches — the evaluator separation catches this.
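Below is a hedged sketch of the generator/evaluator split described above; every helper (search_custodian, draft_report, check_gap, refine_queries) is a stub standing in for the real semantic-search and LLM calls, and the confidence threshold is an assumed placeholder.

```python
# Hedged sketch of the generator/evaluator split for the gap-analysis agent.
# All helpers are stubs for the real semantic-search and LLM calls, and the
# confidence threshold is an assumed placeholder.
CONFIDENCE_THRESHOLD = 0.7

def search_custodian(custodian, queries): ...    # semantic search per custodian (stub)
def draft_report(hits): ...                      # LLM synthesizes candidate gaps (stub)
def check_gap(gap): ...                          # verify a claimed gap against raw results (stub)
def refine_queries(weak_gaps): ...               # narrower queries, alternative terms (stub)

def generate_gap_report(custodians, queries):
    hits = {c: search_custodian(c, queries) for c in custodians}
    return draft_report(hits)

def evaluate_gap_report(report):
    for gap in report.gaps:
        check = check_gap(gap)                   # e.g. was this mailbox actually active in the period?
        gap.confidence = check.confidence
        gap.false_positive = check.vocabulary_mismatch
    return report

def run_gap_analysis(custodians, queries, max_rounds: int = 3):
    report = evaluate_gap_report(generate_gap_report(custodians, queries))
    for _ in range(max_rounds):
        weak = [g for g in report.gaps
                if g.confidence < CONFIDENCE_THRESHOLD and not g.false_positive]
        if not weak:
            break
        queries = refine_queries(weak)
        report = evaluate_gap_report(generate_gap_report(custodians, queries))
    return report
```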

Context Anxiety in Long Sessions

The context anxiety phenomenon directly explains behaviors we've observed in long Claude Code sessions. Our CLAUDE.md already recommends:

  • Restart sessions every 60-90 minutes
  • Use /compact proactively
  • /fork for parallel exploration

The Anthropic paper validates these practices and adds a key insight: compaction is not enough if the model has already entered the anxiety pattern. Full context resets (new session with structured handoff) are more reliable. This reinforces our /fork recommendation — fork creates a clean context with full project knowledge, which is exactly the "context reset + structured handoff" pattern.

Harness Simplification Principle

The finding that harnesses should simplify as models improve has direct implications:

  1. pr-review: When we evaluate Mythos/Capybara tier, test whether the 5-agent parallel structure can be reduced to fewer agents (does the better model catch security + correctness + architecture in one pass?)

  2. Skills: The skill testing article (group-24) describes "capability uplift skills" that expire when models catch up. Same principle — our reviewing-architecture skill may need less scaffolding as models internalize more patterns.

  3. Chunking strategy: Our domain-specific chunkers (email-aware, section-aware) encode assumptions about what the embedding model can't handle. As embedding models improve (voyage-law-3?), re-evaluate whether simpler chunking suffices.

The "Every Component Encodes an Assumption" Principle

This is the single most important takeaway for our architecture. Applied to our modules:

| Component | Assumption It Encodes | Re-test When |
| --- | --- | --- |
| 5 parallel pr-review agents | Single agent can't catch all categories | Better base model available |
| Domain-specific chunkers | Generic chunking misses email thread structure | Better embedding models available |
| Sprint contracts (if we adopt) | Agent can't self-assess quality | Model improves self-evaluation |
| Checkpoint state machine (11 steps) | Long pipelines need resumability | Models handle longer coherent tasks |
| Recoverable/Permanent/Silent hierarchy | Agents need explicit error routing | Models improve error handling |

Most of these assumptions remain valid. The point isn't to remove scaffolding prematurely — it's to recognize that every architectural decision is provisional and should be re-evaluated as the underlying models evolve.


Cross-References

  • Group 7 (AI Agent Patterns): SPARC methodology and multi-agent swarms. The Planner-Generator-Evaluator pattern is more structured than SPARC's organic agent coordination.
  • Group 21 (Agentic AI Production Patterns): Five-component agent architecture and six orchestration patterns. The Generator-Evaluator separation maps to the ReAct + Evaluate pattern.
  • Group 24 (Claude Code Architecture): The self-healing query loop from Claude Code's leaked source. The query loop's error recovery cascade is a different kind of harness — recovering from failures rather than iterating toward quality.
  • Anthropic source paper: https://www.anthropic.com/engineering/harness-design-long-running-apps

Key Principle for Nextpoint

"Every component in a harness encodes an assumption about what the model can't do alone. Those assumptions are worth stress testing."

Build harnesses that are strong enough for today's models but designed to simplify. Don't over-engineer scaffolding that will become overhead. Test assumptions when models update. The goal is not the most complex architecture — it's the minimum architecture that produces reliable results.
