Multi-Agent Harness
GAN-inspired three-agent architecture — planner, generator, and evaluator working together to build complete applications over multi-hour autonomous sessions
Beyond Two Agents
The initializer/coder harness solved the multi-session problem, but hit a performance plateau. Two interconnected challenges remained:
- Frontend design quality — Models produce functional but generic designs. Subjective quality (“is this design good?”) resists simple verification.
- Self-evaluation bias — Agents asked to evaluate their own work confidently praise mediocre outputs. This is especially pronounced for subjective tasks where no binary test exists.
The breakthrough: draw inspiration from Generative Adversarial Networks (GANs) — separate the agent doing the work from the agent judging the work.
Three-Agent Architecture
Planner Agent
Expands a simple 1–4 sentence user prompt into a full product specification:
- Focuses on product context and high-level technical design, not granular implementation details
- Emphasizes scope ambition while avoiding cascading errors from overly specific upfront decisions
- Identifies opportunities to weave AI-powered features into the product
The planner deliberately avoids micro-managing implementation. Over-specified plans create brittleness — if one detail is wrong, downstream work inherits the error. Instead, the planner sets direction and lets the generator make tactical decisions.
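This division of labor can be sketched in code. The names below (`ProductSpec`, `build_planner_prompt`) are illustrative, not the harness's actual API; the point is that the planner's output stays at the level of product context, feature scope, and AI opportunities:

```python
from dataclasses import dataclass, field

# Hypothetical schema for the planner's output: product context and
# high-level design only, no granular implementation details.
@dataclass
class ProductSpec:
    summary: str                                               # expanded product vision
    features: list[str] = field(default_factory=list)          # ambitious but high-level
    ai_opportunities: list[str] = field(default_factory=list)  # AI-powered feature ideas

def build_planner_prompt(user_prompt: str) -> str:
    """Wrap a short 1-4 sentence user prompt in instructions that set
    direction while leaving tactical decisions to the generator."""
    return (
        "Expand the following idea into a product specification.\n"
        "Focus on product context and high-level technical design; "
        "leave tactical implementation decisions to the builder.\n"
        "Identify opportunities to weave in AI-powered features.\n\n"
        f"Idea: {user_prompt}"
    )

prompt = build_planner_prompt("A 2D retro game maker with a level editor.")
```

The schema deliberately has no field for file layout, framework choice, or API shapes: anything that granular is a tactical decision the generator owns.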
Generator Agent
Implements the application feature-by-feature, applying the one-feature-at-a-time approach from the long-running harness pattern:
- Works in focused build rounds, self-evaluating before QA handoff
- Uses git for version control and rollback
- Runs coherently for hours — Claude Opus 4.6 eliminated the need for sprint decomposition
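The one-feature-at-a-time loop might look roughly like the sketch below, with `implement`, `self_check`, and `checkpoint` as stand-ins for the model call, the self-evaluation step, and a git commit; this is an illustration of the pattern, not the harness's actual code:

```python
def run_build_round(features, implement, self_check, checkpoint):
    """One focused build round: implement features one at a time,
    self-evaluate each before QA handoff, and checkpoint (e.g. a git
    commit) only the ones that pass, so failed work can be rolled back."""
    built = []
    for feature in features:
        implement(feature)                  # model writes the code
        if self_check(feature):             # self-evaluation before QA handoff
            checkpoint(f"feat: {feature}")  # e.g. `git add -A && git commit`
            built.append(feature)
    return built
```

Committing per feature rather than per round is what makes rollback cheap: a failed feature never contaminates the checkpointed history.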
Evaluator (QA) Agent
Uses browser automation (Playwright MCP) to interact with the running application like a real user:
- Tests UI features, API endpoints, and database state
- Grades each criterion against hard thresholds
- Provides detailed, actionable feedback for the generator
- Can fail a build round, forcing the generator to fix issues before proceeding
The evaluator is the critical innovation. By separating generation from evaluation, the harness avoids the self-evaluation trap: it’s easier to tune a standalone evaluator to be skeptical than to make a generator self-critical.
Making Subjective Quality Measurable
For frontend design, the evaluator uses four explicit criteria:
| Criterion | Weight | What It Measures |
|---|---|---|
| Design quality | High | Does the design cohere as a whole? Colors, typography, layout combine into a distinct identity |
| Originality | High | Evidence of custom decisions vs. template defaults and “AI slop” patterns |
| Craft | Normal | Technical execution — spacing consistency, color harmony, contrast ratios |
| Functionality | Normal | Usability — can users understand the interface and complete tasks? |
Design quality and originality are weighted higher because models already perform well on craft and functionality. The weighting pushes the model toward aesthetic risk-taking and away from the generic purple-gradient-over-white-cards pattern.
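The exact scales and weights aren't published; the sketch below assumes a 0-10 score per criterion, double weight on the two "High" criteria, and a single hard threshold per criterion that can fail a build round outright:

```python
# Assumed values for illustration: scale, weights, and threshold are
# not from the source.
WEIGHTS = {"design_quality": 2.0, "originality": 2.0,
           "craft": 1.0, "functionality": 1.0}
THRESHOLD = 6  # minimum acceptable score per criterion

def evaluate(scores: dict[str, float]) -> tuple[float, bool]:
    """Return (weighted_average, passed). A single criterion below its
    hard threshold fails the round regardless of the average."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight
    passed = all(scores[c] >= THRESHOLD for c in WEIGHTS)
    return weighted, passed
```

Hard per-criterion thresholds matter here: without them, a high design-quality score could average away a broken-functionality score, which is exactly the kind of rationalization the evaluator exists to prevent.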
Sprint Contract Protocol
Before each build round, the generator and evaluator negotiate a sprint contract:
- Generator proposes what will be built and how completion will be verified
- Evaluator reviews the proposal — is the right thing being built?
- Both iterate until agreement on what “done” looks like
This bridges the gap between user stories and testable implementation. Without it, the evaluator might judge against criteria the generator never intended to address in that round.
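The propose/review/iterate cycle can be sketched as a small negotiation loop; `propose` and `review` are stand-ins for the generator and evaluator model calls, and the loop shape is an assumption, not the harness's published protocol:

```python
def negotiate_contract(propose, review, max_rounds: int = 3):
    """Generator proposes a sprint contract; evaluator reviews it.
    `propose(feedback)` returns a draft; `review(draft)` returns
    (approved, feedback). Iterate until both agree on what "done" means."""
    feedback = None
    for _ in range(max_rounds):
        draft = propose(feedback)           # generator drafts scope + verification plan
        approved, feedback = review(draft)  # evaluator: is the right thing being built?
        if approved:
            return draft                    # agreed definition of "done"
    raise RuntimeError("no agreement reached within max_rounds")
```

Capping the rounds keeps negotiation from consuming the budget meant for building; a stalled contract is a signal to escalate, not to keep arguing.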
With Claude Opus 4.6, sprint decomposition was removed entirely. The model handles task coherence natively over long sessions. The evaluator’s value becomes task-dependent rather than universally required.
Iteration Dynamics
The generator–evaluator loop runs 5–15 iterations per build round. Key observations:
- Evaluator assessments improve over iterations before plateauing
- Prompt wording steers outputs in unexpected ways — phrases like “museum quality” pushed toward visual convergence rather than diversity
- Improvement isn’t linearly clean — sometimes middle iterations outperform final ones
- Implementation complexity increases across rounds as the generator tackles more ambitious solutions
- First iterations already exceed baseline — the criteria language itself steers models away from generic defaults, even before evaluator feedback
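Because improvement isn't monotonic, one natural loop design is to checkpoint every iteration and keep the best-scoring one rather than blindly shipping the last. The sketch below illustrates that idea; it is a response to the observation above, not necessarily what the harness does:

```python
def iterate(generate, score, n_iterations: int = 10):
    """Run the generator-evaluator loop, tracking the best-scoring
    candidate across iterations. `generate(feedback)` and `score(candidate)`
    stand in for model calls; `score` returns (value, feedback)."""
    best, best_score = None, float("-inf")
    feedback = None
    for _ in range(n_iterations):
        candidate = generate(feedback)       # generator responds to last feedback
        s, feedback = score(candidate)       # evaluator grades and explains
        if s > best_score:                   # keep the peak, not just the final state
            best, best_score = candidate, s
    return best, best_score
```

With git-per-iteration checkpoints, "keep the best" is just a checkout of the highest-scoring commit.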
Results
Retro Video Game Maker
Prompt: “Create a 2D retro game maker with level editor, sprite editor, entity behaviors, and playable test mode.”
| Metric | Solo Agent | Full Harness |
|---|---|---|
| Duration | 20 minutes | 6 hours |
| Cost | $9 | $200 |
| Features | 1 spec feature | 16 features, 10 sprints |
| Core behavior | Entity input broken, game runtime wiring failed | Core gameplay functional, entities respond to input |
| Design | Wasted space, rigid layout | Full viewport, consistent visual identity |
| AI features | None | AI-assisted sprite generation |
Digital Audio Workstation (Simplified Harness)
After removing sprint decomposition for Opus 4.6:
Prompt: “Build a fully featured DAW in the browser using the Web Audio API.”
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build Round 1 | 2 hours 7 min | $71.08 |
| QA Round 1 | 8.8 min | $3.24 |
| Build Round 2 | 1 hour 2 min | $36.89 |
| QA Round 2 | 6.8 min | $3.09 |
| Build Round 3 | 10.9 min | $5.88 |
| QA Round 3 | 9.6 min | $4.06 |
| Total | 3 hours 50 min | $124.70 |
The generator ran coherently for over two hours without sprint decomposition. The final application included a working arrangement view, mixer, transport, and an AI agent that could compose songs autonomously.
QA Agent Challenges
Building an effective evaluator required its own iteration:
- Initial QA agents showed poor judgment — identified legitimate issues but rationalized them away, tested superficially rather than probing edge cases
- Tuning loop: read evaluator logs → identify judgment divergences from human standards → update QA prompts
- Remaining gaps: layout issues, unintuitive interactions, bugs in deeply nested features
The evaluator is not a solved problem — it’s a continuously tuned component. But even imperfect evaluation dramatically outperforms self-evaluation.
Harness Simplification Principle
The most important takeaway for harness engineers:
“Every harness component encodes assumptions about model capabilities. These assumptions warrant stress testing — they may be incorrect and can quickly become obsolete as models improve.”
When Claude Opus 4.6 arrived with improved long-horizon planning and debugging, the sprint construct was removed. The result: simpler harness, comparable or better output.
The workflow: always experiment with the target model → read traces on realistic problems → tune performance → when new models arrive, re-examine the harness and strip non-load-bearing components.
Better models need less scaffolding. But they also create space for more ambitious harnesses that achieve previously impossible capabilities. The work of harness engineering is finding the right balance: neither over-constraining a capable model nor under-supporting a limited one.
Source
Harness Design for Long-Running Application Development — Prithvi Rajasekaran, Anthropic, March 2026.