Multi-Agent Harness

GAN-inspired three-agent architecture — planner, generator, and evaluator working together to build complete applications over multi-hour autonomous sessions

Beyond Two Agents

The initializer/coder harness solved the multi-session problem, but hit a performance plateau. Two interconnected challenges remained:

  1. Frontend design quality — Models produce functional but generic designs. Subjective quality (“is this design good?”) resists simple verification.
  2. Self-evaluation bias — Agents asked to evaluate their own work confidently praise mediocre outputs. This is especially pronounced for subjective tasks where no binary test exists.

The breakthrough: draw inspiration from Generative Adversarial Networks (GANs) — separate the agent doing the work from the agent judging the work.


Three-Agent Architecture

[Figure: GAN-inspired three-agent harness, separating planning, generation, and evaluation. A user prompt (1–4 sentences) flows to the Planner Agent, which expands it into a full product spec covering high-level design, scope ambition, and AI feature opportunities. The spec feeds a per-round build–evaluate loop: a sprint contract negotiates "done" criteria, the Generator Agent implements feature-by-feature with git versioning and self-evaluation, and the Evaluator (QA) Agent tests as a user via Playwright MCP, grading pass/fail on design quality, originality, craft, and functionality. After 5–15 iterations per round, the loop yields a complete application.]
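
The figure reduces to a compact control flow. Below is a minimal sketch of that flow with all three agents stubbed out; the real harness drives Claude sessions and a live application, so every name here is illustrative:

```python
# Minimal sketch of the three-agent control flow. All agent calls are
# stubbed; the real harness drives Claude and a running application.
from dataclasses import dataclass

@dataclass
class Review:
    passed: bool
    feedback: str

def plan(prompt: str) -> str:
    """Planner: expand a 1-4 sentence prompt into a full product spec."""
    return f"spec for: {prompt}"  # stub

def build(spec: str, feedback: str) -> str:
    """Generator: implement one build round, feature by feature."""
    return f"build of {spec!r} addressing {feedback!r}"  # stub

def evaluate(artifact: str) -> Review:
    """Evaluator: test the running app as a user and grade it."""
    return Review(passed=True, feedback="looks good")  # stub

def run_harness(prompt: str, rounds: int = 3, max_iters: int = 15) -> str:
    spec = plan(prompt)
    artifact, feedback = "", ""
    for _ in range(rounds):            # one sprint contract per round
        for _ in range(max_iters):     # 5-15 iterations per round
            artifact = build(spec, feedback)
            review = evaluate(artifact)
            feedback = review.feedback
            if review.passed:          # evaluator can fail the round,
                break                  # forcing another fix iteration
    return artifact
```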

Planner Agent

Expands a simple 1–4 sentence user prompt into a full product specification:

  • Focuses on product context and high-level technical design, not granular implementation details
  • Emphasizes scope ambition while avoiding cascading errors from overly specific upfront decisions
  • Identifies opportunities to weave AI-powered features into the product

The planner deliberately avoids micro-managing implementation. Over-specified plans create brittleness — if one detail is wrong, downstream work inherits the error. Instead, the planner sets direction and lets the generator make tactical decisions.
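
The planner's actual prompt is not published; the sketch below encodes the constraints above as a hypothetical system prompt and calls the Anthropic Python SDK (the model string is illustrative):

```python
import anthropic

# Hypothetical system prompt reflecting the constraints described above.
PLANNER_SYSTEM = """\
You expand a short product idea into a full product specification.
- Cover product context and high-level technical design.
- Be ambitious in scope, but avoid granular implementation details;
  the generator agent makes those tactical decisions.
- Identify opportunities to weave AI-powered features into the product.
"""

def plan(user_prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    message = client.messages.create(
        model="claude-opus-4-6",    # illustrative model name
        max_tokens=4096,
        system=PLANNER_SYSTEM,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return message.content[0].text  # the product spec, as prose
```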

Generator Agent

Implements the application feature-by-feature, applying the one-feature-at-a-time approach from the long-running harness pattern (a sketch follows the list):

  • Works in focused build rounds, self-evaluating before QA handoff
  • Uses git for version control and rollback
  • Runs coherently for hours — Claude Opus 4.6 eliminated the need for sprint decomposition
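
A sketch of that loop, assuming a hypothetical implement_feature that wraps one focused coding session; committing after every feature is what makes rollback cheap:

```python
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

def implement_feature(feature: str) -> None:
    """One focused Claude coding session per feature (stubbed)."""
    ...

def build_round(features: list[str]) -> None:
    for feature in features:
        implement_feature(feature)
        git("add", "-A")
        git("commit", "-m", f"feat: {feature}")  # checkpoint = rollback point
    # Self-evaluate here before handing the build to the QA agent;
    # a regression can be undone by reverting to the last checkpoint.
```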

Evaluator (QA) Agent

Uses browser automation (Playwright MCP) to interact with the running application like a real user:

  • Tests UI features, API endpoints, and database state
  • Grades each criterion against hard thresholds
  • Provides detailed, actionable feedback for the generator
  • Can fail a build round, forcing the generator to fix issues before proceeding

The evaluator is the critical innovation. By separating generation from evaluation, the harness avoids the self-evaluation trap: it’s easier to tune a standalone evaluator to be skeptical than to make a generator self-critical.
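
The harness reaches the browser through Playwright MCP; the probe below performs the same kind of check directly with the Playwright Python API for illustration. The URL, roles, and text selectors are hypothetical:

```python
from playwright.sync_api import sync_playwright

def probe_app(url: str) -> dict[str, bool]:
    """Exercise the running app the way a user would, not via internals."""
    results: dict[str, bool] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        results["page_loads"] = page.title() != ""
        # Hypothetical UI flow: create a project and confirm it appears.
        page.get_by_role("button", name="New Project").click()
        results["project_created"] = page.get_by_text("Untitled").is_visible()
        browser.close()
    return results
```

Each boolean maps back to an agreed verification criterion; any failed probe is grounds for the evaluator to fail the round.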


Making Subjective Quality Measurable

For frontend design, the evaluator uses four explicit criteria:

Criterion      | Weight | What It Measures
Design quality | High   | Does the design cohere as a whole? Colors, typography, and layout combine into a distinct identity.
Originality    | High   | Evidence of custom decisions vs. template defaults and “AI slop” patterns.
Craft          | Normal | Technical execution: spacing consistency, color harmony, contrast ratios.
Functionality  | Normal | Usability: can users understand the interface and complete tasks?

Design quality and originality are weighted higher because models already perform well on craft and functionality. The weighting pushes the model toward aesthetic risk-taking and away from the generic purple-gradient-over-white-cards pattern.
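
One way to make the rubric operational is a weighted score with per-criterion floors. The 1–10 scale, numeric weights, and thresholds below are assumptions; the source names only the criteria and their relative weights:

```python
# Assumed mapping: "High" = 2.0, "Normal" = 1.0, on an assumed 1-10 scale.
WEIGHTS = {
    "design_quality": 2.0,
    "originality": 2.0,
    "craft": 1.0,
    "functionality": 1.0,
}
PASS_MEAN = 7.0   # hypothetical overall threshold
PASS_FLOOR = 5.0  # hypothetical hard threshold per criterion

def rubric_score(grades: dict[str, float]) -> float:
    """Weighted mean of per-criterion grades."""
    return sum(WEIGHTS[c] * grades[c] for c in WEIGHTS) / sum(WEIGHTS.values())

def passes(grades: dict[str, float]) -> bool:
    # Grade each criterion against a hard floor, plus an overall bar.
    return min(grades.values()) >= PASS_FLOOR and rubric_score(grades) >= PASS_MEAN
```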


Sprint Contract Protocol

Before each build round, the generator and evaluator negotiate a sprint contract:

  1. Generator proposes what will be built and how completion will be verified
  2. Evaluator reviews the proposal — is the right thing being built?
  3. Both iterate until agreement on what “done” looks like

This bridges the gap between user stories and testable implementation. Without it, the evaluator might judge against criteria the generator never intended to address in that round.
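
The protocol is described only in prose, so the shape below is hypothetical; what matters is that the contract pairs every feature with a verification step before any code is written:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SprintContract:
    features: list[str]       # what the generator will build this round
    verification: list[str]   # how the evaluator will check "done"

def negotiate(
    propose: Callable[[Optional[str]], SprintContract],  # generator side
    review: Callable[[SprintContract], Optional[str]],   # evaluator side
    max_rounds: int = 3,
) -> SprintContract:
    contract = propose(None)          # 1. generator proposes the sprint
    for _ in range(max_rounds):
        objection = review(contract)  # 2. evaluator checks: right thing?
        if objection is None:         # 3. iterate until agreement
            return contract
        contract = propose(objection)
    raise RuntimeError("no agreement on what 'done' looks like")
```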

With Claude Opus 4.6, sprint decomposition was removed entirely. The model handles task coherence natively over long sessions, and the evaluator’s value becomes task-dependent rather than universal.


Iteration Dynamics

The generator–evaluator loop runs 5–15 iterations per build round. Key observations:

  • Evaluator assessments improve over iterations before plateauing
  • Prompt wording steers outputs in unexpected ways — phrases like “museum quality” pushed toward visual convergence rather than diversity
  • Improvement isn’t monotonic: middle iterations sometimes outperform final ones (see the sketch after this list)
  • Implementation complexity increases across rounds as the generator tackles more ambitious solutions
  • First iterations already exceed baseline — the criteria language itself steers models away from generic defaults, even before evaluator feedback
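
Because improvement is not monotonic, a harness can keep the best-scoring iteration instead of shipping the last one. This selection policy is an extrapolation from the observations above, not something the source prescribes:

```python
from typing import Callable

def best_of_round(
    iterate_once: Callable[[], tuple[str, float]],  # one build+evaluate pass
    min_iters: int = 5,
    max_iters: int = 15,
) -> tuple[str, float]:
    best_artifact, best_score = "", float("-inf")
    for i in range(max_iters):
        artifact, score = iterate_once()
        if score > best_score:
            best_artifact, best_score = artifact, score
        # Assessments plateau: once past the minimum, stop when an
        # iteration fails to improve on the best seen so far.
        if i + 1 >= min_iters and score < best_score:
            break
    return best_artifact, best_score
```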

Results

Retro Video Game Maker

Prompt: “Create a 2D retro game maker with level editor, sprite editor, entity behaviors, and playable test mode.”

Metric      | Solo Agent                                      | Full Harness
Duration    | 20 minutes                                      | 6 hours
Cost        | $9                                              | $200
Features    | 1 spec feature                                  | 16 features across 10 sprints
Gameplay    | Entity input broken; game runtime wiring failed | Core gameplay functional; entities respond to input
Design      | Wasted space, rigid layout                      | Full viewport, consistent visual identity
AI features | None                                            | AI-assisted sprite generation

Digital Audio Workstation (Simplified Harness)

After removing sprint decomposition for Opus 4.6:

Prompt: “Build a fully featured DAW in the browser using the Web Audio API.”

Phase         | Duration       | Cost
Planner       | 4.7 min        | $0.46
Build Round 1 | 2 hours 7 min  | $71.08
QA Round 1    | 8.8 min        | $3.24
Build Round 2 | 1 hour 2 min   | $36.89
QA Round 2    | 6.8 min        | $3.09
Build Round 3 | 10.9 min       | $5.88
QA Round 3    | 9.6 min        | $4.06
Total         | 3 hours 50 min | $124.70

The generator ran coherently for over two hours without sprint decomposition. The final application included a working arrangement view, mixer, transport, and an AI agent that could compose songs autonomously.


QA Agent Challenges

Building an effective evaluator required its own iteration:

  • Initial QA agents showed poor judgment — identified legitimate issues but rationalized them away, tested superficially rather than probing edge cases
  • Tuning loop: read evaluator logs → identify judgment divergences from human standards → update QA prompts
  • Remaining gaps: layout issues, unintuitive interactions, bugs in deeply nested features

The evaluator is not a solved problem — it’s a continuously tuned component. But even imperfect evaluation dramatically outperforms self-evaluation.


Harness Simplification Principle

The most important takeaway for harness engineers:

“Every harness component encodes assumptions about model capabilities. These assumptions warrant stress testing — they may be incorrect and can quickly become obsolete as models improve.”

When Claude Opus 4.6 arrived with improved long-horizon planning and debugging, the sprint construct was removed. The result: simpler harness, comparable or better output.

The workflow: always experiment with the target model → read traces on realistic problems → tune performance → when new models arrive, re-examine the harness and strip non-load-bearing components.
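
One way to keep those assumptions easy to stress-test is to represent each scaffolding component as an explicit toggle, so a new model can be evaluated with pieces switched off. This is illustrative only; the harness's actual configuration is not published:

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    # Each flag encodes an assumption about model capability.
    use_sprint_decomposition: bool = False  # removed for Claude Opus 4.6
    use_sprint_contract: bool = True        # value is now task-dependent
    use_external_evaluator: bool = True     # still beats self-evaluation
    max_iterations_per_round: int = 15

# When a new model ships: rerun realistic problems with each flag
# flipped, read the traces, and strip components that no longer pay rent.
```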

Better models need less scaffolding. But they also create space for more ambitious harnesses that achieve capabilities previously out of reach. The work of harness engineering is finding the right balance: neither over-constraining a capable model nor under-supporting a limited one.


Source

Harness Design for Long-Running Application Development — Prithvi Rajasekaran, Anthropic, March 2026.
