Multi-Agent Harness
GAN-inspired three-agent architecture — planner, generator, and evaluator working together to build complete applications over multi-hour autonomous sessions
Beyond Two Agents
The initializer/coder harness solved the multi-session problem, but hit a performance plateau. Two interconnected challenges remained:
- Frontend design quality — Models produce functional but generic designs. Subjective quality (“is this design good?”) resists simple verification.
- Self-evaluation bias — Agents asked to evaluate their own work confidently praise mediocre outputs. This is especially pronounced for subjective tasks where no binary test exists.
The breakthrough: draw inspiration from Generative Adversarial Networks (GANs) — separate the agent doing the work from the agent judging the work.
Three-Agent Architecture
Planner Agent
Expands a simple 1–4 sentence user prompt into a full product specification:
- Focuses on product context and high-level technical design, not granular implementation details
- Emphasizes scope ambition while avoiding cascading errors from overly specific upfront decisions
- Identifies opportunities to weave AI-powered features into the product
The planner deliberately avoids micro-managing implementation. Over-specified plans create brittleness — if one detail is wrong, downstream work inherits the error. Instead, the planner sets direction and lets the generator make tactical decisions.
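This division of labor can be sketched in code. The names below (`ProductSpec`, `build_planner_prompt`) are illustrative, not the harness's actual API; the point is that the planner's output stays at the level of product context, feature scope, and AI opportunities:

```python
from dataclasses import dataclass, field

# Hypothetical schema for the planner's output: product context and
# high-level design only, no granular implementation details.
@dataclass
class ProductSpec:
    summary: str                                               # expanded product vision
    features: list[str] = field(default_factory=list)          # ambitious but high-level
    ai_opportunities: list[str] = field(default_factory=list)  # AI-powered feature ideas

def build_planner_prompt(user_prompt: str) -> str:
    """Wrap a short 1-4 sentence user prompt in instructions that set
    direction while leaving tactical decisions to the generator."""
    return (
        "Expand the following idea into a product specification.\n"
        "Focus on product context and high-level technical design; "
        "leave tactical implementation decisions to the builder.\n"
        "Identify opportunities to weave in AI-powered features.\n\n"
        f"Idea: {user_prompt}"
    )

prompt = build_planner_prompt("A 2D retro game maker with a level editor.")
```

The schema deliberately has no field for file layout, framework choice, or API shapes: anything that granular is a tactical decision the generator owns.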
Generator Agent
Implements the application feature-by-feature, applying the one-feature-at-a-time approach from the long-running harness pattern:
- Works in focused build rounds, self-evaluating before QA handoff
- Uses git for version control and rollback
- Runs coherently for hours — Claude Opus 4.6 eliminated the need for sprint decomposition
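The one-feature-at-a-time loop might look roughly like the sketch below, with `implement`, `self_check`, and `checkpoint` as stand-ins for the model call, the self-evaluation step, and a git commit; this is an illustration of the pattern, not the harness's actual code:

```python
def run_build_round(features, implement, self_check, checkpoint):
    """One focused build round: implement features one at a time,
    self-evaluate each before QA handoff, and checkpoint (e.g. a git
    commit) only the ones that pass, so failed work can be rolled back."""
    built = []
    for feature in features:
        implement(feature)                  # model writes the code
        if self_check(feature):             # self-evaluation before QA handoff
            checkpoint(f"feat: {feature}")  # e.g. `git add -A && git commit`
            built.append(feature)
    return built
```

Committing per feature rather than per round is what makes rollback cheap: a failed feature never contaminates the checkpointed history.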
Evaluator (QA) Agent
Uses browser automation (Playwright MCP) to interact with the running application like a real user:
- Tests UI features, API endpoints, and database state
- Grades each criterion against hard thresholds
- Provides detailed, actionable feedback for the generator
- Can fail a build round, forcing the generator to fix issues before proceeding
The evaluator is the critical innovation. By separating generation from evaluation, the harness avoids the self-evaluation trap: it’s easier to tune a standalone evaluator to be skeptical than to make a generator self-critical.
Making Subjective Quality Measurable
For frontend design, the evaluator uses four explicit criteria:
| Criterion | Weight | What It Measures |
|---|---|---|
| Design quality | High | Does the design cohere as a whole? Colors, typography, layout combine into a distinct identity |
| Originality | High | Evidence of custom decisions vs. template defaults and “AI slop” patterns |
| Craft | Normal | Technical execution — spacing consistency, color harmony, contrast ratios |
| Functionality | Normal | Usability — can users understand the interface and complete tasks? |
Design quality and originality are weighted higher because models already perform well on craft and functionality. The weighting pushes the model toward aesthetic risk-taking and away from the generic purple-gradient-over-white-cards pattern.
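The exact scales and weights aren't published; the sketch below assumes a 0-10 score per criterion, double weight on the two "High" criteria, and a single hard threshold per criterion that can fail a build round outright:

```python
# Assumed values for illustration: scale, weights, and threshold are
# not from the source.
WEIGHTS = {"design_quality": 2.0, "originality": 2.0,
           "craft": 1.0, "functionality": 1.0}
THRESHOLD = 6  # minimum acceptable score per criterion

def evaluate(scores: dict[str, float]) -> tuple[float, bool]:
    """Return (weighted_average, passed). A single criterion below its
    hard threshold fails the round regardless of the average."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight
    passed = all(scores[c] >= THRESHOLD for c in WEIGHTS)
    return weighted, passed
```

Hard per-criterion thresholds matter here: without them, a high design-quality score could average away a broken-functionality score, which is exactly the kind of rationalization the evaluator exists to prevent.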
Sprint Contract Protocol
Before each build round, the generator and evaluator negotiate a sprint contract:
- Generator proposes what will be built and how completion will be verified
- Evaluator reviews the proposal — is the right thing being built?
- Both iterate until agreement on what “done” looks like
This bridges the gap between user stories and testable implementation. Without it, the evaluator might judge against criteria the generator never intended to address in that round.
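The propose/review/iterate cycle can be sketched as a small negotiation loop; `propose` and `review` are stand-ins for the generator and evaluator model calls, and the loop shape is an assumption, not the harness's published protocol:

```python
def negotiate_contract(propose, review, max_rounds: int = 3):
    """Generator proposes a sprint contract; evaluator reviews it.
    `propose(feedback)` returns a draft; `review(draft)` returns
    (approved, feedback). Iterate until both agree on what "done" means."""
    feedback = None
    for _ in range(max_rounds):
        draft = propose(feedback)           # generator drafts scope + verification plan
        approved, feedback = review(draft)  # evaluator: is the right thing being built?
        if approved:
            return draft                    # agreed definition of "done"
    raise RuntimeError("no agreement reached within max_rounds")
```

Capping the rounds keeps negotiation from consuming the budget meant for building; a stalled contract is a signal to escalate, not to keep arguing.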
With Claude Opus 4.6, sprint decomposition was removed entirely. The model handles task coherence natively over long sessions. The evaluator’s value becomes task-dependent rather than universally required.
Iteration Dynamics
The generator–evaluator loop runs 5–15 iterations per build round. Key observations:
- Evaluator assessments improve over iterations before plateauing
- Prompt wording steers outputs in unexpected ways — phrases like “museum quality” pushed toward visual convergence rather than diversity
- Improvement isn’t linearly clean — sometimes middle iterations outperform final ones
- Implementation complexity increases across rounds as the generator tackles more ambitious solutions
- First iterations already exceed baseline — the criteria language itself steers models away from generic defaults, even before evaluator feedback
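Because improvement isn't monotonic, one natural loop design is to checkpoint every iteration and keep the best-scoring one rather than blindly shipping the last. The sketch below illustrates that idea; it is a response to the observation above, not necessarily what the harness does:

```python
def iterate(generate, score, n_iterations: int = 10):
    """Run the generator-evaluator loop, tracking the best-scoring
    candidate across iterations. `generate(feedback)` and `score(candidate)`
    stand in for model calls; `score` returns (value, feedback)."""
    best, best_score = None, float("-inf")
    feedback = None
    for _ in range(n_iterations):
        candidate = generate(feedback)       # generator responds to last feedback
        s, feedback = score(candidate)       # evaluator grades and explains
        if s > best_score:                   # keep the peak, not just the final state
            best, best_score = candidate, s
    return best, best_score
```

With git-per-iteration checkpoints, "keep the best" is just a checkout of the highest-scoring commit.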
Results
Retro Video Game Maker
Prompt: “Create a 2D retro game maker with level editor, sprite editor, entity behaviors, and playable test mode.”
| Metric | Solo Agent | Full Harness |
|---|---|---|
| Duration | 20 minutes | 6 hours |
| Cost | $9 | $200 |
| Features | 1 spec feature | 16 features, 10 sprints |
| Core behavior | Entity input broken, game runtime wiring failed | Core gameplay functional, entities respond to input |
| Design | Wasted space, rigid layout | Full viewport, consistent visual identity |
| AI features | None | AI-assisted sprite generation |
Digital Audio Workstation (Simplified Harness)
After removing sprint decomposition for Opus 4.6:
Prompt: “Build a fully featured DAW in the browser using the Web Audio API.”
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build Round 1 | 2 hours 7 min | $71.08 |
| QA Round 1 | 8.8 min | $3.24 |
| Build Round 2 | 1 hour 2 min | $36.89 |
| QA Round 2 | 6.8 min | $3.09 |
| Build Round 3 | 10.9 min | $5.88 |
| QA Round 3 | 9.6 min | $4.06 |
| Total | 3 hours 50 min | $124.70 |
The generator ran coherently for over two hours without sprint decomposition. The final application included a working arrangement view, mixer, transport, and an AI agent that could compose songs autonomously.
QA Agent Challenges
Building an effective evaluator required its own iteration:
- Initial QA agents showed poor judgment — identified legitimate issues but rationalized them away, tested superficially rather than probing edge cases
- Tuning loop: read evaluator logs → identify judgment divergences from human standards → update QA prompts
- Remaining gaps: layout issues, unintuitive interactions, bugs in deeply nested features
The evaluator is not a solved problem — it’s a continuously tuned component. But even imperfect evaluation dramatically outperforms self-evaluation.
Harness Simplification Principle
The most important takeaway for harness engineers:
“Every harness component encodes assumptions about model capabilities. These assumptions warrant stress testing — they may be incorrect and can quickly become obsolete as models improve.”
When Claude Opus 4.6 arrived with improved long-horizon planning and debugging, the sprint construct was removed. The result: simpler harness, comparable or better output.
The workflow: always experiment with the target model → read traces on realistic problems → tune performance → when new models arrive, re-examine the harness and strip non-load-bearing components.
Better models need less scaffolding. But they also create space for more ambitious harnesses that achieve previously impossible capabilities. The work of harness engineering is finding the right balance: neither over-constraining a capable model nor under-supporting a limited one.
Source
Harness Design for Long-Running Application Development — Prithvi Rajasekaran, Anthropic, March 2026.