Long-Running Harness

Making agents work across multiple context windows — the initializer/coder two-agent harness pattern for multi-session autonomous tasks

The Core Challenge

AI agents operate in discrete sessions. Each session begins with a fresh context window — no memory of what came before. For tasks that span hours or days, this is a fundamental constraint: the agent must somehow maintain coherence across multiple context windows, each starting from zero.

This mirrors the challenge of shift-based engineering teams. Each new shift worker arrives without context from the previous shift. Without structured handoff protocols, work degrades: tasks are repeated, progress is lost, and the project drifts.


Two Failure Modes

Anthropic’s testing of long-running autonomous coding agents revealed two critical failure patterns:

| Failure Mode | Symptom |
| --- | --- |
| Over-ambition | Agent attempts complete implementation in a single session, runs out of context mid-feature |
| Premature completion | Later agent instances see partial progress and declare the project finished |

Both failures stem from the same root cause: the agent lacks structured awareness of what has been done and what remains.


Two-Agent Solution

[Diagram: Two-Agent Long-Running Pattern. The Initializer (Session 1) performs one-time setup: create init.sh, generate the feature list, create progress.txt, and make the initial git commit. These persist as the Context Bridge, durable artifacts on disk (init.sh, feature-list.json, claude-progress.txt, git history) that are lossless and always available. Coder Sessions 2..N read the progress, select one feature, implement and verify it, then commit and update, one feature at a time. The Initializer creates the artifacts; coder sessions read and update them. Each session is self-contained: it reads from durable artifacts, does incremental work, and leaves clean state for the next session.]

The solution splits agent work into two specialized roles, each with its own prompt and protocol:

Initializer Agent

The first session runs a specialized initializer that establishes the project infrastructure:

  1. init.sh — executable script for environment setup (dependencies, dev server, etc.)
  2. Feature list — comprehensive JSON catalog of all required functionality with verification steps
  3. claude-progress.txt — work history document for cross-session continuity
  4. Initial git commit — baseline state that all subsequent work builds on
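The four artifacts above can be sketched as a single setup function. This is a minimal illustration, not the harness's actual implementation: the file names come from the source, but the file contents and the `initialize_project` helper are placeholders.

```python
import json
import os
import stat

def initialize_project(root: str) -> None:
    """One-time setup sketch for the initializer session. File names come from
    the harness; the file contents here are illustrative placeholders."""
    # 1. init.sh: executable environment-setup script (contents are project-specific)
    init_sh = os.path.join(root, "init.sh")
    with open(init_sh, "w") as f:
        f.write("#!/bin/sh\nnpm install\nnpm run dev &\n")
    os.chmod(init_sh, os.stat(init_sh).st_mode | stat.S_IXUSR)

    # 2. feature-list.json: full catalog of required functionality, all unverified
    features = [{
        "category": "functional",
        "description": "New chat button creates a fresh conversation",
        "steps": ["Click new chat button", "Verify empty conversation appears"],
        "passes": False,
    }]
    with open(os.path.join(root, "feature-list.json"), "w") as f:
        json.dump(features, f, indent=2)

    # 3. claude-progress.txt: work history for cross-session continuity
    with open(os.path.join(root, "claude-progress.txt"), "w") as f:
        f.write("Session 1: infrastructure created; no features implemented yet.\n")

    # 4. The real harness ends with an initial `git commit` as the baseline state.
```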

The feature list is the key artifact. Each entry contains:

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": ["Click new chat button", "Verify empty conversation appears", "Verify URL updates"],
  "passes": false
}

Critical constraint: subsequent agents may only modify the passes field — never the feature descriptions or steps. This prevents drift in success criteria across sessions.
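The only-edit-`passes` rule can be enforced mechanically rather than left to prompting. The sketch below shows one way, assuming list order and exact-match descriptions; the `mark_feature_passed` helper is hypothetical, but the constraint it enforces is the harness rule.

```python
import json

def mark_feature_passed(path: str, description: str) -> None:
    """Flip one feature's `passes` flag and nothing else. The helper name is
    hypothetical; the only-edit-passes constraint is the harness rule."""
    with open(path) as f:
        features = json.load(f)
    for feature in features:
        if feature["description"] == description:
            feature["passes"] = True  # the single field a coder session may change
            break
    else:
        raise KeyError(f"no feature described as {description!r}")
    with open(path, "w") as f:
        json.dump(features, f, indent=2)  # descriptions and steps written back verbatim
```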

Coding Agent

Every subsequent session follows a structured startup protocol:

  1. Execute pwd to verify working directory
  2. Read claude-progress.txt and recent git log
  3. Consult feature list, select highest-priority incomplete item
  4. Run init.sh to start development environment
  5. Execute baseline tests to verify existing functionality
  6. Implement the selected feature
  7. Verify through end-to-end testing
  8. Update progress file, commit with descriptive message

The agent works one feature at a time, verifying completion before moving on. Git serves as the recovery mechanism — if a change breaks existing functionality, the agent can revert to the last known-good state.
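Steps 2-3 of the protocol also guard against the premature-completion failure mode: a session may declare the project finished only when every `passes` flag is true. A minimal sketch, assuming priority equals list order (the source does not specify an ordering):

```python
import json

def select_next_feature(feature_list_path: str):
    """Consult the feature list and pick the first incomplete item.
    Priority ordering is assumed to be list order here."""
    with open(feature_list_path) as f:
        features = json.load(f)
    for feature in features:
        if not feature["passes"]:
            return feature      # work on exactly this one feature this session
    return None                 # every feature passes: the project is truly done
```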


The Context Bridge

The claude-progress.txt file, together with git history, forms a context bridge — structured artifacts that let a fresh context window orient itself quickly.

This is a deliberate engineering choice. Rather than relying on context compaction (which loses information) or massive context windows (which degrade attention), the harness externalizes state into durable files that any new session can read.

The pattern mirrors how professional software engineers work: clear handoff documentation, incremental progress tracking, and verified completion criteria. The harness doesn’t try to make the model superhuman — it gives the model the same tools that make human engineers effective.


Testing Strategy

A critical discovery: prompting agents toward user-perspective testing dramatically improved feature reliability compared to developer-focused testing.

| Testing Approach | Behavior | Result |
| --- | --- | --- |
| Developer-focused | Agent checks code logic, runs unit tests | Misses UI bugs, integration failures |
| User-perspective | Agent uses browser automation (e.g., Puppeteer MCP) to test as a real user | Catches visual bugs, interaction failures, end-to-end issues |

The harness explicitly instructs the coding agent to verify features end-to-end using browser automation tools, testing from the user’s perspective rather than the developer’s.
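User-perspective verification falls naturally out of the feature list: each entry's `steps` are user actions, so verification is just executing them in order against the live UI. The sketch below is hypothetical; the `actions` callables stand in for real browser-automation calls (in the harness, via the Puppeteer MCP server).

```python
def verify_feature(feature, actions):
    """Run a feature's verification steps from the user's perspective. `actions`
    maps each step's text to a callable that drives the real UI; the callables
    here are stand-ins, and this helper itself is a hypothetical sketch."""
    for step in feature["steps"]:
        try:
            actions[step]()     # e.g. click the button in an actual browser
        except Exception:
            return False        # any failing step leaves `passes` as False
    return True
```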


Session Flow

Session 1 (Initializer)
  ├─ Create init.sh
  ├─ Generate comprehensive feature list
  ├─ Create claude-progress.txt
  └─ Initial git commit

Session 2..N (Coding Agent)
  ├─ Read progress + git log
  ├─ Select next incomplete feature
  ├─ Run init.sh → start dev environment
  ├─ Baseline test existing features
  ├─ Implement selected feature
  ├─ End-to-end verification
  ├─ Update progress + git commit
  └─ (context exhausted → new session starts)

Each session is self-contained: it reads everything it needs from durable artifacts, does incremental work, and leaves the project in a clean state for the next session. No session depends on information that exists only in a previous session’s context window.
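The session flow above can be modeled as an outer loop in which each iteration is a fresh context window that sees only what is on disk. This is an illustrative simulation, not the harness itself; `implement` is a hypothetical callable that builds and verifies one feature.

```python
import json

def run_sessions(features_path: str, max_sessions: int, implement) -> int:
    """Outer-loop sketch: each iteration models one fresh-context session that
    reads all state from disk, completes at most one feature, and writes state
    back. `implement` returns True on verified success. Returns sessions spent."""
    for session in range(1, max_sessions + 1):
        with open(features_path) as f:
            features = json.load(f)               # orientation: no in-context memory
        todo = [ft for ft in features if not ft["passes"]]
        if not todo:
            return session - 1                    # nothing left: genuinely finished
        if implement(todo[0]):                    # one feature, verified end to end
            todo[0]["passes"] = True
            with open(features_path, "w") as f:
                json.dump(features, f, indent=2)  # durable handoff for next session
    return max_sessions
```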


Key Insights

Incremental over ambitious — Agents that attempt to implement everything at once run out of context mid-feature, leaving broken code. One-feature-at-a-time with verification is slower per session but reliably converges.

Git as safety net — Version control isn’t just for history. It’s the agent’s undo mechanism. When an implementation breaks existing features, git revert recovers the last working state instantly.

Structured handoffs over raw memory — Externalizing state into files (claude-progress.txt, feature list, git log) is more reliable than any in-context memory or compaction strategy. The information is lossless and always available.

Human-inspired practices — The most effective agent harnesses don’t invent novel AI workflows. They replicate the practices that make human engineering teams effective: clear documentation, incremental progress, verified completion, and clean handoffs.


Source

Effective Harnesses for Long-Running Agents — Justin Young, Anthropic, November 2025.
