Long-Running Harness
Making agents work across multiple context windows — the initializer/coder two-agent harness pattern for multi-session autonomous tasks
The Core Challenge
AI agents operate in discrete sessions. Each session begins with a fresh context window — no memory of what came before. For tasks that span hours or days, this is a fundamental constraint: the agent must somehow maintain coherence across multiple context windows, each starting from zero.
This mirrors the challenge of shift-based engineering teams. Each new shift worker arrives without context from the previous shift. Without structured handoff protocols, work degrades: tasks are repeated, progress is lost, and the project drifts.
Two Failure Modes
Anthropic’s testing of long-running autonomous coding agents revealed two critical failure patterns:
| Failure Mode | Symptom |
|---|---|
| Over-ambition | Agent attempts complete implementation in a single session, runs out of context mid-feature |
| Premature completion | Later agent instances see partial progress and declare the project finished |
Both failures stem from the same root cause: the agent lacks structured awareness of what has been done and what remains.
Two-Agent Solution
The solution splits agent work into two specialized roles, each with its own prompt and protocol:
Initializer Agent
The first session runs a specialized initializer that establishes the project infrastructure:
- `init.sh` — executable script for environment setup (dependencies, dev server, etc.)
- Feature list — comprehensive JSON catalog of all required functionality with verification steps
- `claude-progress.txt` — work history document for cross-session continuity
- Initial git commit — baseline state that all subsequent work builds on
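As a sketch of what the initializer produces (file names follow the artifacts above; the layout and contents are illustrative assumptions, not Anthropic's actual implementation):

```python
import json
import os
import stat

def scaffold_project(root: str, features: list[dict]) -> None:
    """Hypothetical initializer step: write the durable artifacts
    that later coding sessions will read on startup."""
    os.makedirs(root, exist_ok=True)

    # init.sh: environment setup script (contents are illustrative)
    init_path = os.path.join(root, "init.sh")
    with open(init_path, "w") as f:
        f.write("#!/bin/sh\nnpm install\nnpm run dev &\n")
    os.chmod(init_path, os.stat(init_path).st_mode | stat.S_IXUSR)

    # feature_list.json: every feature starts with "passes": false
    with open(os.path.join(root, "feature_list.json"), "w") as f:
        json.dump(features, f, indent=2)

    # claude-progress.txt: work history seeded for cross-session continuity
    with open(os.path.join(root, "claude-progress.txt"), "w") as f:
        f.write("Session 1: project initialized; no features implemented yet.\n")

features = [{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "steps": ["Click new chat button", "Verify empty conversation appears"],
    "passes": False,
}]
scaffold_project("demo_project", features)
```

Everything a later session needs lives on disk after this runs; nothing important remains only in the initializer's context window.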
The feature list is the key artifact. Each entry contains:
```json
{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": ["Click new chat button", "Verify empty conversation appears", "Verify URL updates"],
  "passes": false
}
```
Critical constraint: subsequent agents may modify only the `passes` field — never the feature descriptions or steps.
This prevents drift in success criteria across sessions.
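One way to enforce this constraint mechanically — a sketch, not part of the published harness — is to validate any proposed feature-list update against the original before writing it back:

```python
def validate_feature_update(original: list[dict], updated: list[dict]) -> bool:
    """Accept an updated feature list only if every field except
    'passes' is identical to the original."""
    if len(original) != len(updated):
        return False  # features may not be added or removed
    for old, new in zip(original, updated):
        frozen_old = {k: v for k, v in old.items() if k != "passes"}
        frozen_new = {k: v for k, v in new.items() if k != "passes"}
        if frozen_old != frozen_new:
            return False  # descriptions or steps were edited: reject
    return True

original = [{"description": "New chat button works", "steps": ["Click"], "passes": False}]
ok = [{"description": "New chat button works", "steps": ["Click"], "passes": True}]
drifted = [{"description": "Chat button mostly works", "steps": ["Click"], "passes": True}]
print(validate_feature_update(original, ok))       # True: only 'passes' flipped
print(validate_feature_update(original, drifted))  # False: success criteria weakened
```

Rejecting edits like `drifted` is exactly what keeps a later session from quietly lowering the bar it is graded against.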
Coding Agent
Every subsequent session follows a structured startup protocol:
1. Execute `pwd` to verify the working directory
2. Read `claude-progress.txt` and the recent git log
3. Consult the feature list and select the highest-priority incomplete item
4. Run `init.sh` to start the development environment
5. Execute baseline tests to verify existing functionality
6. Implement the selected feature
7. Verify through end-to-end testing
8. Update the progress file and commit with a descriptive message
The agent works one feature at a time, verifying completion before moving on. Git serves as the recovery mechanism — if a change breaks existing functionality, the agent can revert to the last known-good state.
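The commit-then-verify-else-recover loop might look like this sketch, using `git reset --hard` back to the last known-good commit; the `verify` hook stands in for the harness's end-to-end tests:

```python
import subprocess

def run(args: list[str], cwd: str) -> str:
    """Run a git command, raising on failure."""
    out = subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True)
    return out.stdout.strip()

def commit_or_rollback(repo: str, message: str, verify) -> bool:
    """Commit the working tree if verify() passes afterwards; otherwise
    hard-reset to the previous commit. Returns True if the commit survived."""
    good = run(["git", "rev-parse", "HEAD"], repo)   # last known-good state
    run(["git", "add", "-A"], repo)
    run(["git", "commit", "-m", message], repo)
    if verify():
        return True
    run(["git", "reset", "--hard", good], repo)      # recover the working state
    return False
```

Because every session ends on a verified commit, the worst case for a failed session is lost time, not a broken codebase handed to the next session.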
The Context Bridge
The `claude-progress.txt` file, together with the git history, forms a context bridge — structured artifacts that enable rapid orientation at the start of a fresh context window.
This is a deliberate engineering choice. Rather than relying on context compaction (which loses information) or massive context windows (which degrade attention), the harness externalizes state into durable files that any new session can read.
The pattern mirrors how professional software engineers work: clear handoff documentation, incremental progress tracking, and verified completion criteria. The harness doesn’t try to make the model superhuman — it gives the model the same tools that make human engineers effective.
Testing Strategy
A critical discovery: prompting agents toward user-perspective testing dramatically improved feature reliability compared to developer-focused testing.
| Testing Approach | Behavior | Result |
|---|---|---|
| Developer-focused | Agent checks code logic, runs unit tests | Misses UI bugs, integration failures |
| User-perspective | Agent uses browser automation (e.g., Puppeteer MCP) to test as a real user | Catches visual bugs, interaction failures, end-to-end issues |
The harness explicitly instructs the coding agent to verify features end-to-end using browser automation tools, testing from the user’s perspective rather than the developer’s.
Session Flow
Session 1 (Initializer)
├─ Create init.sh
├─ Generate comprehensive feature list
├─ Create claude-progress.txt
└─ Initial git commit
Session 2..N (Coding Agent)
├─ Read progress + git log
├─ Select next incomplete feature
├─ Run init.sh → start dev environment
├─ Baseline test existing features
├─ Implement selected feature
├─ End-to-end verification
├─ Update progress + git commit
└─ (context exhausted → new session starts)
Each session is self-contained: it reads everything it needs from durable artifacts, does incremental work, and leaves the project in a clean state for the next session. No session depends on information that exists only in a previous session’s context window.
Key Insights
Incremental over ambitious — Agents that attempt to implement everything at once run out of context mid-feature, leaving broken code. One-feature-at-a-time with verification is slower per session but reliably converges.
Git as safety net — Version control isn’t just for history; it’s the agent’s undo mechanism. When an implementation breaks existing features, resetting to the last known-good commit restores a working state instantly.
Structured handoffs over raw memory — Externalizing state into files (`claude-progress.txt`, the feature list, the git log) is more reliable than any in-context memory or compaction strategy. The information is lossless and always available.
Human-inspired practices — The most effective agent harnesses don’t invent novel AI workflows. They replicate the practices that make human engineering teams effective: clear documentation, incremental progress, verified completion, and clean handoffs.
Source
Effective Harnesses for Long-Running Agents — Justin Young, Anthropic, November 2025.