Long-Running Harness

Making agents work across multiple context windows — the initializer/coder two-agent harness pattern for multi-session autonomous tasks

The Core Challenge

AI agents operate in discrete sessions. Each session begins with a fresh context window — no memory of what came before. For tasks that span hours or days, this is a fundamental constraint: the agent must somehow maintain coherence across multiple context windows, each starting from zero.

This mirrors the challenge of shift-based engineering teams. Each new shift worker arrives without context from the previous shift. Without structured handoff protocols, work degrades: tasks are repeated, progress is lost, and the project drifts.


Two Failure Modes

Anthropic’s testing of long-running autonomous coding agents revealed two critical failure patterns:

| Failure Mode | Symptom |
| --- | --- |
| Over-ambition | Agent attempts complete implementation in a single session, runs out of context mid-feature |
| Premature completion | Later agent instances see partial progress and declare the project finished |

Both failures stem from the same root cause: the agent lacks structured awareness of what has been done and what remains.


Two-Agent Solution

[Diagram: Two-Agent Long-Running Pattern. The Initializer (Session 1) performs one-time setup: create init.sh, generate the feature list, create progress.txt, and make the initial git commit. These persist as the Context Bridge, durable artifacts on disk (init.sh, feature-list.json, claude-progress.txt, git history) that are lossless and always available. Coder Sessions 2..N read the progress, select one feature, implement and verify it, then commit and update, one feature at a time. The Initializer creates the artifacts; coder sessions read and update them. Each session is self-contained: it reads from durable artifacts, does incremental work, and leaves clean state for the next session.]

The solution splits agent work into two specialized roles, each with its own prompt and protocol:

Initializer Agent

The first session runs a specialized initializer that establishes the project infrastructure:

  1. init.sh — executable script for environment setup (dependencies, dev server, etc.)
  2. Feature list — comprehensive JSON catalog of all required functionality with verification steps
  3. claude-progress.txt — work history document for cross-session continuity
  4. Initial git commit — baseline state that all subsequent work builds on
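The four artifacts above can be sketched as a single setup function. This is a minimal illustration, not the harness's actual implementation: the file names come from the source, but the file contents and the `initialize_project` helper are placeholders.

```python
import json
import os
import stat

def initialize_project(root: str) -> None:
    """One-time setup sketch for the initializer session. File names come from
    the harness; the file contents here are illustrative placeholders."""
    # 1. init.sh: executable environment-setup script (contents are project-specific)
    init_sh = os.path.join(root, "init.sh")
    with open(init_sh, "w") as f:
        f.write("#!/bin/sh\nnpm install\nnpm run dev &\n")
    os.chmod(init_sh, os.stat(init_sh).st_mode | stat.S_IXUSR)

    # 2. feature-list.json: full catalog of required functionality, all unverified
    features = [{
        "category": "functional",
        "description": "New chat button creates a fresh conversation",
        "steps": ["Click new chat button", "Verify empty conversation appears"],
        "passes": False,
    }]
    with open(os.path.join(root, "feature-list.json"), "w") as f:
        json.dump(features, f, indent=2)

    # 3. claude-progress.txt: work history for cross-session continuity
    with open(os.path.join(root, "claude-progress.txt"), "w") as f:
        f.write("Session 1: infrastructure created; no features implemented yet.\n")

    # 4. The real harness ends with an initial `git commit` as the baseline state.
```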

The feature list is the key artifact. Each entry contains:

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": ["Click new chat button", "Verify empty conversation appears", "Verify URL updates"],
  "passes": false
}

Critical constraint: subsequent agents may only modify the passes field — never the feature descriptions or steps. This prevents drift in success criteria across sessions.
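The only-edit-`passes` rule can be enforced mechanically rather than left to prompting. The sketch below shows one way, assuming list order and exact-match descriptions; the `mark_feature_passed` helper is hypothetical, but the constraint it enforces is the harness rule.

```python
import json

def mark_feature_passed(path: str, description: str) -> None:
    """Flip one feature's `passes` flag and nothing else. The helper name is
    hypothetical; the only-edit-passes constraint is the harness rule."""
    with open(path) as f:
        features = json.load(f)
    for feature in features:
        if feature["description"] == description:
            feature["passes"] = True  # the single field a coder session may change
            break
    else:
        raise KeyError(f"no feature described as {description!r}")
    with open(path, "w") as f:
        json.dump(features, f, indent=2)  # descriptions and steps written back verbatim
```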

Coding Agent

Every subsequent session follows a structured startup protocol:

  1. Execute pwd to verify working directory
  2. Read claude-progress.txt and recent git log
  3. Consult feature list, select highest-priority incomplete item
  4. Run init.sh to start development environment
  5. Execute baseline tests to verify existing functionality
  6. Implement the selected feature
  7. Verify through end-to-end testing
  8. Update progress file, commit with descriptive message

The agent works one feature at a time, verifying completion before moving on. Git serves as the recovery mechanism — if a change breaks existing functionality, the agent can revert to the last known-good state.
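Steps 2-3 of the protocol also guard against the premature-completion failure mode: a session may declare the project finished only when every `passes` flag is true. A minimal sketch, assuming priority equals list order (the source does not specify an ordering):

```python
import json

def select_next_feature(feature_list_path: str):
    """Consult the feature list and pick the first incomplete item.
    Priority ordering is assumed to be list order here."""
    with open(feature_list_path) as f:
        features = json.load(f)
    for feature in features:
        if not feature["passes"]:
            return feature      # work on exactly this one feature this session
    return None                 # every feature passes: the project is truly done
```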


The Context Bridge

The claude-progress.txt file, together with git history, forms a context bridge — structured artifacts that let a fresh context window orient itself quickly.

This is a deliberate engineering choice. Rather than relying on context compaction (which loses information) or massive context windows (which degrade attention), the harness externalizes state into durable files that any new session can read.

The pattern mirrors how professional software engineers work: clear handoff documentation, incremental progress tracking, and verified completion criteria. The harness doesn’t try to make the model superhuman — it gives the model the same tools that make human engineers effective.


Testing Strategy

A critical discovery: prompting agents toward user-perspective testing dramatically improved feature reliability compared to developer-focused testing.

| Testing Approach | Behavior | Result |
| --- | --- | --- |
| Developer-focused | Agent checks code logic, runs unit tests | Misses UI bugs, integration failures |
| User-perspective | Agent uses browser automation (e.g., Puppeteer MCP) to test as a real user | Catches visual bugs, interaction failures, end-to-end issues |

The harness explicitly instructs the coding agent to verify features end-to-end using browser automation tools, testing from the user’s perspective rather than the developer’s.
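User-perspective verification falls naturally out of the feature list: each entry's `steps` are user actions, so verification is just executing them in order against the live UI. The sketch below is hypothetical; the `actions` callables stand in for real browser-automation calls (in the harness, via the Puppeteer MCP server).

```python
def verify_feature(feature, actions):
    """Run a feature's verification steps from the user's perspective. `actions`
    maps each step's text to a callable that drives the real UI; the callables
    here are stand-ins, and this helper itself is a hypothetical sketch."""
    for step in feature["steps"]:
        try:
            actions[step]()     # e.g. click the button in an actual browser
        except Exception:
            return False        # any failing step leaves `passes` as False
    return True
```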


Session Flow

Session 1 (Initializer)
  ├─ Create init.sh
  ├─ Generate comprehensive feature list
  ├─ Create claude-progress.txt
  └─ Initial git commit

Session 2..N (Coding Agent)
  ├─ Read progress + git log
  ├─ Select next incomplete feature
  ├─ Run init.sh → start dev environment
  ├─ Baseline test existing features
  ├─ Implement selected feature
  ├─ End-to-end verification
  ├─ Update progress + git commit
  └─ (context exhausted → new session starts)

Each session is self-contained: it reads everything it needs from durable artifacts, does incremental work, and leaves the project in a clean state for the next session. No session depends on information that exists only in a previous session’s context window.
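The session flow above can be modeled as an outer loop in which each iteration is a fresh context window that sees only what is on disk. This is an illustrative simulation, not the harness itself; `implement` is a hypothetical callable that builds and verifies one feature.

```python
import json

def run_sessions(features_path: str, max_sessions: int, implement) -> int:
    """Outer-loop sketch: each iteration models one fresh-context session that
    reads all state from disk, completes at most one feature, and writes state
    back. `implement` returns True on verified success. Returns sessions spent."""
    for session in range(1, max_sessions + 1):
        with open(features_path) as f:
            features = json.load(f)               # orientation: no in-context memory
        todo = [ft for ft in features if not ft["passes"]]
        if not todo:
            return session - 1                    # nothing left: genuinely finished
        if implement(todo[0]):                    # one feature, verified end to end
            todo[0]["passes"] = True
            with open(features_path, "w") as f:
                json.dump(features, f, indent=2)  # durable handoff for next session
    return max_sessions
```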


Key Insights

Incremental over ambitious — Agents that attempt to implement everything at once run out of context mid-feature, leaving broken code. One-feature-at-a-time with verification is slower per session but reliably converges.

Git as safety net — Version control isn’t just for history. It’s the agent’s undo mechanism. When an implementation breaks existing features, git revert recovers the last working state instantly.

Structured handoffs over raw memory — Externalizing state into files (claude-progress.txt, feature list, git log) is more reliable than any in-context memory or compaction strategy. The information is lossless and always available.

Human-inspired practices — The most effective agent harnesses don’t invent novel AI workflows. They replicate the practices that make human engineering teams effective: clear documentation, incremental progress, verified completion, and clean handoffs.


Source

Effective Harnesses for Long-Running Agents — Justin Young, Anthropic, November 2025.
