Agent Execution Loop

For authors of agent systems: what actually happens when a Claude Code conversation runs. Covers the query async generator, the 14-step per-turn pipeline, StreamingToolExecutor, and the retry, recovery, and circuit-break paths, all grounded in source.

The gap this chapter fills

Previous chapters covered the static composition: how the prompt is assembled, where memory lives, how compaction works, how permissions are decided. But one core question went unanswered: how does a conversation actually run?

  • When the user hits Enter, what does Claude Code do internally?
  • How is the ReAct loop implemented?
  • How are multi-turn tool calls scheduled?
  • How does recovery work after a turn errors?
  • What’s the machinery behind claude --resume picking up where you left off?

This chapter reconstructs the agent’s runtime lifecycle from source. Main references: query.ts (1729 lines), QueryEngine.ts (1295 lines), Task.ts (125 lines), and the query/ directory.


Top level: query() is an async generator

Claude Code’s main loop is a single function:

// query.ts line 219
export async function* query(
  params: QueryParams,
): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
> {
  const consumedCommandUuids: string[] = []
  const terminal = yield* queryLoop(params, consumedCommandUuids)
  for (const uuid of consumedCommandUuids) {
    notifyCommandLifecycle(uuid, 'completed')
  }
  return terminal
}

Two key design decisions:

  1. Async generator instead of Promise<Response> — the UI layer consumes yielded StreamEvents in real time; the character-by-character streaming output users see flows from here
  2. Terminal return type — the generator has an explicit exit reason (Terminal), not a black box. Callers can distinguish “normal completion / aborted / hit blocking limit / maxTurns exhausted” etc.

queryLoop is the internal implementation; query is a thin shell that adds command lifecycle notifications. The completed notification fires only on normal return: if the loop throws, or the generator is closed with .return(), the for loop after yield* never runs. This is a lifecycle asymmetry: started does not guarantee completed.
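A minimal sketch of how a caller can read both the yielded events and the final exit reason. The names DemoEvent, DemoTerminal, demoQuery, and drain are hypothetical stand-ins, not the real types. The point it illustrates is a property of the language: a for await loop discards a generator's return value, so a caller that needs the Terminal has to drive next() by hand (or delegate with yield*).

```typescript
// Hypothetical event and exit types; the real StreamEvent / Terminal are richer.
type DemoEvent = { type: 'text'; text: string }
type DemoTerminal = { reason: 'done' | 'aborted' | 'max_turns' }

// Stand-in for query(): yields events, then returns an explicit exit reason.
async function* demoQuery(chunks: string[]): AsyncGenerator<DemoEvent, DemoTerminal> {
  for (const text of chunks) {
    yield { type: 'text', text }
  }
  return { reason: 'done' }
}

// Drives the generator manually so the return value is not lost.
async function drain(
  gen: AsyncGenerator<DemoEvent, DemoTerminal>,
  onEvent: (e: DemoEvent) => void,
): Promise<DemoTerminal> {
  while (true) {
    const result = await gen.next()
    if (result.done) return result.value // the Terminal value
    onEvent(result.value) // a streamed event
  }
}
```

A UI would pass its render callback as onEvent and branch on the returned reason once the stream ends.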


QueryParams: the full input to one call

export type QueryParams = {
  messages: Message[]                      // Conversation so far
  systemPrompt: SystemPrompt                // Pre-assembled system prompt
  userContext: { [k: string]: string }      // CLAUDE.md / currentDate
  systemContext: { [k: string]: string }    // gitStatus / cacheBreaker
  canUseTool: CanUseToolFn                  // Permission check callback
  toolUseContext: ToolUseContext            // Tool execution context (mode, allowed tools)
  fallbackModel?: string                    // Fallback model on primary failure
  querySource: QuerySource                  // Call source ID (repl_main_thread / compact / ...)
  maxOutputTokensOverride?: number          // Single-turn output override
  maxTurns?: number                         // Loop upper bound
  skipCacheWrite?: boolean                  // Skip cache write
  taskBudget?: { total: number }            // API-side task budget (beta)
  deps?: QueryDeps                          // Injectable deps (for testing)
}

Injectable deps is a clever design (query/deps.ts):

export type QueryDeps = {
  callModel: typeof queryModelWithStreaming    // LLM call
  microcompact: typeof microcompactMessages    // Tool result clearing
  autocompact: typeof autoCompactIfNeeded      // Auto compaction
  uuid: () => string                            // ID generation
}

The source comment explains why: “tests can inject fakes directly instead of spyOn-per-module — the most common mocks (callModel, autocompact) are each spied in 6-8 test files today with module-import-and-spy boilerplate”.

Takeaway for your own agent: the top-level loop function’s dependencies must be injectable — tests inject fakes directly, no spy-on-6-modules boilerplate. The foundation of testability.
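The injection pattern can be sketched as follows. This is a simplified illustration, not the real QueryDeps: the Deps type, productionDeps, and runTurn are hypothetical, and callModel is reduced to a string-in, string-out function.

```typescript
// Hypothetical, simplified dependency bag; the real QueryDeps has four members.
type Deps = {
  callModel: (prompt: string) => Promise<string>
  uuid: () => string
}

// Production defaults would wrap the real API client; stubbed out here.
const productionDeps: Deps = {
  callModel: async () => {
    throw new Error('network disabled in this sketch')
  },
  uuid: () => Math.random().toString(36).slice(2),
}

// The loop receives its collaborators as a parameter instead of importing them.
async function runTurn(prompt: string, deps: Deps = productionDeps) {
  const id = deps.uuid()
  const reply = await deps.callModel(prompt)
  return { id, reply }
}
```

A test then passes a plain object such as { callModel: async () => 'pong', uuid: () => 'id-1' } and asserts on the result, with no module-level spying.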


Loop state: a 10-field state machine

queryLoop is an infinite while loop with a State object carrying cross-iteration state:

type State = {
  messages: Message[]                                            // Conversation history
  toolUseContext: ToolUseContext                                 // Tool context
  autoCompactTracking: AutoCompactTrackingState | undefined      // Compaction circuit breaker
  maxOutputTokensRecoveryCount: number                           // Max-tokens recovery count
  hasAttemptedReactiveCompact: boolean                           // Has reactive compaction been tried?
  maxOutputTokensOverride: number | undefined                    // Output token override
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null>   // Async tool-use summary
  stopHookActive: boolean | undefined                            // Stop hook running?
  turnCount: number                                              // Current turn
  transition: Continue | undefined                               // Why did the last iteration continue
}

Each field answers one specific question, no redundancy:

  • transition: why the last iteration didn’t finish, which directly drives how the next iteration is handled. The source comment notes that it “Lets tests assert recovery paths fired without inspecting message contents”.
  • maxOutputTokensRecoveryCount: an independent sub-loop counter — on max-output-tokens error, the loop retries with a larger cap multiple times; this isn’t the global turnCount
  • hasAttemptedReactiveCompact: each turn can try reactive compact once only, avoiding infinite loops
  • pendingToolUseSummary: tool-use summary runs async — main loop doesn’t wait for it, background generation

Takeaway for your own agent: a state with many fields isn’t inherently bad. What matters is that each field has a specific semantic responsibility. 10 fields × clear semantics beats 4 fields × { [key: string]: any }.
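To make the role of transition and the recovery counter concrete, here is a trimmed-down sketch. LoopState, Outcome, advance, and the cap of 3 attempts are assumptions made for illustration; the real Continue type and recovery policy are not shown in this chapter.

```typescript
// Hypothetical, trimmed-down versions of the loop's types.
type Continue =
  | { reason: 'tool_results' }
  | { reason: 'max_output_tokens_recovery'; attempt: number }

type LoopState = {
  turnCount: number
  maxOutputTokensRecoveryCount: number
  transition: Continue | undefined
}

type Outcome = 'tool_use' | 'max_output_tokens' | 'end_turn'

const MAX_RECOVERY_ATTEMPTS = 3 // assumed cap for this sketch

// Returns the next state, or undefined when the loop should stop.
function advance(state: LoopState, outcome: Outcome): LoopState | undefined {
  switch (outcome) {
    case 'tool_use':
      // A normal follow-up turn: only the global turn counter moves.
      return { ...state, turnCount: state.turnCount + 1, transition: { reason: 'tool_results' } }
    case 'max_output_tokens': {
      // A recovery retry: its own counter, independent of turnCount.
      const attempt = state.maxOutputTokensRecoveryCount + 1
      if (attempt > MAX_RECOVERY_ATTEMPTS) return undefined
      return {
        ...state,
        maxOutputTokensRecoveryCount: attempt,
        transition: { reason: 'max_output_tokens_recovery', attempt },
      }
    }
    case 'end_turn':
      return undefined
  }
}
```

Because the reason for continuing is recorded as data, a test can assert on state.transition directly rather than parsing message text.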


Per-turn 14 steps: the processing pipeline

[Figure: query() 14-step per-turn pipeline. The async generator yields events while queryLoop runs while (true) over a 10-field state machine. Phase 1, setup: steps 1-3. Phase 2, history prep: steps 4-9, a cheap-to-expensive compaction cascade (autocompact fires at effectiveWindow - 13k; 3 consecutive failures trip the circuit breaker). Phase 3, run and tools: steps 10-14. If needsFollowUp is true the loop continues; otherwise it returns one of 6 terminal reasons: done, aborted, blocking_limit, max_turns, stop_hook_blocked, error.]

Each while-loop iteration runs up to these 14 steps (many are conditional and may be skipped):

| # | Step | Source location | Purpose |
| --- | --- | --- | --- |
| 1 | State destructure | query.ts lines 311-321 | Pull this turn’s needed fields from State |
| 2 | Skill prefetch | line 331 startSkillDiscoveryPrefetch | Concurrent prefetch of relevant skills, runs during LLM streaming |
| 3 | Yield stream_request_start | line 337 | UI “starting” signal |
| 4 | getMessagesAfterCompactBoundary | line 365 | Only look at messages after the compact boundary; already-compacted skip |
| 5 | applyToolResultBudget | line 379 | Enforce per-message tool-result budget |
| 6 | HISTORY_SNIP (feature-flagged) | line 401 | Strategy-based history clearing |
| 7 | Microcompact | line 414 | Tool result clearing (old Read/Bash results) |
| 8 | Context Collapse (feature-flagged) | line 441 | Alternative context management system |
| 9 | Auto-compact | line 454 | LLM summarization (see Compaction) |
| 10 | Blocking limit check | line 641 | If hitting hard ceiling, yield error and return { reason: 'blocking_limit' } |
| 11 | callModel streaming call | line 659 deps.callModel | Call LLM, stream messages/events |
| 12 | Yield messages | line 708+ | Yield to UI one by one (including tombstone rollback) |
| 13 | Tool execution | StreamingToolExecutor or runTools | Parallel / serial tool execution |
| 14 | Collect toolResults, decide continue | end of loop | needsFollowUp = toolUseBlocks.length > 0 |

Key details:

2. Skill prefetch runs parallel to LLM streaming

const pendingSkillPrefetch = skillPrefetch?.startSkillDiscoveryPrefetch(...)
// ... continue processing
// ... call LLM, streaming receive
// ... skill prefetch runs in background during LLM response

The comment says: “Replaces the blocking assistant_turn path that ran inside getAttachmentMessages (97% of those calls found nothing in prod).” Originally skill discovery blocked; in production 97% of calls found nothing but still blocked the whole turn. Now concurrent, near-zero cost.
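The start-early, await-late pattern can be sketched like this. runTurnWithPrefetch and its two callback parameters are hypothetical; the only point shown is that the prefetch promise is created before the model call and awaited after it, with failures degraded to an empty result (an assumed policy) so a broken prefetch cannot fail the turn.

```typescript
// Hypothetical turn function: both collaborators are passed in as callbacks.
async function runTurnWithPrefetch(
  discoverSkills: () => Promise<string[]>,
  callModel: () => Promise<string>,
): Promise<{ reply: string; skills: string[] }> {
  // Start the prefetch now, but do not await it yet.
  const pendingSkills = discoverSkills().catch(() => [] as string[])

  // The model call proceeds while the prefetch runs in the background.
  const reply = await callModel()

  // Collect the prefetch result only when it is actually needed.
  const skills = await pendingSkills
  return { reply, skills }
}
```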

4. getMessagesAfterCompactBoundary — compaction boundary protection

After compaction, old messages are replaced with a summary, so this step takes only the messages after the most recent boundary. Comment: “REPL keeps snipped messages for UI scrollback — project so the compact model doesn’t summarize content that was intentionally removed”. In other words, the REPL keeps snipped messages so the UI can show them in scrollback, but they are projected out of what the model sees; otherwise the compaction model would summarize content that was removed on purpose.

5. applyToolResultBudget — per-message tool result budget

Enforce per-message budget on aggregate tool result size. Runs BEFORE microcompact — cached MC operates purely by tool_use_id (never inspects content), so content replacement is invisible to it and the two compose cleanly.

Meaning: tool results share a per-message size budget (different tools can have different ceilings), and content that exceeds it is replaced with a placeholder. The ordering matters: the budget runs before microcompact, and because cached microcompact keys purely on tool_use_id and never reads content, the replacement is invisible to it and the two steps compose cleanly.
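A rough sketch of a per-message budget, under stated assumptions: the ToolResult shape, the placeholder text, the character-based budget, and the policy of keeping results in order until the budget is spent are all invented for illustration. What it does preserve from the description above is that only content is replaced and tool_use_id is left untouched, which is why an ID-keyed later step would not notice.

```typescript
// Hypothetical shape; real tool results carry richer content blocks.
type ToolResult = { tool_use_id: string; content: string }

const PLACEHOLDER = '[tool result omitted: over per-message budget]'

// Keeps results whole while they fit in the aggregate budget and swaps the
// content of any result that does not fit for a placeholder.
function applyBudget(results: ToolResult[], maxChars: number): ToolResult[] {
  let used = 0
  return results.map(r => {
    if (used + r.content.length <= maxChars) {
      used += r.content.length
      return r
    }
    // Only the content changes; the tool_use_id is preserved.
    return { ...r, content: PLACEHOLDER }
  })
}
```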

10. Blocking limit — proactive block before hard ceiling

const { isAtBlockingLimit } = calculateTokenWarningState(
  tokenCountWithEstimation(messagesForQuery) - snipTokensFreed,
  model,
)
if (isAtBlockingLimit) {
  yield createAssistantAPIErrorMessage({ content: PROMPT_TOO_LONG_ERROR_MESSAGE, ... })
  return { reason: 'blocking_limit' }
}

When auto-compaction is disabled, this check blocks the request before it goes over the limit, reserving MANUAL_COMPACT_BUFFER_TOKENS = 3000 so that a manual /compact still has room to run. The source comment details four cases in which this gate must be skipped:

  • Just compacted (compactionResult) — usage counts are stale
  • querySource === 'compact' / 'session_memory' — forked agents would deadlock
  • Reactive compact enabled — let actual 413 trigger reactive
  • Context collapse enabled — collapse manages itself

Takeaway for your own agent: hard-ceiling interception shouldn’t be a global switch — must be able to precisely exempt special call paths. Otherwise those paths deadlock at blocking limit.
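The four exemptions can be expressed as early returns in front of the token comparison. This is a sketch only: GateInput and shouldBlock are hypothetical names, and the real gate derives these flags from its own state rather than taking them as a flat record.

```typescript
// Hypothetical inputs mirroring the four exemptions listed above.
type GateInput = {
  tokenCount: number
  blockingLimit: number
  justCompacted: boolean
  querySource: string
  reactiveCompactEnabled: boolean
  contextCollapseEnabled: boolean
}

function shouldBlock(input: GateInput): boolean {
  // 1. Usage counts are stale right after a compaction.
  if (input.justCompacted) return false
  // 2. Forked compaction / session-memory agents would deadlock here.
  if (input.querySource === 'compact' || input.querySource === 'session_memory') return false
  // 3. Let a real 413 trigger reactive compaction instead.
  if (input.reactiveCompactEnabled) return false
  // 4. Context collapse manages the window itself.
  if (input.contextCollapseEnabled) return false
  return input.tokenCount >= input.blockingLimit
}
```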

13. Streaming Tool Executor — execute while streaming

const useStreamingToolExecution = config.gates.streamingToolExecution
let streamingToolExecutor = useStreamingToolExecution
  ? new StreamingToolExecutor(...)
  : null

Two paths:

  • Traditional: LLM finishes streaming → parse tool_use → serial / parallel execute → get results
  • Streaming (StreamingToolExecutor): tool_use block starts executing immediately as the LLM streams it out, not waiting for the full assistant message

Latency drops significantly — for multiple independent tool calls, approaches parallel wall-clock time.
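The core idea of the streaming path can be sketched in a few lines. MiniStreamingExecutor is a hypothetical toy, not the real StreamingToolExecutor: it assumes every tool is independent and safe to run concurrently, and it ignores permissions, ordering constraints, and cancellation.

```typescript
type ToolUse = { id: string; name: string }
type ToolOutcome = { id: string; output: string; isError: boolean }
type RunTool = (block: ToolUse) => Promise<string>

class MiniStreamingExecutor {
  private readonly runTool: RunTool
  private readonly pending: Promise<ToolOutcome>[] = []

  constructor(runTool: RunTool) {
    this.runTool = runTool
  }

  // Called as each tool_use block arrives; the tool starts immediately.
  add(block: ToolUse): void {
    const outcome = this.runTool(block).then(
      (output): ToolOutcome => ({ id: block.id, output, isError: false }),
      // Failures become error results so one bad tool cannot reject the batch.
      (err): ToolOutcome => ({ id: block.id, output: String(err), isError: true }),
    )
    this.pending.push(outcome)
  }

  // Called after the stream ends; resolves in arrival order.
  results(): Promise<ToolOutcome[]> {
    return Promise.all(this.pending)
  }
}
```

Since add() never awaits, the total wait after the stream ends is bounded by the slowest tool rather than the sum of all tools.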


Model fallback + streaming fallback

Line 654’s while (attemptWithFallback) is double-layer fallback logic:

  1. Model fallback: primary model fails (API error / throttle) → switch to fallbackModel and retry
  2. Streaming fallback: streaming mode errors (e.g., thinking block exception) → fall back to non-streaming
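The first layer, model fallback, reduces to a try-once-then-switch pattern. The sketch below is an assumption-heavy simplification: callWithFallback and CallModel are hypothetical, and it falls back on any error, whereas the real loop presumably distinguishes which failures qualify.

```typescript
type CallModel = (model: string) => Promise<string>

// Tries the primary model; on failure, retries once on the fallback if one is set.
async function callWithFallback(
  call: CallModel,
  primary: string,
  fallback?: string,
): Promise<{ model: string; reply: string }> {
  try {
    return { model: primary, reply: await call(primary) }
  } catch (err) {
    // No fallback configured: surface the original error unchanged.
    if (fallback === undefined) throw err
    return { model: fallback, reply: await call(fallback) }
  }
}
```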

One intricate but important detail in the source: when streaming fallback fires, the partially streamed assistant messages must be tombstoned. They may carry invalid thinking-block signatures, and resubmitting them would make the API call fail.

if (streamingFallbackOccured) {
  for (const msg of assistantMessages) {
    yield { type: 'tombstone' as const, message: msg }  // UI / transcript: delete this
  }
  assistantMessages.length = 0
  toolResults.length = 0
  // ... discard pending tool results, recreate executor
}

Tombstone messages are UI / transcript “deletion markers” — the messages already streamed can’t be recalled from the client, but tombstones tell downstream “this is void.”

Takeaway for your own agent: streaming output needs a retraction mechanism. When a problem surfaces midway through a stream, the characters already sent cannot be taken back, so you need an explicit void marker.
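On the consuming side, a tombstone can be handled as just another event in the fold that builds the transcript. This is a sketch under assumptions: Msg, LoopEvent, and applyEvent are hypothetical, and messages are matched by a uuid field, which the real consumer may or may not do.

```typescript
type Msg = { uuid: string; text: string }
type LoopEvent =
  | { type: 'message'; message: Msg }
  | { type: 'tombstone'; message: Msg }

// Folds one event into the transcript a UI would render.
function applyEvent(transcript: Msg[], event: LoopEvent): Msg[] {
  switch (event.type) {
    case 'message':
      return [...transcript, event.message]
    case 'tombstone':
      // The message was already shown; drop it now that it is void.
      return transcript.filter(m => m.uuid !== event.message.uuid)
  }
}
```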


Termination: 6 Terminal reasons

From all return { reason: ... } branches in query.ts, the loop can terminate with these reasons:

| Reason | Condition | Meaning |
| --- | --- | --- |
| blocking_limit | Hit hard ceiling | Manual compact can’t help |
| max_turns | turnCount > maxTurns | User-set turn upper bound |
| done | No tool_use this turn | Model believes task complete |
| aborted | abortController.signal.aborted | User interrupted |
| stop_hook_blocked | Stop hook returned block | User hook blocked continuation |
| error | Other exceptions | Unrecoverable error |

Different reasons trigger different follow-ups — “done” shows completion in UI, “aborted” shows “cancelled”, “blocking_limit” prompts manual compact, “max_turns” suggests raising maxTurns.
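A caller can make this mapping exhaustive with a switch over the reason, so that adding a new reason becomes a compile error until it is handled. The sketch assumes Terminal is just { reason } (the real type may carry more data); the wording for stop_hook_blocked and error is invented, while the other four follow the paragraph above.

```typescript
type TerminalReason =
  | 'done'
  | 'aborted'
  | 'blocking_limit'
  | 'max_turns'
  | 'stop_hook_blocked'
  | 'error'

// Assumed minimal shape for this sketch.
type Terminal = { reason: TerminalReason }

function followUpFor(terminal: Terminal): string {
  const reason = terminal.reason
  switch (reason) {
    case 'done':
      return 'Show completion'
    case 'aborted':
      return 'Show "cancelled"'
    case 'blocking_limit':
      return 'Prompt for manual /compact'
    case 'max_turns':
      return 'Suggest raising maxTurns'
    case 'stop_hook_blocked':
      return 'Report that a Stop hook blocked continuation' // assumed wording
    case 'error':
      return 'Surface the error' // assumed wording
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a reason is missed.
      const unhandled: never = reason
      throw new Error(`Unhandled terminal reason: ${String(unhandled)}`)
    }
  }
}
```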


Task Layer: 7 task types

Top-level agent invocation is wrapped in Task.ts. 7 TaskType values:

export type TaskType =
  | 'local_bash'           // Local shell task
  | 'local_agent'          // Locally running subagent
  | 'remote_agent'         // CCR cloud agent
  | 'in_process_teammate'  // In-process teammate
  | 'local_workflow'       // Local workflow
  | 'monitor_mcp'          // MCP server monitor
  | 'dream'                // Nightly memory-curation dream job

5 statuses:

export type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed'

Task IDs have prefixes (TASK_ID_PREFIXES):

{
  local_bash: 'b',           // Kept as 'b' for backward compatibility
  local_agent: 'a',
  remote_agent: 'r',
  in_process_teammate: 't',
  local_workflow: 'w',
  monitor_mcp: 'm',
  dream: 'd',
}

Random 8-char suffix from 36^8 ≈ 2.8 trillion combinations — source comment: “sufficient to resist brute-force symlink attacks”.

Why IDs must resist symlink attacks: task output file paths come from ID (getTaskOutputPath(id)). If an attacker can predict the ID, they can pre-create a symlink pointing to an arbitrary file, making the task’s stdout write to that file. 36^8 entropy makes this impractical.

Takeaway for your own agent: any system using IDs as file paths must consider ID predictability — “ID not random enough causing race conditions or attacks” is a common production incident.
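One way to produce such an ID is sketched below. generateTaskId and the alphabet are assumptions (the source only states the prefix table and the 36^8 space); the sketch uses Node's crypto.randomInt so each character is drawn uniformly from a cryptographic source instead of Math.random.

```typescript
import { randomInt } from 'node:crypto'

const TASK_ID_PREFIXES = {
  local_bash: 'b',
  local_agent: 'a',
  remote_agent: 'r',
  in_process_teammate: 't',
  local_workflow: 'w',
  monitor_mcp: 'm',
  dream: 'd',
} as const

type TaskType = keyof typeof TASK_ID_PREFIXES

// Assumed 36-symbol alphabet, which gives the 36^8 suffix space.
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'
const SUFFIX_LENGTH = 8

function generateTaskId(type: TaskType): string {
  let suffix = ''
  for (let i = 0; i < SUFFIX_LENGTH; i++) {
    // randomInt(max) returns a uniform integer in [0, max).
    suffix += ALPHABET.charAt(randomInt(ALPHABET.length))
  }
  return TASK_ID_PREFIXES[type] + suffix
}
```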


Resume: claude --resume and session storage

The compaction chapter mentioned “Background jobs that summarize previous conversations for the claude --resume feature”: the summary that resume relies on is pre-computed in the background.

Mechanism breakdown:

  1. Session storage: every conversation writes to ~/.claude/projects/<project>/sessions/<sessionId>.jsonl
  2. Background summary agent: after exiting Claude Code, a background job reads the session file and produces a summary
  3. On resume: read session + summary, reconstruct State, enter queryLoop

Resume is near-instant because the summary is already computed. A related source comment, “subscribers can use /stats to view usage patterns”, indicates that usage data also persists and is visible across sessions.
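Reading a session file back is, at its simplest, a matter of parsing one JSON record per line. The sketch below is illustrative only: the SessionRecord shape and parseSessionLog are hypothetical, since the actual record format is not shown here. It also assumes that a crash mid-write can leave a truncated final line, and skips unparseable lines rather than failing the whole resume.

```typescript
// Hypothetical record shape; the real session format is not documented here.
type SessionRecord = { uuid: string; role: 'user' | 'assistant'; text: string }

// Parses JSONL text into records, tolerating blank and truncated lines.
function parseSessionLog(jsonl: string): SessionRecord[] {
  const records: SessionRecord[] = []
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue
    try {
      records.push(JSON.parse(line) as SessionRecord)
    } catch {
      // Unparseable line (for example, a partial write): skip it.
    }
  }
  return records
}
```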


Full-chain abort path

toolUseContext.abortController is the central cancellation point threaded through the call chain:

User presses Esc
  → abortController.abort()
  → signal.aborted = true
  → LLM stream interrupted (AbortSignal passed to fetch)
  → all in-flight tools receive signal (tool.execute's second arg)
  → each tool's cleanup (bwrap process kill, file lock release, ...)
  → queryLoop checks signal.aborted → return { reason: 'aborted' }

Key design: each layer observes the signal itself, not waiting for upper-layer notification. LLM fetch natively supports AbortSignal; tool execute’s second arg always includes the signal; bash process wait watches the signal — cancellation is broadcast across the chain, not forwarded layer-by-layer.

Takeaway for your own agent: AbortController must thread from entry point to every leaf operation. Half-assed abort support is worse than none — users think they cancelled but something’s still running.
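The broadcast property can be shown with the standard AbortController API. In this sketch startTool and exitReason are hypothetical helpers: each leaf registers its own cleanup on the shared signal, and the loop reads the same signal to choose its exit reason, so nothing has to forward the cancellation.

```typescript
// Hypothetical leaf operation: it receives the signal and registers its own cleanup.
function startTool(label: string, signal: AbortSignal, log: string[]): void {
  if (signal.aborted) {
    // Already cancelled before starting: do no work at all.
    log.push(`${label}: not started`)
    return
  }
  signal.addEventListener('abort', () => { log.push(`${label}: cleaned up`) }, { once: true })
}

// Hypothetical loop check: the loop reads the same signal to pick its exit reason.
function exitReason(signal: AbortSignal): 'aborted' | 'done' {
  return signal.aborted ? 'aborted' : 'done'
}

const controller = new AbortController()
const cleanupLog: string[] = []
startTool('bash', controller.signal, cleanupLog)
startTool('read', controller.signal, cleanupLog)
controller.abort() // stands in for the user pressing Esc
startTool('grep', controller.signal, cleanupLog) // arrives after the abort
```

A single abort() call reaches both running tools at once, and a tool that arrives late observes the aborted flag instead of starting.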


QueryEngine: the level of a single LLM call

QueryEngine.ts (1295 lines) is the layer beneath the callModel function, dedicated to consuming the stream of a single LLM call:

  • Parse SSE events (content_block_start / content_block_delta / content_block_stop / message_delta / …)
  • Build assistant messages
  • Handle max_tokens / stop_reason / various API errors
  • Streaming fallback (discussed above)
  • Thinking block special handling (signature verification)
  • Usage tracking (including cache read / cache creation token counts separately)

This layer’s complexity comes from assembling the API’s “low-level event stream” into “high-level assistant messages” while staying cancel-safe, error-safe, and partial-state-safe. This is not toy code: a production-grade streaming API consumer easily runs past 1,000 lines.
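The assembly step can be illustrated with a small reducer over a subset of the streaming events named above. This is a sketch, not the QueryEngine logic: the StreamEvent union covers only text deltas, and Assembled and assemble are hypothetical. Tool-use input deltas, thinking blocks, usage, and error handling are all left out.

```typescript
// Subset of the streaming event types; only text deltas are modeled.
type StreamEvent =
  | { type: 'content_block_start'; index: number }
  | { type: 'content_block_delta'; index: number; delta: { type: 'text_delta'; text: string } }
  | { type: 'content_block_stop'; index: number }
  | { type: 'message_delta'; delta: { stop_reason: string | null } }

type Assembled = { blocks: string[]; stopReason: string | null }

// Folds low-level events into one high-level message.
function assemble(events: StreamEvent[]): Assembled {
  const out: Assembled = { blocks: [], stopReason: null }
  for (const e of events) {
    switch (e.type) {
      case 'content_block_start':
        out.blocks[e.index] = ''
        break
      case 'content_block_delta':
        // Append to the block this delta belongs to.
        out.blocks[e.index] = (out.blocks[e.index] ?? '') + e.delta.text
        break
      case 'content_block_stop':
        break
      case 'message_delta':
        out.stopReason = e.delta.stop_reason
        break
    }
  }
  return out
}
```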


Takeaways for building your own agent

  1. Main loop as async generator, not Promise<Response> — real-time streaming to UI is baseline for agent products
  2. Terminal return value must carry reason — different termination reasons drive different UX; opaque undefined can’t support that
  3. Dependency injection: high-frequency mocks (callModel, microcompact, autocompact) as a deps object — otherwise tests must spy 6-8 modules
  4. State fields with clear semantics: 10 fields each answering one question beats an any bucket
  5. Per-turn processing is an explicit pipeline: snip → microcompact → collapse → autocompact → blocking check → model call → tools. The order is a design choice; annotate why this order
  6. Concurrent prefetch: skill / memory / cache params can run in background during LLM streaming — manage with Promise.all or using disposable for lifecycle
  7. Streaming tool executor: tool_use block starts executing as LLM streams it, not waiting for the full assistant message — significantly lower latency for multi-tool scenarios
  8. Tombstone messages as explicit “streamed but void” markers — streaming output can’t be client-side retracted; need an explicit void marker
  9. Model fallback + streaming fallback are two layers — handling API errors vs streaming exceptions
  10. AbortController threads to leaves — each layer watches the signal itself, not forwarded layer-by-layer. Half-assed cancel is worse than none
  11. Task IDs need enough entropy — when IDs become filenames, consider symlink attacks. 36^8 is Claude Code’s choice
  12. Resume’s compaction is pre-computed in the background: the summary is ready before the user asks for it, so resume feels instant

Further reading

  • Claude Code source: query.ts, QueryEngine.ts, Task.ts, query/{deps,stopHooks,tokenBudget,config}.ts
  • Compaction — steps 7-9 of the per-turn pipeline detailed here
  • Execution Environment — worktree / remote implementations behind the Task types
  • Permissions: canUseTool’s decision path