Agent Execution Loop

The gap this chapter fills

Previous chapters covered the static composition: how the prompt is assembled, where memory lives, how compaction works, how permissions are decided. But one core question went unanswered: how does a conversation actually run?

When the user hits Enter, what does Claude Code do internally?
How is the ReAct loop implemented?
How are multi-turn tool calls scheduled?
How does recovery work after a turn errors?
What’s the machinery behind claude --resume picking up where you left off?

This chapter reconstructs the agent’s runtime lifecycle from source. Main references: query.ts (1729 lines), QueryEngine.ts (1295 lines), Task.ts (125 lines), and the query/ directory.

Top level: `query()` is an async generator

Claude Code’s main loop is a single function:

// query.ts line 219
export async function* query(
  params: QueryParams,
): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
> {
  const consumedCommandUuids: string[] = []
  const terminal = yield* queryLoop(params, consumedCommandUuids)
  for (const uuid of consumedCommandUuids) {
    notifyCommandLifecycle(uuid, 'completed')
  }
  return terminal
}

Two key design decisions:

Async generator instead of Promise<Response> — the UI layer consumes yielded StreamEvents in real time; the character-by-character streaming output users see flows from here
Terminal return type — the generator has an explicit exit reason (Terminal), not a black box. Callers can distinguish “normal completion / aborted / hit blocking limit / maxTurns exhausted” etc.

queryLoop is the internal implementation; query is a thin shell that wraps it with command lifecycle notifications. If the loop throws or the generator is closed via .return(), the “completed” notification is never sent; only a normal return sends it. The lifecycle is asymmetric: started does not guarantee completed.
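
To make the two design decisions concrete, here is a minimal sketch of an async generator that yields events and returns an exit reason, plus a consumer that reads both. The names `miniLoop`, `drain`, `MiniEvent`, and `MiniTerminal` are hypothetical and much simpler than the real `query()` types. Note that a plain `for await` loop discards the return value, so the consumer calls `next()` by hand.

```typescript
// Hypothetical, simplified stand-ins for StreamEvent and Terminal.
type MiniEvent = { type: "text"; text: string };
type MiniTerminal = { reason: "done" | "max_turns" };

// Yields one event per chunk and returns an explicit exit reason.
async function* miniLoop(
  chunks: string[],
  maxTurns: number,
): AsyncGenerator<MiniEvent, MiniTerminal> {
  let turn = 0;
  for (const text of chunks) {
    turn += 1;
    if (turn > maxTurns) return { reason: "max_turns" };
    yield { type: "text", text };
  }
  return { reason: "done" };
}

// Consumes yielded events AND the generator's return value.
async function drain(
  gen: AsyncGenerator<MiniEvent, MiniTerminal>,
): Promise<{ events: MiniEvent[]; terminal: MiniTerminal }> {
  const events: MiniEvent[] = [];
  while (true) {
    const step = await gen.next();
    if (step.done) return { events, terminal: step.value };
    events.push(step.value); // a UI would render this immediately
  }
}
```

A caller can then branch on `terminal.reason` once the stream ends, while the UI has already rendered every yielded event.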

QueryParams: the full input to one call

export type QueryParams = {
  messages: Message[]                      // Conversation so far
  systemPrompt: SystemPrompt                // Pre-assembled system prompt
  userContext: { [k: string]: string }      // CLAUDE.md / currentDate
  systemContext: { [k: string]: string }    // gitStatus / cacheBreaker
  canUseTool: CanUseToolFn                  // Permission check callback
  toolUseContext: ToolUseContext            // Tool execution context (mode, allowed tools)
  fallbackModel?: string                    // Fallback model on primary failure
  querySource: QuerySource                  // Call source ID (repl_main_thread / compact / ...)
  maxOutputTokensOverride?: number          // Single-turn output override
  maxTurns?: number                         // Loop upper bound
  skipCacheWrite?: boolean                  // Skip cache write
  taskBudget?: { total: number }            // API-side task budget (beta)
  deps?: QueryDeps                          // Injectable deps (for testing)
}

The injectable deps are a clever piece of design (query/deps.ts):

export type QueryDeps = {
  callModel: typeof queryModelWithStreaming    // LLM call
  microcompact: typeof microcompactMessages    // Tool result clearing
  autocompact: typeof autoCompactIfNeeded      // Auto compaction
  uuid: () => string                            // ID generation
}

The source comment explains why: “tests can inject fakes directly instead of spyOn-per-module — the most common mocks (callModel, autocompact) are each spied in 6-8 test files today with module-import-and-spy boilerplate”.

Takeaway for your own agent: the top-level loop function’s dependencies must be injectable, so tests can pass fakes directly instead of repeating spy boilerplate across six modules. That is the foundation of testability.
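
A minimal sketch of the same idea, assuming a much smaller deps object than the real `QueryDeps`. The names `MiniDeps`, `defaultDeps`, `runTurn`, and `fakeDeps` are hypothetical, and the signatures are invented for illustration.

```typescript
// Hypothetical, simplified deps: not the real QueryDeps signatures.
type MiniDeps = {
  callModel: (prompt: string) => Promise<string>;
  uuid: () => string;
};

// Production wiring would call the real API; omitted in this sketch.
const defaultDeps: MiniDeps = {
  callModel: async () => {
    throw new Error("real model call omitted in this sketch");
  },
  uuid: () => Math.random().toString(36).slice(2, 10),
};

// The loop body only talks to `deps`, never to imported modules directly.
async function runTurn(
  prompt: string,
  deps: MiniDeps = defaultDeps,
): Promise<{ id: string; reply: string }> {
  const reply = await deps.callModel(prompt);
  return { id: deps.uuid(), reply };
}

// In a test: pass fakes directly, no module-level spying.
const fakeDeps: MiniDeps = {
  callModel: async (prompt) => `echo:${prompt}`,
  uuid: () => "fixed-id",
};
```

Because the fake `uuid` is deterministic, a test can compare whole result objects instead of pattern-matching generated IDs.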

Loop state: a 14-field state machine

queryLoop is an infinite while loop with a State object carrying cross-iteration state. The type as shown here has ten fields:

type State = {
  messages: Message[]                                            // Conversation history
  toolUseContext: ToolUseContext                                 // Tool context
  autoCompactTracking: AutoCompactTrackingState | undefined      // Compaction circuit breaker
  maxOutputTokensRecoveryCount: number                           // Max-tokens recovery count
  hasAttemptedReactiveCompact: boolean                           // Has reactive compaction been tried?
  maxOutputTokensOverride: number | undefined                    // Output token override
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null>   // Async tool-use summary
  stopHookActive: boolean | undefined                            // Stop hook running?
  turnCount: number                                              // Current turn
  transition: Continue | undefined                               // Why did the last iteration continue
}

Each field answers one specific question, no redundancy:

transition: why the last iteration didn’t finish, which directly drives how the next iteration is handled (comment: “Lets tests assert recovery paths fired without inspecting message contents”)
maxOutputTokensRecoveryCount: an independent sub-loop counter. On a max-output-tokens error the loop retries several times with a larger cap, and those retries are counted here, not in the global turnCount
hasAttemptedReactiveCompact: reactive compact may be tried at most once per turn, which avoids infinite loops
pendingToolUseSummary: the tool-use summary is generated asynchronously in the background; the main loop does not wait for it

Takeaway for your own agent: a state with many fields isn’t inherently bad — the key is that each field has a specific semantic responsibility. 14 fields × clear semantics beats 4 fields × { [key: string]: any }.
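
As a sketch of how a `transition` field and a separate recovery counter can work together, the reducer below advances a tiny three-field state. Everything here is an assumption made for illustration: the `Continue` variants, the `MAX_RECOVERY_ATTEMPTS` limit, and the choice to reset the counter after a successful tool turn are not taken from the source.

```typescript
// Hypothetical Continue variants; the real union is not shown in the text.
type Continue =
  | { reason: "tool_results" }
  | { reason: "max_output_tokens_recovery"; attempt: number };

type MiniState = {
  turnCount: number;
  maxOutputTokensRecoveryCount: number;
  transition: Continue | undefined;
};

type Outcome = "tool_use" | "max_output_tokens" | "end_turn";
type Stop = { stop: "done" | "error" };

const MAX_RECOVERY_ATTEMPTS = 3; // assumed limit for this sketch

function advance(state: MiniState, outcome: Outcome): MiniState | Stop {
  switch (outcome) {
    case "end_turn":
      return { stop: "done" };
    case "tool_use":
      // A normal follow-up turn: bump turnCount, reset the recovery counter.
      return {
        turnCount: state.turnCount + 1,
        maxOutputTokensRecoveryCount: 0,
        transition: { reason: "tool_results" },
      };
    case "max_output_tokens": {
      // Recovery retries are counted separately from turnCount.
      const attempt = state.maxOutputTokensRecoveryCount + 1;
      if (attempt > MAX_RECOVERY_ATTEMPTS) return { stop: "error" };
      return {
        turnCount: state.turnCount,
        maxOutputTokensRecoveryCount: attempt,
        transition: { reason: "max_output_tokens_recovery", attempt },
      };
    }
  }
}
```

A test can assert on `transition.reason` to confirm that the recovery path fired, without looking at any message content.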

Per-turn 14 steps: the processing pipeline

Each while-loop iteration runs up to these 14 steps (many are conditional and skip):

#	Step	Source location	Purpose
1	State destructure	query.ts lines 311-321	Pull this turn’s needed fields from State
2	Skill prefetch	line 331 `startSkillDiscoveryPrefetch`	Concurrent prefetch of relevant skills, runs during LLM streaming
3	Yield `stream_request_start`	line 337	UI “starting” signal
4	`getMessagesAfterCompactBoundary`	line 365	Only consider messages after the compact boundary; already-compacted messages are skipped
5	`applyToolResultBudget`	line 379	Enforce per-message tool-result budget
6	HISTORY_SNIP (feature-flagged)	line 401	Strategy-based history clearing
7	Microcompact	line 414	Tool result clearing (old Read/Bash results)
8	Context Collapse (feature-flagged)	line 441	Alternative context management system
9	Auto-compact	line 454	LLM summarization (see Compaction)
10	Blocking limit check	line 641	If hitting hard ceiling, yield error and `return { reason: 'blocking_limit' }`
11	`callModel` streaming call	line 659 `deps.callModel`	Call LLM, stream messages/events
12	Yield messages	line 708+	Yield to UI one by one (including tombstone rollback)
13	Tool execution	`StreamingToolExecutor` or `runTools`	Parallel / serial tool execution
14	Collect toolResults, decide continue	end of loop	`needsFollowUp = toolUseBlocks.length > 0`

Key details:

2. Skill prefetch runs in parallel with LLM streaming

const pendingSkillPrefetch = skillPrefetch?.startSkillDiscoveryPrefetch(...)
// ... continue processing
// ... call LLM, streaming receive
// ... skill prefetch runs in background during LLM response

The comment says: “Replaces the blocking assistant_turn path that ran inside getAttachmentMessages (97% of those calls found nothing in prod).” Originally skill discovery was a blocking step: in production, 97% of those calls found nothing yet still held up the whole turn. Run concurrently, it costs almost nothing.

4. `getMessagesAfterCompactBoundary` — compaction boundary protection

After compaction, old messages are replaced with a summary, so this step takes only the messages after the most recent boundary. Comment: “REPL keeps snipped messages for UI scrollback — project so the compact model doesn’t summarize content that was intentionally removed”. The UI scrollback keeps snipped messages for display, but they are filtered out before the model sees them; otherwise intentionally removed content would be summarized again.

5. `applyToolResultBudget` — per-message tool result budget

The source comment: “Enforce per-message budget on aggregate tool result size. Runs BEFORE microcompact — cached MC operates purely by tool_use_id (never inspects content), so content replacement is invisible to it and the two compose cleanly.”

Meaning: tool results have a per-message aggregate budget (different tools can have different ceilings), and results that exceed it have their content replaced with a placeholder. The ordering is critical: this step must run before microcompact. Because cached microcompact keys purely on tool_use_id and never inspects content, the replacement is invisible to it and the two steps compose cleanly.
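
The budget idea can be sketched as follows. This is not the real `applyToolResultBudget`: the function name `applyBudget`, the placeholder wording, and the use of character counts as a stand-in for tokens are all assumptions. The point it illustrates is that `tool_use_id` is preserved while content is swapped, so any later step that keys on the ID is unaffected.

```typescript
type ToolResult = { tool_use_id: string; content: string };

// Keeps results in order while they fit the budget; replaces the content of
// any result that would push the running total over it.
// Assumption: the budget is measured in characters as a stand-in for tokens.
function applyBudget(results: ToolResult[], budgetChars: number): ToolResult[] {
  let used = 0;
  return results.map((r) => {
    if (used + r.content.length <= budgetChars) {
      used += r.content.length;
      return r;
    }
    // The ID survives; only the content is replaced by a placeholder.
    return {
      tool_use_id: r.tool_use_id,
      content: `[result omitted: ${r.content.length} chars, over budget]`,
    };
  });
}
```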

10. Blocking limit — proactive block before hard ceiling

const { isAtBlockingLimit } = calculateTokenWarningState(
  tokenCountWithEstimation(messagesForQuery) - snipTokensFreed,
  model,
)
if (isAtBlockingLimit) {
  yield createAssistantAPIErrorMessage({ content: PROMPT_TOO_LONG_ERROR_MESSAGE, ... })
  return { reason: 'blocking_limit' }
}

When auto-compaction is disabled, this check blocks the request proactively before it goes over the limit, reserving MANUAL_COMPACT_BUFFER_TOKENS = 3000 tokens of headroom so that a manual /compact can still run. The comment details four cases in which this gate must be skipped:

Just compacted (compactionResult) — usage counts are stale
querySource === 'compact' / 'session_memory' — forked agents would deadlock
Reactive compact enabled — let actual 413 trigger reactive
Context collapse enabled — collapse manages itself

Takeaway for your own agent: hard-ceiling interception shouldn’t be a single global switch. You must be able to exempt specific call paths precisely; otherwise those paths deadlock at the blocking limit.
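
A sketch of a gate with explicit exemptions, based on the four cases listed above. The `GateInput` shape and the function names are hypothetical, and the formula "context window minus buffer" is an assumption; the text does not show how `calculateTokenWarningState` derives the limit.

```typescript
type GateInput = {
  justCompacted: boolean;          // usage counts are stale right after compaction
  querySource: string;             // e.g. "repl_main_thread", "compact", "session_memory"
  reactiveCompactEnabled: boolean; // let a real 413 trigger reactive compact
  contextCollapseEnabled: boolean; // collapse manages its own context
};

const MANUAL_COMPACT_BUFFER_TOKENS = 3000; // value quoted in the text

// True when the blocking check must NOT run for this call path.
function shouldSkipBlockingCheck(g: GateInput): boolean {
  return (
    g.justCompacted ||
    g.querySource === "compact" ||
    g.querySource === "session_memory" ||
    g.reactiveCompactEnabled ||
    g.contextCollapseEnabled
  );
}

// Assumption: the limit is the context window minus the manual-compact buffer.
function gate(tokens: number, contextWindow: number, g: GateInput): "proceed" | "blocking_limit" {
  if (shouldSkipBlockingCheck(g)) return "proceed";
  return tokens >= contextWindow - MANUAL_COMPACT_BUFFER_TOKENS ? "blocking_limit" : "proceed";
}
```

Keeping the exemptions in one predicate makes them easy to audit and to test one case at a time.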

13. Streaming Tool Executor — execute while streaming

const useStreamingToolExecution = config.gates.streamingToolExecution
let streamingToolExecutor = useStreamingToolExecution
  ? new StreamingToolExecutor(...)
  : null

Two paths:

Traditional: LLM finishes streaming → parse tool_use → serial / parallel execute → get results
Streaming (StreamingToolExecutor): tool_use block starts executing immediately as the LLM streams it out, not waiting for the full assistant message

Latency drops significantly: for multiple independent tool calls, total wall-clock time approaches that of fully parallel execution.
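
The streaming path can be sketched as below. This is not the real `StreamingToolExecutor`; `fakeStream` and `runStreaming` are hypothetical names, and error handling, concurrency limits, and permission checks are left out. The technique shown is only the core one: start each tool as soon as its block arrives, and await the results after the stream ends.

```typescript
type ToolUse = { id: string; name: string };

// Simulated model stream: emits tool_use blocks one at a time and records
// in `log` when each block arrives and when the stream ends.
async function* fakeStream(blocks: ToolUse[], log: string[]): AsyncGenerator<ToolUse> {
  for (const block of blocks) {
    await Promise.resolve(); // stand-in for network latency
    log.push(`streamed:${block.id}`);
    yield block;
  }
  log.push("stream_end");
}

// Starts each tool when its block arrives; collects results afterwards.
async function runStreaming(
  stream: AsyncIterable<ToolUse>,
  runTool: (t: ToolUse) => Promise<string>,
): Promise<string[]> {
  const inFlight: Promise<string>[] = [];
  for await (const block of stream) {
    inFlight.push(runTool(block)); // start now; deliberately not awaited here
  }
  return Promise.all(inFlight); // results come back in block order
}
```

In the traditional path the `runTool` calls would only begin after `stream_end`; here the first tool is already running while later blocks are still arriving.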

Model fallback + streaming fallback

Line 654’s while (attemptWithFallback) is double-layer fallback logic:

Model fallback: primary model fails (API error / throttle) → switch to fallbackModel and retry
Streaming fallback: streaming mode errors (e.g., thinking block exception) → fall back to non-streaming

The source handles one intricate but important case here: when streaming fallback fires, the half-streamed assistant messages must be tombstoned. They may carry invalid thinking block signatures, and resubmitting them would make the API request fail.

if (streamingFallbackOccured) {
  for (const msg of assistantMessages) {
    yield { type: 'tombstone' as const, message: msg }  // UI / transcript: delete this
  }
  assistantMessages.length = 0
  toolResults.length = 0
  // ... discard pending tool results, recreate executor
}

Tombstone messages are UI / transcript “deletion markers” — the messages already streamed can’t be recalled from the client, but tombstones tell downstream “this is void.”

Takeaway for your own agent: streaming output needs a retraction mechanism. When a stream fails halfway, the characters already sent can’t be taken back, so you need an explicit void marker.
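
On the consuming side, a tombstone can be handled with a small reducer over the emitted events. This is a sketch under assumptions: the `Emitted` shape is simplified, `reduceTranscript` is a hypothetical name, and matching by `uuid` is a guess at how a consumer would identify the voided message.

```typescript
type Msg = { uuid: string; text: string };
type Emitted =
  | { type: "assistant"; message: Msg }
  | { type: "tombstone"; message: Msg };

// Builds the transcript a UI or log writer would keep: assistant messages
// are appended, and a tombstone removes the message it refers to.
function reduceTranscript(events: Emitted[]): Msg[] {
  const transcript: Msg[] = [];
  for (const ev of events) {
    if (ev.type === "assistant") {
      transcript.push(ev.message);
    } else {
      // Assumption: tombstones identify their target by uuid.
      const i = transcript.findIndex((m) => m.uuid === ev.message.uuid);
      if (i !== -1) transcript.splice(i, 1);
    }
  }
  return transcript;
}
```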

Termination: 6 Terminal reasons

From all return { reason: ... } branches in query.ts, the loop can terminate with these reasons:

Reason	Condition	Meaning
`blocking_limit`	Hit hard ceiling	Context is too long to send; a manual /compact is required
`max_turns`	`turnCount > maxTurns`	User-set turn upper bound
`done`	No tool_use this turn	Model believes task complete
`aborted`	`abortController.signal.aborted`	User interrupted
`stop_hook_blocked`	Stop hook returned block	User hook blocked continuation
`error`	Other exceptions	Unrecoverable error

Different reasons trigger different follow-ups — “done” shows completion in UI, “aborted” shows “cancelled”, “blocking_limit” prompts manual compact, “max_turns” suggests raising maxTurns.
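
One way to keep those follow-ups in sync with the list of reasons is an exhaustive switch. In this sketch the six reason strings come from the table above, while `describeTerminal` and the UI strings are hypothetical. The `never` assignment makes the compiler reject the function if a reason is added to the union without a matching case.

```typescript
type TerminalReason =
  | "blocking_limit"
  | "max_turns"
  | "done"
  | "aborted"
  | "stop_hook_blocked"
  | "error";

// Maps each exit reason to a user-facing follow-up (strings are illustrative).
function describeTerminal(reason: TerminalReason): string {
  switch (reason) {
    case "done":
      return "Completed";
    case "aborted":
      return "Cancelled";
    case "blocking_limit":
      return "Context is full: run /compact";
    case "max_turns":
      return "Turn limit reached: consider raising maxTurns";
    case "stop_hook_blocked":
      return "Blocked by a stop hook";
    case "error":
      return "Unrecoverable error";
    default: {
      // Compile-time exhaustiveness check: unreachable if all cases are handled.
      const unreachable: never = reason;
      throw new Error(`unhandled reason: ${String(unreachable)}`);
    }
  }
}
```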

Task Layer: 7 task types

Top-level agent invocation is wrapped in Task.ts. 7 TaskType values:

export type TaskType =
  | 'local_bash'           // Local shell task
  | 'local_agent'          // Locally running subagent
  | 'remote_agent'         // CCR cloud agent
  | 'in_process_teammate'  // In-process teammate
  | 'local_workflow'       // Local workflow
  | 'monitor_mcp'          // MCP server monitor
  | 'dream'                // Nightly memory-curation dream job

5 statuses:

export type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed'

Task IDs have prefixes (TASK_ID_PREFIXES):

{
  local_bash: 'b',           // Kept as 'b' for backward compatibility
  local_agent: 'a',
  remote_agent: 'r',
  in_process_teammate: 't',
  local_workflow: 'w',
  monitor_mcp: 'm',
  dream: 'd',
}

The suffix is 8 random characters, giving 36^8 ≈ 2.8 trillion combinations. Source comment: “sufficient to resist brute-force symlink attacks”.

Why IDs must resist symlink attacks: task output file paths are derived from the ID (getTaskOutputPath(id)). If an attacker can predict the ID, they can pre-create a symlink pointing to an arbitrary file, so that the task’s stdout is written to that file. The entropy of 36^8 makes this impractical.

Takeaway for your own agent: any system using IDs as file paths must consider ID predictability — “ID not random enough causing race conditions or attacks” is a common production incident.
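
A sketch of generating such an ID in Node.js. The prefix and the 8-character length come from the text; the alphabet (digits plus lowercase letters) and the use of `crypto.randomInt` are assumptions, since the text does not show how the real suffix is produced. `generateTaskId` is a hypothetical name.

```typescript
import { randomInt } from "node:crypto";

// Assumed alphabet: 36 symbols, matching the 36^8 figure in the text.
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";
const SUFFIX_LENGTH = 8;

// Builds "<prefix><8 random chars>" using a cryptographically secure RNG,
// so the resulting file path cannot be predicted from earlier IDs.
function generateTaskId(prefix: string): string {
  let suffix = "";
  for (let i = 0; i < SUFFIX_LENGTH; i++) {
    suffix += ALPHABET.charAt(randomInt(ALPHABET.length));
  }
  return prefix + suffix;
}

// Size of the ID space: 36^8 = 2,821,109,907,456.
const ID_SPACE = ALPHABET.length ** SUFFIX_LENGTH;
```

The important choice is the random source: `Math.random()` is not designed to be unpredictable, so it is a poor fit when the ID doubles as a file path.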

Resume: `claude --resume` and session storage

The compaction chapter quoted: “Background jobs that summarize previous conversations for the claude --resume feature”. In other words, the summary that resume relies on is pre-computed in the background.

Mechanism breakdown:

Session storage: every conversation writes to ~/.claude/projects/<project>/sessions/<sessionId>.jsonl
Background summary agent: after exiting Claude Code, a background job reads the session file and produces a summary
On resume: read session + summary, reconstruct State, enter queryLoop

Resume is near-instant because the summary has already been computed. The comment: “subscribers can use /stats to view usage patterns”, meaning usage data persists and is visible across sessions.

Full-chain abort path

toolUseContext.abortController is the central cancellation point threaded through the call chain:

User presses Esc
  → abortController.abort()
  → signal.aborted = true
  → LLM stream interrupted (AbortSignal passed to fetch)
  → all in-flight tools receive signal (tool.execute's second arg)
  → each tool's cleanup (bwrap process kill, file lock release, ...)
  → queryLoop checks signal.aborted → return { reason: 'aborted' }

Key design: each layer observes the signal itself, not waiting for upper-layer notification. LLM fetch natively supports AbortSignal; tool execute’s second arg always includes the signal; bash process wait watches the signal — cancellation is broadcast across the chain, not forwarded layer-by-layer.

Takeaway for your own agent: AbortController must thread from entry point to every leaf operation. Half-assed abort support is worse than none — users think they cancelled but something’s still running.
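
A sketch of a leaf operation that observes the signal itself, plus a loop step that maps the abort to an exit reason. `slowTool` and `runOneTurn` are hypothetical names, and clearing a timer stands in for real cleanup such as killing a child process. `AbortController` and `AbortSignal` are the standard web and Node.js APIs.

```typescript
type AbortTerminal = { reason: "done" | "aborted" };

// A leaf operation: listens for abort directly and cleans up after itself.
function slowTool(ms: number, signal: AbortSignal, log: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted before start"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stand-in for killing a child process
      log.push("tool cleanup");
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("tool finished");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// The loop checks the same signal and turns it into an explicit exit reason.
async function runOneTurn(controller: AbortController, log: string[]): Promise<AbortTerminal> {
  try {
    await slowTool(60_000, controller.signal, log);
  } catch (err) {
    if (!controller.signal.aborted) throw err; // a real failure, not a cancel
  }
  return controller.signal.aborted ? { reason: "aborted" } : { reason: "done" };
}
```

Calling `controller.abort()` from the UI layer reaches the tool and the loop through the same signal, with no intermediate layer forwarding it.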

QueryEngine: the level of a single LLM call

QueryEngine.ts (1295 lines) sits beneath the callModel function and is dedicated to consuming the stream of a single LLM call:

Parse SSE events (content_block_start / content_block_delta / content_block_stop / message_delta / …)
Build assistant messages
Handle max_tokens / stop_reason / various API errors
Streaming fallback (discussed above)
Thinking block special handling (signature verification)
Usage tracking (including cache read / cache creation token counts separately)

This layer’s complexity comes from assembling the API’s “low-level event stream” into “high-level assistant messages” while staying cancel-safe, error-safe, and partial-state-safe. This is not a toy: production-grade streaming API consumer logic easily runs past 1000 lines.
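
To give a feel for the assembly step, here is a heavily reduced sketch that folds text-only stream events into blocks. The event names match those listed above, but the `StreamEvt` type covers only a small subset of the fields, and `assemble` is a hypothetical function, not the QueryEngine implementation. It ignores thinking blocks, tool_use blocks, and usage tracking.

```typescript
// Simplified subset of streaming events (text blocks only).
type StreamEvt =
  | { type: "content_block_start"; index: number; content_block: { type: "text"; text: string } }
  | { type: "content_block_delta"; index: number; delta: { type: "text_delta"; text: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string | null } };

type Assembled = { blocks: string[]; stopReason: string | null; openBlocks: number[] };

// Folds low-level events into text blocks. A non-empty `openBlocks` means the
// stream ended mid-block, so the caller is holding partial state.
function assemble(events: StreamEvt[]): Assembled {
  const blocks: string[] = [];
  const open = new Set<number>();
  let stopReason: string | null = null;
  for (const ev of events) {
    switch (ev.type) {
      case "content_block_start":
        blocks[ev.index] = ev.content_block.text;
        open.add(ev.index);
        break;
      case "content_block_delta":
        if (!open.has(ev.index)) throw new Error(`delta for unopened block ${ev.index}`);
        blocks[ev.index] = (blocks[ev.index] ?? "") + ev.delta.text;
        break;
      case "content_block_stop":
        open.delete(ev.index);
        break;
      case "message_delta":
        stopReason = ev.delta.stop_reason;
        break;
    }
  }
  return { blocks, stopReason, openBlocks: Array.from(open) };
}
```

Even this toy version has to decide what to do with a delta for an unknown block and with a stream that stops before `content_block_stop`; the production code has to make such decisions for every event type.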

Takeaways for building your own agent

Main loop as async generator, not Promise<Response> — real-time streaming to UI is baseline for agent products
Terminal return value must carry reason — different termination reasons drive different UX; opaque undefined can’t support that
Dependency injection: high-frequency mocks (callModel, microcompact, autocompact) as a deps object — otherwise tests must spy 6-8 modules
State fields with clear semantics: 14 fields each answering one question beats an any bucket
Per-turn processing is an explicit pipeline: snip → microcompact → collapse → autocompact → blocking check → model call → tools. The order is a design choice; annotate why this order
Concurrent prefetch: skill / memory / cache params can run in the background during LLM streaming; manage their lifecycle with Promise.all or with `using` declarations on disposable resources
Streaming tool executor: tool_use block starts executing as LLM streams it, not waiting for the full assistant message — significantly lower latency for multi-tool scenarios
Tombstone messages as explicit “streamed but void” markers — streaming output can’t be client-side retracted; need an explicit void marker
Model fallback + streaming fallback are two layers — handling API errors vs streaming exceptions
AbortController threads to leaves — each layer watches the signal itself, not forwarded layer-by-layer. Half-assed cancel is worse than none
Task IDs need enough entropy — when IDs become filenames, consider symlink attacks. 36^8 is Claude Code’s choice
Resume summaries are pre-computed in the background: the summarization work starts after exit, so resuming feels instant instead of waiting on a fresh summary