Compaction

Three numbers not to conflate

Three different token counts are easy to mix up, and mixing them up distorts the whole mental model:

Number	Meaning	How Claude Code computes it
Model context window	Hard ceiling — exceeding it errors out	`getContextWindowForModel(model)` — 1M for Opus 4.7’s 1M variant
Effective window	Usable window after reserving room for summary output	`contextWindow - min(maxOutput, 20000)`
Auto-compact threshold	Where auto-compaction fires	`effectiveWindow - 13000` — source constant `AUTOCOMPACT_BUFFER_TOKENS = 13000`

Doing the math:

200k-window model (non-1M): effectiveWindow ≈ 180k → auto-compact threshold ≈ 167k
1M-window model (Opus 4.7 1M / Sonnet 1M): effectiveWindow ≈ 980k → auto-compact threshold ≈ 967k
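The arithmetic above fits in a few lines. This is a minimal sketch, not Claude Code's implementation. The two constants are the values quoted in this section. The function names are hypothetical. The 32k maximum output used below is an assumed example, chosen so that the 20k cap applies.

```typescript
// Values quoted in this section; the function names are hypothetical.
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000
const AUTOCOMPACT_BUFFER_TOKENS = 13_000

// Usable window after reserving room for the summary output itself.
function effectiveWindow(contextWindow: number, maxOutput: number): number {
  return contextWindow - Math.min(maxOutput, MAX_OUTPUT_TOKENS_FOR_SUMMARY)
}

// Token count at which auto-compaction would fire.
function autoCompactThreshold(contextWindow: number, maxOutput: number): number {
  return effectiveWindow(contextWindow, maxOutput) - AUTOCOMPACT_BUFFER_TOKENS
}

// 32_000 is an assumed max-output figure, not a quoted model limit.
console.log(autoCompactThreshold(200_000, 32_000))   // 167000
console.log(autoCompactThreshold(1_000_000, 32_000)) // 967000
```

The reservation is min(maxOutput, 20000). A model whose maximum output is below 20k therefore reserves less and gets a slightly larger effective window.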

So the sense that “Claude Code has a 200k window” is model-specific. On a 200k model, the 167k threshold really does feel like roughly 200k; on a 1M model, the threshold is nowhere near 200k. What still makes users feel pressure on 1M models is microcompact (covered below), which clears old tool results long before auto-compact fires.

The “20k reserved for output” number deserves its own note: per Claude Code telemetry, the p99.99 of compaction summary output is 17,387 tokens. Reserving 20k covers the long tail. That number isn’t a guess — it’s tuned from production data.

Source: claude-code/services/compact/autoCompact.ts — all threshold constants, env overrides, and the circuit breaker live in one file.

Five-tier compaction pipeline

Claude Code’s compaction isn’t one function. The trigger paths and their application order are:

Tier	File	Cost	What it does
1. Microcompact	`services/compact/microCompact.ts`	Zero LLM calls	Clears old results from the compactable tools (Read, Bash, Grep, Glob, WebSearch, WebFetch, Edit, Write), replacing them with the placeholder `[Old tool result content cleared]`
2. Session Memory	`services/compact/sessionMemoryCompact.ts`	Zero LLM calls	Drops turns already captured in session memory, keeping 10k-40k tokens of recent conversation
3. Auto-compact	`services/compact/compact.ts`	1 summarization LLM call	Calls the model to produce a 9-section structured summary that replaces the whole history
4. Reactive compact	`services/compact/reactiveCompact.ts` (feature-flagged)	1 summarization LLM call	Handles the API’s 413 prompt_too_long — auto-compacts then retries
5. Context collapse	`services/contextCollapse/` (feature-flagged)	Incremental	A separate context management system; fires at 90% / 95% thresholds

/compact goes through 2 → 3 (try session memory first, fall back to LLM summarization). The automatic path runs 1 on every turn and, once the threshold is crossed, tries 2 and then 3. Reactive compact is a specialized safety net, and context collapse is an alternative system.

Core design principle: cheap methods first; only invoke the LLM when the cheaper tiers can't hold. Microcompact runs every turn at almost no cost. Real LLM summarization only fires when microcompact and session memory can't keep the context under the threshold.
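The cheap-first ordering can be sketched as a loop over tiers. Everything here is hypothetical: the tier shape, the names, and the toy token savings illustrate the principle and are not Claude Code's pipeline. The always-on tier runs every turn. The costlier tiers run only while the context is still over the threshold.

```typescript
// Hypothetical tier description; the real pipeline is not structured like this.
interface CompactionTier {
  label: string
  llmCalls: number // 0 for the cheap tiers, 1 for LLM summarization
  runEveryTurn: boolean // true for microcompact-style tiers
  apply: (tokens: number) => number // token count after this tier runs
}

function compactTurn(tokens: number, threshold: number, tiers: CompactionTier[]) {
  const used: string[] = []
  let llmCalls = 0
  for (const tier of tiers) {
    // Always-on tiers run unconditionally; the rest only while still over the threshold.
    if (!tier.runEveryTurn && tokens < threshold) continue
    tokens = tier.apply(tokens)
    used.push(tier.label)
    llmCalls += tier.llmCalls
  }
  return { tokens, used, llmCalls }
}

// Toy savings, ordered cheapest first.
const tiers: CompactionTier[] = [
  { label: 'microcompact', llmCalls: 0, runEveryTurn: true, apply: t => t - 30_000 },
  { label: 'session-memory', llmCalls: 0, runEveryTurn: false, apply: t => Math.min(t, 40_000) },
  { label: 'llm-summary', llmCalls: 1, runEveryTurn: false, apply: () => 15_000 },
]

console.log(compactTurn(170_000, 167_000, tiers).used.join(', ')) // microcompact
console.log(compactTurn(250_000, 167_000, tiers).used.join(', ')) // microcompact, session-memory
```

In the first call, microcompact alone brings the context under the threshold, so no further tier runs. In the second, session memory is enough and the LLM tier is never paid for.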

Each tier unpacks below.

Tier 1: Microcompact — Tool result clearing

Core insight: a large share of your token budget goes to stale tool results, such as a Read of a 3000-line file, a Bash command that dumps a huge log, or a Grep matching 500 entries. Once nothing references them anymore, they can be cleared without any LLM involvement.

Which tools count as “compactable”

The source has an explicit COMPACTABLE_TOOLS set:

const COMPACTABLE_TOOLS = new Set<string>([
  FILE_READ_TOOL_NAME,     // Read
  ...SHELL_TOOL_NAMES,     // Bash
  GREP_TOOL_NAME,
  GLOB_TOOL_NAME,
  WEB_SEARCH_TOOL_NAME,
  WEB_FETCH_TOOL_NAME,
  FILE_EDIT_TOOL_NAME,     // Edit
  FILE_WRITE_TOOL_NAME,    // Write
])

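Note: TodoWrite, Task, and subagent delegation results are NOT in this set. They're stateful, and clearing them would make the agent lose track of its tasks. The set holds only tools whose output can be regenerated on demand (re-read the file, re-run the search) or whose effect already persists on disk (Edit, Write).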
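Here is a minimal sketch of the clearing step, with its assumptions stated up front. The message shape is simplified. The tool names are display names standing in for the source's *_TOOL_NAME constants. Keeping the N most recent results is a hypothetical policy, since this section doesn't say exactly which results count as old.

```typescript
// Hypothetical, simplified tool-result shape; the real Claude Code types differ.
interface ToolResult {
  toolUseId: string
  toolName: string
  content: string
}

const CLEARED_PLACEHOLDER = '[Old tool result content cleared]'
// Display names standing in for the source's COMPACTABLE_TOOLS constants.
const COMPACTABLE = new Set(['Read', 'Bash', 'Grep', 'Glob', 'WebSearch', 'WebFetch', 'Edit', 'Write'])

// Clear every compactable result except the `keepRecent` newest ones (hypothetical policy).
// Non-compactable results such as TodoWrite are never touched.
// Returns the new list plus the cleared ids, e.g. for a boundary record.
function clearOldToolResults(results: ToolResult[], keepRecent: number): { results: ToolResult[]; clearedIds: string[] } {
  const compactableIdx = results
    .map((r, i) => (COMPACTABLE.has(r.toolName) ? i : -1))
    .filter(i => i >= 0)
  const toClear = new Set(compactableIdx.slice(0, Math.max(0, compactableIdx.length - keepRecent)))
  const clearedIds: string[] = []
  const out = results.map((r, i) => {
    if (!toClear.has(i) || r.content === CLEARED_PLACEHOLDER) return r
    clearedIds.push(r.toolUseId)
    return { ...r, content: CLEARED_PLACEHOLDER }
  })
  return { results: out, clearedIds }
}
```

No LLM is involved: the step is a pure rewrite of message content, which is why it can run on every turn.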

Two microcompact paths

Path A: Time-based microcompact

Trigger: the gap since the last assistant message exceeds a threshold, meaning the server-side prompt cache has expired.

Logic: since the cache is cold and the whole prefix will be rewritten anyway, clear old tool results first so the rewrite is shorter.

Output: insert a microcompact_boundary system message recording { trigger, preTokens, tokensSaved, compactedToolIds, clearedAttachmentUUIDs }.
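The time-based trigger and the boundary record can be sketched as follows. The boundary's field names come from this section. The gap threshold is a hypothetical value: Anthropic's prompt cache has a five-minute default lifetime, so an idle gap of that order implies a cold cache, but Claude Code's actual setting isn't given here. The function names are also hypothetical.

```typescript
// Hypothetical threshold, tied to the prompt cache's five-minute default lifetime.
const CACHE_COLD_AFTER_MS = 5 * 60 * 1000

// Field names follow the microcompact_boundary description in the text.
interface MicrocompactBoundary {
  type: 'system'
  subtype: 'microcompact_boundary'
  microcompactMetadata: {
    trigger: 'auto'
    preTokens: number
    tokensSaved: number
    compactedToolIds: string[]
    clearedAttachmentUUIDs: string[]
  }
}

// Cache is assumed cold once the idle gap exceeds the threshold.
function shouldTimeBasedMicrocompact(lastAssistantAtMs: number, nowMs: number): boolean {
  return nowMs - lastAssistantAtMs > CACHE_COLD_AFTER_MS
}

function makeMicrocompactBoundary(
  preTokens: number,
  tokensSaved: number,
  compactedToolIds: string[],
  clearedAttachmentUUIDs: string[],
): MicrocompactBoundary {
  return {
    type: 'system',
    subtype: 'microcompact_boundary',
    microcompactMetadata: { trigger: 'auto', preTokens, tokensSaved, compactedToolIds, clearedAttachmentUUIDs },
  }
}
```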

Path B: Cached microcompact (feature-flagged `CACHED_MICROCOMPACT`)

This path is more surgical: it doesn't modify local message content. Instead, it uses the Anthropic API's cache editing mechanism to delete tool results from the server-side cache only. The local prefix stays the same, so the next call still hits the cache.

Path A's cost is rewriting the entire prefix. Path B's cost is near zero, since it just tells the server “delete these blocks.” But cache editing is currently limited to the main thread and to specific models, so both paths coexist.

Source: services/compact/microCompact.ts — cachedMicrocompactPath and maybeTimeBasedMicrocompact

Images handled specially

Images and PDFs are estimated at a fixed 2000 tokens regardless of size. When microcompact clears one, a placeholder takes its place and the savings are counted as 2000 tokens.
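The fixed media estimate can be sketched as below. The block shape is hypothetical. The ~4-characters-per-token rule for text is an assumed rough heuristic. Only the flat 2000-token figure for images and PDFs comes from the text.

```typescript
// Hypothetical content-block shape.
type ContentBlock =
  | { kind: 'text'; text: string }
  | { kind: 'image'; bytes: number }
  | { kind: 'document'; bytes: number } // e.g. a PDF

const MEDIA_TOKEN_ESTIMATE = 2000

function estimateBlockTokens(block: ContentBlock): number {
  // Assumed heuristic for text: roughly 4 characters per token.
  if (block.kind === 'text') return Math.ceil(block.text.length / 4)
  // Images and documents: a flat estimate, ignoring their actual size.
  return MEDIA_TOKEN_ESTIMATE
}

// Savings credited when blocks are cleared: the sum of their estimates.
function tokensSavedByClearing(blocks: ContentBlock[]): number {
  return blocks.reduce((sum, b) => sum + estimateBlockTokens(b), 0)
}
```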

Why this tier is the first line of defense

Microcompact runs every turn — earlier than any LLM summarization. When it’s working well, your session may go hours without triggering a real auto-compact. The user impression of “a soft ~200k ceiling” is largely microcompact quietly working in the background to keep window occupancy low.

Tier 2: Session Memory Compaction

This is an experimental mechanism (the source comment says “EXPERIMENT”): compaction based on Claude Code’s session memory — long-term memory continuously extracted from prior conversations in the background.

Config (DEFAULT_SM_COMPACT_CONFIG):

{
  minTokens: 10_000,           // preserve at least 10k tokens of raw text
  minTextBlockMessages: 5,     // preserve at least 5 messages with text blocks
  maxTokens: 40_000,           // hard cap at 40k tokens preserved
}

Logic: drop older conversation turns that have already been distilled into session memory — those points are safely captured in memory; the conversation-layer copy is redundant. Keep the 10k-40k tokens of most recent content that memory hasn’t yet absorbed.
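The config gives the bounds, but this section doesn't spell out the selection algorithm. The following is one plausible reading, not Claude Code's code. It keeps everything session memory hasn't absorbed. It then pulls older messages back in until both minimums are met, and it never exceeds maxTokens. The message shape, the `firstUnabsorbed` index, and the function name are hypothetical.

```typescript
interface SmConfig {
  minTokens: number
  minTextBlockMessages: number
  maxTokens: number
}

// Default values as listed in DEFAULT_SM_COMPACT_CONFIG above.
const SM_DEFAULTS: SmConfig = { minTokens: 10_000, minTextBlockMessages: 5, maxTokens: 40_000 }

// Hypothetical per-message summary: token size and whether it has a text block.
interface Msg {
  tokens: number
  hasText: boolean
}

// Returns the index of the first message to keep; earlier messages are dropped
// on the assumption that session memory already covers them.
// `firstUnabsorbed` is the oldest message memory has not yet absorbed (hypothetical).
function smKeepStart(msgs: Msg[], firstUnabsorbed: number, cfg: SmConfig = SM_DEFAULTS): number {
  let start = msgs.length
  let tokens = 0
  let textMsgs = 0
  while (start > 0) {
    const m = msgs[start - 1]
    const mustKeep = start - 1 >= firstUnabsorbed
    const needMore = tokens < cfg.minTokens || textMsgs < cfg.minTextBlockMessages
    if (!mustKeep && !needMore) break
    if (tokens + m.tokens > cfg.maxTokens) break // hard cap on preserved tokens
    start--
    tokens += m.tokens
    if (m.hasText) textMsgs++
  }
  return start
}
```

With twenty 3k-token text messages and only the last two unabsorbed, this keeps five messages (15k tokens), the point where both minimums are met. If nothing has been absorbed, it stops at thirteen messages (39k tokens), just under the 40k cap.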

This path doesn’t support custom instructions — a user’s /compact preserve X skips this tier and goes straight to the next. Reason: session memory content is already extracted and structured; overlaying user instructions would muddy the semantics.

Both /compact (no args) and auto-compact try this tier first. If session memory is empty or inapplicable, they fall back to Tier 3.

Tier 3: Auto-compact — LLM summarization

Reaching this tier means microcompact and session memory couldn’t save us. This is the real LLM summarization.

Trigger function

// services/compact/autoCompact.ts
export async function shouldAutoCompact(messages, model, querySource, snipTokensFreed) {
  // Recursion guard: querySource === 'session_memory' | 'compact' rejects directly.
  // (These are forked agents — re-triggering compaction recursively would deadlock.)

  if (!isAutoCompactEnabled()) return false

  const tokenCount = tokenCountWithEstimation(messages) - snipTokensFreed
  const threshold = getAutoCompactThreshold(model)  // effectiveWindow - 13000
  return tokenCount >= threshold
}

Details that matter:

snipTokensFreed: the REPL's “snip” has already dropped some messages, but API-side usage still reflects the pre-snip state; subtracting this value keeps those already-freed tokens from counting toward the threshold
Recursion guard: compaction itself is a forked LLM call; if its context also overflowed and re-triggered compaction, it would deadlock. Calls from the marked query sources skip the trigger entirely
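The two details above can be made concrete in a self-contained version of the check. This is a sketch, not the source function. The input shape, the `'main'` query source name, and `shouldAutoCompactSketch` are hypothetical. The recursion guard is written out as the source comment describes it.

```typescript
// 'main' is a hypothetical name for an ordinary query; the other two are the
// forked sources named in the source comment.
type QuerySource = 'main' | 'session_memory' | 'compact'

interface AutoCompactInput {
  estimatedTokens: number // a tokenCountWithEstimation-style estimate
  snipTokensFreed: number // dropped locally but still present in API-side usage
  threshold: number // effectiveWindow - 13_000
  querySource: QuerySource
  autoCompactEnabled: boolean
}

function shouldAutoCompactSketch(input: AutoCompactInput): boolean {
  // Recursion guard: the forked summarization / memory agents never trigger compaction.
  if (input.querySource === 'session_memory' || input.querySource === 'compact') return false
  if (!input.autoCompactEnabled) return false
  // Don't count tokens that a snip has already freed.
  const tokenCount = input.estimatedTokens - input.snipTokensFreed
  return tokenCount >= input.threshold
}
```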

The summary prompt — 9 structured sections

The real prompt (source: services/compact/prompt.ts, BASE_COMPACT_PROMPT) demands 9 structured sections:

#	Section	Emphasis
1	Primary Request and Intent	Capture all explicit user requests in detail
2	Key Technical Concepts	Stack and frameworks discussed
3	Files and Code Sections	Files modified / read, with full code snippets; emphasize recent messages
4	Errors and fixes	Each error + fix + user feedback on that error
5	Problem Solving	Solved problems + ongoing troubleshooting
6	All user messages	All non-tool-result user messages — used to detect feedback and intent changes
7	Pending Tasks	Tasks the user explicitly asked for but haven’t been done
8	Current Work	Precisely what was being worked on at compaction time
9	Optional Next Step	Next action — must include direct quotes from the most recent conversation to prevent drift

Why these nine (the logic is implicit in the source):

Sections 1 / 6 / 7 answer “what’s the task” — task objectives don’t get lost
Section 3 answers “where” — file paths + code snippets so the resuming agent can relocate
Section 4 answers “which pitfalls have we already hit”: no revisiting dead ends
Sections 8 / 9 answer “what’s next” — next step gets specific about files and functions

Section 9 is especially critical: “include direct quotes from the most recent conversation showing exactly what task you were working on”. Claude Code enforces verbatim quotation from the recent dialogue to prevent the summary from drifting the user’s intent.

`<analysis>` + `<summary>` dual-block output

The model’s output isn’t 9 sections directly — it’s two XML blocks:

<analysis>
[Thinking process — scratchpad, improves summary quality but has no long-term value]
</analysis>

<summary>
1. Primary Request and Intent: ...
2. Key Technical Concepts: ...
...
</summary>

formatCompactSummary() strips the <analysis> block — it’s chain-of-thought scratchpad for the LLM. Only the <summary> 9 sections survive.

This is a “rich during thinking, slim when stored” prompt-engineering technique — you get chain-of-thought quality without scratchpad polluting context.
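The strip-then-keep step can be approximated with two regular expressions. This is a sketch, not the actual formatCompactSummary(), and the function name is hypothetical. It returns the body of <summary>, falling back to the analysis-stripped text when no <summary> tag is present.

```typescript
// Hypothetical approximation of the post-processing step described above.
function formatSummarySketch(raw: string): string {
  // Drop every <analysis>...</analysis> scratchpad (non-greedy, across newlines).
  const withoutAnalysis = raw.replace(/<analysis>[\s\S]*?<\/analysis>/g, '')
  // Keep only what sits inside <summary>...</summary>, if present.
  const match = withoutAnalysis.match(/<summary>([\s\S]*?)<\/summary>/)
  return (match ? match[1] : withoutAnalysis).trim()
}

const modelOutput = '<analysis>\nthinking...\n</analysis>\n\n<summary>\n1. Primary Request and Intent: fix the bug\n</summary>'
console.log(formatSummarySketch(modelOutput)) // 1. Primary Request and Intent: fix the bug
```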

`NO_TOOLS_PREAMBLE` — hard block on tool calls

The compaction summary call is a maxTurns: 1 LLM call. If this one turn makes a tool call (even a legitimate one), no text summary is returned — the whole compaction fails.

The source records a striking piece of telemetry in a comment:

The cache-sharing fork path inherits the parent’s full tool set (required for cache-key match), and on Sonnet 4.6+ adaptive-thinking models the model sometimes attempts a tool call despite the weaker trailer instruction. With maxTurns: 1, a denied tool call means no text output → falls through to the streaming fallback (2.79% on 4.6 vs 0.01% on 4.5).

So the fall-through rate was 2.79% on Sonnet 4.6, versus 0.01% on 4.5. To counter this, the prompt places hard blocks at both ends, sandwiching the main instructions:

CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.
- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.

[... main prompt ...]

REMINDER: Do NOT call any tools. Respond with plain text only — ...
Tool calls will be rejected and you will fail the task.

Takeaway for your own agent: use the prompt sandwich, repeating the critical constraint at the start and at the end, so that the long middle (the 9-section requirements) doesn't bury the hard “no tool calls” rule. The telemetry justified this style; it's not an aesthetic preference.
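A sandwich builder is only a few lines. The header and trailer below are abbreviated from the prompt quoted above, and `sandwichPrompt` is a hypothetical helper, not a function from the source.

```typescript
// Abbreviated from the header and trailer quoted above.
const NO_TOOLS_HEADER = 'CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.'
const NO_TOOLS_TRAILER = 'REMINDER: Do NOT call any tools. Respond with plain text only.'

// Wrap a long task prompt so the hard constraint is both the first and the last
// thing the model reads; the detailed requirements sit in between.
function sandwichPrompt(body: string, header: string = NO_TOOLS_HEADER, trailer: string = NO_TOOLS_TRAILER): string {
  return [header, body.trim(), trailer].join('\n\n')
}

const compactPrompt = sandwichPrompt('Summarize the conversation in 9 sections: ...')
```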

Three prompt variants

The source actually has three summarization prompts for different compaction scenarios:

Variant	Used when	Semantics
`BASE_COMPACT_PROMPT`	Full compaction	“summarize the conversation so far”
`PARTIAL_COMPACT_PROMPT`	Compact recent, older kept	“summarize the RECENT portion… The earlier messages are being kept intact and do NOT need to be summarized”
`PARTIAL_COMPACT_UP_TO_PROMPT`	Compact older, newer kept	“This summary will be placed at the start of a continuing session; newer messages that build on this context will follow after your summary”

The 9-section structure is identical but the positioning differs — the model needs to know “where this summary will sit in the final context” to decide “which background info must make it in.” A textbook example of prompt position-awareness.

Tier 4: Reactive Compact — 413 Fallback

Reactive compaction (feature-flagged REACTIVE_COMPACT) is the safety net for API 413 “prompt_too_long” errors.

Scenario: auto-compact's threshold uses an estimated token count (tokenCountWithEstimation), and estimates can be off. If the actual API token count exceeds the model's hard ceiling, the API returns 413 and the whole call fails.

Reactive’s logic: on a 413, auto-trigger a compaction and retry. Users don’t see the failure — they just notice this turn’s response was a bit slower.
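The catch-compact-retry logic can be sketched as follows. The function name and the `{ status: 413 }` error shape are assumptions, not Claude Code's API. The sketch retries once and lets a second failure propagate, which is the point where a circuit breaker takes over.

```typescript
// Assumed error shape: any thrown object carrying status 413.
function isPromptTooLong(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { status?: unknown }).status === 413
}

// Hypothetical wrapper: `send` performs the API call, `compact` shrinks the messages.
async function sendWithReactiveCompact<M>(
  messages: M[],
  send: (msgs: M[]) => Promise<string>,
  compact: (msgs: M[]) => Promise<M[]>,
): Promise<string> {
  try {
    return await send(messages)
  } catch (err) {
    if (!isPromptTooLong(err)) throw err // only prompt-too-long is handled here
    const compacted = await compact(messages)
    return send(compacted) // single retry; a second 413 propagates to the caller
  }
}
```

From the caller's side, the only visible difference is latency. That matches the “a bit slower” experience described above.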

Reactive compaction can't rescue every session, which is why auto-compact also has a circuit breaker, MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3. If the context is irrecoverably large (even reactive can't shrink it), further retries are just wasted API calls.

The source comment cites BigQuery data: “1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally”. In other words, before the circuit breaker, about 250K doomed API calls were made every day.
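A consecutive-failure breaker needs only a counter. This sketch reuses the limit of 3 quoted above. The class name is hypothetical, and resetting the count on success is an assumption implied by the word “consecutive.”

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

// Hypothetical per-session breaker for auto-compaction attempts.
class AutoCompactBreaker {
  private consecutiveFailures = 0

  // True while auto-compact may still be attempted in this session.
  canAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES
  }

  // Any success resets the count; only an unbroken run of failures trips the breaker.
  recordResult(succeeded: boolean): void {
    this.consecutiveFailures = succeeded ? 0 : this.consecutiveFailures + 1
  }
}
```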

Tier 5: Context Collapse (brief)

Context collapse (feature-flagged CONTEXT_COLLAPSE) is an independent context management system, triggering at 90% commit / 95% blocking thresholds — finer-grained than auto-compact’s effectiveWindow - 13k.

When context collapse is enabled, auto-compact stands down in shouldAutoCompact:

if (feature('CONTEXT_COLLAPSE')) {
  if (isContextCollapseEnabled()) {
    return false  // collapse handles it
  }
}

The source comment explains: “Autocompact firing at effective-13k (~93% of effective) sits right between collapse’s commit-start (90%) and blocking (95%), so it would race collapse and usually win, nuking granular context that collapse was about to save.”

Auto-compact explicitly yields to collapse to avoid “A preserves X, which B then compacts away.” This is a key principle when multiple systems coexist: hand off ownership explicitly instead of running both.

Boundary markers: how post-compaction history is organized

Every compaction inserts a special boundary message into the message stream:

// utils/messages.ts
createCompactBoundaryMessage(
  trigger: 'manual' | 'auto',
  preTokens: number,
  lastPreCompactMessageUuid?: UUID,
  userContext?: string,
  messagesSummarized?: number,
)
// → { type: 'system', subtype: 'compact_boundary', compactMetadata: {...} }

Microcompact has its own variant:

createMicrocompactBoundaryMessage(
  trigger: 'auto',
  preTokens: number,
  tokensSaved: number,
  compactedToolIds: string[],
  clearedAttachmentUUIDs: string[],
)
// → { type: 'system', subtype: 'microcompact_boundary', microcompactMetadata: {...} }

Three roles for boundaries:

UI display: explicitly tells the user “a compaction happened here” ('Conversation compacted' / 'Context microcompacted')
Re-compaction boundary: getMessagesAfterCompactBoundary(messages) scans back to the most recent boundary and only compacts messages after it — already-compacted portions don’t get re-compacted
Audit and telemetry: compactMetadata records trigger / preTokens / messagesSummarized for post-hoc analysis

Takeaway for your own agent: compaction is an event in the message stream, not a context reset. Mark it with an explicit boundary; downstream code (UI, next compaction, audit) can all make decisions against that boundary.
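The re-compaction rule can be sketched as a backward scan. This is not the source's getMessagesAfterCompactBoundary(): the message shape and function names are hypothetical. The sketch treats only `compact_boundary` as a cut point and returns whatever follows it.

```typescript
// Hypothetical, simplified message shape.
interface StreamMessage {
  type: 'user' | 'assistant' | 'system'
  subtype?: string
  text: string
}

function isCompactBoundary(m: StreamMessage): boolean {
  return m.type === 'system' && m.subtype === 'compact_boundary'
}

// Scan back to the most recent boundary; only what follows it is eligible
// for the next compaction. With no boundary, the whole history is eligible.
function messagesAfterLastBoundary(messages: StreamMessage[]): StreamMessage[] {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (isCompactBoundary(messages[i])) return messages.slice(i + 1)
  }
  return messages
}
```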

PreCompact Hook — a programmable entry point

Claude Code has a PreCompact hook event that fires before the summarization LLM call:

// commands/compact/compact.ts
const [hookResult, cacheSafeParams] = await Promise.all([
  executePreCompactHooks(
    { trigger: 'manual', customInstructions: customInstructions || null },
    context.abortController.signal,
  ),
  getCacheSharingParams(context, messages),
])

const mergedInstructions = mergeHookInstructions(
  customInstructions,
  hookResult.newCustomInstructions,
)

The hook can do two things:

Append custom instructions — hookResult.newCustomInstructions merges with the user’s /compact X arg into the summary LLM prompt
Return user-visible text — hookResult.userDisplayMessage joins the compaction-complete UI notification

Typical use: inject project-specific “must preserve X” rules, e.g. “preserve all decisions about auth design” or “preserve DB schema code snippets.” These don't belong in CLAUDE.md (compaction is an event, CLAUDE.md is static), and typing them by hand on every /compact doesn't scale. A hook is the natural fit.

Hooks and getCacheSharingParams run concurrently (Promise.all). Computing the cache params walks all tools to build the system prompt, while hooks spawn subprocesses. The two are independent, so there's no need to serialize them. It's a small source-level optimization.
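The concurrent preparation and the instruction merge can be sketched like this. mergeHookInstructions() exists in the source, but its exact rule isn't shown here. Joining non-empty strings with a blank line is therefore an assumption, and both function names below are hypothetical.

```typescript
// Hypothetical merge rule: keep non-empty instructions, joined by a blank line.
function mergeInstructions(user: string | null, fromHook: string | null): string | null {
  const parts = [user, fromHook].filter((s): s is string => !!s && s.trim().length > 0)
  return parts.length > 0 ? parts.join('\n\n') : null
}

// Hooks and cache-param computation are independent, so run them concurrently.
async function prepareCompaction(
  userInstructions: string | null,
  runHooks: () => Promise<{ newCustomInstructions: string | null }>,
  buildCacheParams: () => Promise<string>,
) {
  const [hookResult, cacheParams] = await Promise.all([runHooks(), buildCacheParams()])
  return { instructions: mergeInstructions(userInstructions, hookResult.newCustomInstructions), cacheParams }
}
```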

Note: the source also has PostCompact (markPostCompaction(), runPostCompactCleanup(), usePostCompactSurvey) — the hook lifecycle covers pre / during / post compaction.

How the summary gets re-injected

The summary from the LLM is wrapped by getCompactUserSummaryMessage as a user message and inserted back into the conversation:

// services/compact/prompt.ts
let baseSummary = `This session is being continued from a previous conversation that ran out of context.
The summary below covers the earlier portion of the conversation.

${formattedSummary}`

if (transcriptPath) {
  baseSummary += `\n\nIf you need specific details from before compaction (like exact code snippets,
  error messages, or content you generated), read the full transcript at: ${transcriptPath}`
}

if (recentMessagesPreserved) {
  baseSummary += `\n\nRecent messages are preserved verbatim.`
}

Three details worth noting:

Explicit “this is a continuation” — the very first line says “This session is being continued” to prevent the model from treating this as a new conversation
Transcript path backpointer — if the summary missed something, the model can Read the full transcript file to retrieve it. Compaction isn’t deletion; the raw text on disk is still there
“Recent messages are preserved verbatim” — for partial compaction, tell the model “the recent messages are raw” so it doesn’t treat them as more summary

For autonomous mode (PROACTIVE / KAIROS feature flags), an extra line is appended:

You are running in autonomous/proactive mode. This is NOT a first wake-up — you were already
working autonomously before compaction. Continue your work loop: pick up where you left off
based on the summary above. Do not greet the user or ask what to work on.

This tells the autonomous-mode agent explicitly that it is continuing, not waking up fresh: don't greet the user, don't ask what to do.

Cache coordination: impact on prompt cache

Compaction is the cache’s enemy — it rewrites conversation history, changing the prefix. Claude Code has two mechanisms to limit the damage:

1. `notifyCompaction` — avoid cache miss false alarms

// services/api/promptCacheBreakDetection.ts
notifyCompaction(querySource, agentId)

Claude Code has a prompt cache break detection telemetry system — unexpected cache misses trigger an alert. Compaction itself drops cache reads to zero; without notifying the alert system, every compaction would falsely trigger cache-miss alerts.

The source cites BigQuery data: “BQ 2026-03-01: missing this made 20% of tengu_prompt_cache_break events false positives”. Without this notify, 20% of cache break alerts were noise.

2. Cached microcompact — zero-invalidation compaction

Path B from earlier: through the cache editing API, delete specific tool result blocks from the server-side cache while leaving local prefix unchanged. Next API call after compaction still hits the cache.

Takeaway for your own agent: any operation that touches the message stream must consider cache impact. Some operations (like appending system reminders) are local; others (like rewriting history) are global. At minimum, your alert system needs an opt-out signal; more advanced would be server-side cache editing.
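The alert opt-out can be sketched as a detector that ignores the first cache-read drop after a compaction notice. The class, its method names, and the 50%-drop rule are assumptions for illustration. The real notifyCompaction(querySource, agentId) takes arguments that this sketch leaves out.

```typescript
// Hypothetical detector: flags a sharp drop in cache-read tokens between calls,
// unless a compaction was announced just before.
class CacheBreakDetector {
  private lastCacheRead: number | null = null
  private expectBreak = false
  readonly alerts: number[] = []

  notifyCompaction(): void {
    this.expectBreak = true
  }

  recordCall(cacheReadTokens: number): void {
    const prev = this.lastCacheRead
    // Assumed rule: a drop below half of the previous cache read counts as a break.
    const broke = prev !== null && prev > 0 && cacheReadTokens < prev * 0.5
    if (broke && !this.expectBreak) this.alerts.push(cacheReadTokens)
    this.expectBreak = false // the exemption covers only the next call
    this.lastCacheRead = cacheReadTokens
  }
}
```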

Circuit breaker + environment overrides — the adjustable knobs

// services/compact/autoCompact.ts
export const AUTOCOMPACT_BUFFER_TOKENS = 13_000
export const WARNING_THRESHOLD_BUFFER_TOKENS = 20_000
export const ERROR_THRESHOLD_BUFFER_TOKENS = 20_000
export const MANUAL_COMPACT_BUFFER_TOKENS = 3_000
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000

Environment overrides:

Variable	Effect
`DISABLE_COMPACT`	Fully off
`DISABLE_AUTO_COMPACT`	Off for auto, manual still works
`CLAUDE_CODE_AUTO_COMPACT_WINDOW`	Force a smaller effective window
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE`	Testing: percentage-based trigger
`CLAUDE_CODE_BLOCKING_LIMIT_OVERRIDE`	Testing: override the blocking limit

User settings.json also exposes autoCompactEnabled.

Session Memory Compact's three parameters are pulled from GrowthBook (remote config), with these defaults:

DEFAULT_SM_COMPACT_CONFIG = {
  minTokens: 10_000,
  minTextBlockMessages: 5,
  maxTokens: 40_000,
}

This means Anthropic can tune compaction strategy without shipping a new build — for a tool running on tens of thousands of machines, this is essential ops capability.

Failure modes and error messages

The source explicitly defines three compaction failure classes:

export const ERROR_MESSAGE_NOT_ENOUGH_MESSAGES = '...'     // too little to compact
export const ERROR_MESSAGE_INCOMPLETE_RESPONSE = '...'     // summary LLM response incomplete
export const ERROR_MESSAGE_USER_ABORT = '...'              // user Ctrl+C

Reactive’s failure reasons are more granular:

'too_few_groups' | 'aborted' | 'exhausted' | 'error' | 'media_unstrippable'

media_unstrippable is especially interesting: some media block (image, attachment) can't be cleared, so reactive can't shrink the context enough. This is the edge case where compaction's basic assumption, that content can be removed, breaks because some blocks are required at the API level.

`/compact` `/clear` `/rewind` — three user-side controls

Claude Code exposes more than just /compact:

Command	What it does	Underlying mechanism
`/compact [instructions]`	Manually trigger SM compact → fallback to LLM summary	Tiers 2 + 3
`/clear`	Clears conversation history; system prompt / CLAUDE.md / memory preserved	No compaction, plain truncation
`/rewind`	Restore to a previous checkpoint (conversation + code)	Git-like snapshot
`claude --resume`	Resume from last exit; pre-compacted in the background	Source comment: “Background jobs that summarize previous conversations for the `claude --resume` feature”

That last one matters. According to that comment, the summarization for resume is done ahead of time by background jobs. That's why resuming a session from hours ago is near-instant: the work doesn't start when you hit resume.

Users can also put compaction preferences directly in CLAUDE.md:

# Compact instructions

When you are using compact, please focus on test output and code changes

Essentially project-level compaction preferences — lighter than a hook, more automated than manual entry each time.

Takeaways for building your own agent

Compaction is layered, not a single function. Minimum three layers: tool-result clearing (zero LLM) → selective raw preservation (session-memory style) → LLM summarization. Each tier kicks in only when the prior can’t hold; cost monotonically increases
Thresholds are effectiveWindow - buffer, not “percentage of model window.” Buffer must leave room for the summary output itself — 20k is Claude Code’s empirical value from telemetry (p99.99)
Summary prompts must be structured. Claude Code’s 9 sections aren’t arbitrary — each answers “what’s the task / where am I / how far along / what’s next.” Section 6 (all user messages) and section 9 (next step with quoted context) are key anti-drift measures
<analysis> + <summary> dual blocks. Let the LLM do chain-of-thought in <analysis>; only <summary> is kept. Strip <analysis> post-generation — you get thinking quality without polluting context
NO_TOOLS_PREAMBLE sandwich: repeat hard constraints at both ends of a long prompt. Models “forget” middle-of-prompt rules
Boundary messages: mark compaction events with explicit system messages, not silent array mutation. UI, next compaction, and audit all rely on the boundary
PreCompact hook: leave compaction a programmable entry point — project maintainers can inject “must preserve X” rules, more flexible than CLAUDE.md
Circuit breaker: N consecutive failures → stop. Otherwise irrecoverable sessions keep retrying indefinitely. Before its breaker existed, Claude Code was wasting ~250K API calls a day globally (Claude Code's real-world lesson)
Cache editing API enables zero-invalidation compaction: with the Anthropic API’s cache_edits mechanism, you can delete blocks from the cache without invalidating the prefix
Compaction + alert system coordination: compaction naturally triggers cache misses. Your alert system must distinguish “real cache break” from “compaction-induced,” or 20% of alerts are noise (Claude Code’s real data)