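Context Compaction

Three numbers that are easy to confuse

Start by separating three entirely different token counts; mixing them up leads to a wrong picture of the whole mechanism: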

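Number	Meaning	How Claude Code computes it
Model context window	The model's hard limit; exceeding it returns an error	`getContextWindowForModel(model)`: for the Opus 4.7 1M variant this is 1M
Effective window	The usable window after reserving room for the summary output	`contextWindow - min(maxOutput, 20000)`
Auto-compact threshold	The point where automatic compaction triggers	`effectiveWindow - 13000`, from the source constant `AUTOCOMPACT_BUFFER_TOKENS = 13000`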

Plugging in the numbers:

200k model (non-1M): effectiveWindow ≈ 180k → auto-compact threshold ≈ 167k
1M model (Opus 4.7 1M / Sonnet 1M): effectiveWindow ≈ 980k → auto-compact threshold ≈ 967k
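
The arithmetic above can be reproduced with a minimal sketch. The two constants are the ones quoted in this article; the function names and the 32k max-output value are assumptions made only for illustration.

```typescript
// Constants quoted in the text.
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000;

// Usable window once room for the summary output is reserved.
function effectiveWindow(contextWindow: number, maxOutput: number): number {
  return contextWindow - Math.min(maxOutput, MAX_OUTPUT_TOKENS_FOR_SUMMARY);
}

// Token count at which automatic compaction would trigger.
function autoCompactThreshold(contextWindow: number, maxOutput: number): number {
  return effectiveWindow(contextWindow, maxOutput) - AUTOCOMPACT_BUFFER_TOKENS;
}

// Assumed: a model whose max output is at least 20k, so the full reserve applies.
const ASSUMED_MAX_OUTPUT = 32_000;
console.log(autoCompactThreshold(200_000, ASSUMED_MAX_OUTPUT));   // 167000
console.log(autoCompactThreshold(1_000_000, ASSUMED_MAX_OUTPUT)); // 967000
```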

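So the impression that "Claude Code has a 200k window" is model-dependent: on a 200k model the 167k threshold really is close to 200k, while on a 1M model the threshold is nowhere near 200k. What still makes users feel pressure there is microcompact, covered below, which starts clearing tool results long before auto-compact triggers.

The "20k reserved for output" figure is also worth remembering: according to Claude Code's telemetry, the p99.99 of compaction summary output is 17,387 tokens, and the 20k reserve covers the long tail of that distribution. The number is not a guess; it was tuned against production data.

Source location: claude-code/services/compact/autoCompact.ts. All threshold constants, override environment variables, and circuit breaker logic live in this one file.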

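The five-tier compaction pipeline

Compaction in Claude Code is not a single function. The trigger paths and order of application are:

Tier	File	Cost	What it does
1. Microcompact	`services/compact/microCompact.ts`	Zero LLM calls	Clears old tool results (8 tool types, including Read / Bash / Grep / Edit) and replaces them with the placeholder `[Old tool result content cleared]`
2. Session Memory	`services/compact/sessionMemoryCompact.ts`	Zero LLM calls	Uses already-extracted session memory to trim history down to a 10k-40k token window
3. Auto-compact	`services/compact/compact.ts`	1 summarization LLM call	Calls the model to generate a 9-section structured summary that replaces the whole history
4. Reactive compact	`services/compact/reactiveCompact.ts` (feature-flagged)	1 summarization LLM call	Handles the API's 413 prompt_too_long error specifically: shrinks the context automatically, then retries
5. Context collapse	`services/contextCollapse/` (feature-flagged)	Incremental	A separate, complete context management system that triggers at 90% / 95% thresholds

The /compact command runs 2 → 3 (session memory is tried first, falling back to the LLM summary only when it has nothing to compact). The auto-compact trigger path runs 1 (every turn) → 2 → 3. Reactive compact and context collapse are a dedicated safety net and an alternative system, respectively.

Key design principle: use the cheap methods first and call the LLM only when they cannot cope. Microcompact runs every turn at almost no cost; real LLM summarization starts only when the cheaper tiers cannot free enough space.

The sections below take the tiers one at a time.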

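Tier 1: Microcompact (tool result cleanup)

The core idea of microcompact: most of the token budget is consumed by old tool results, such as a Read of a 3000-line file, a Bash command that prints a long log, or a Grep with 500 matches. Once these are no longer referenced they can be cleared without involving an LLM.

Which tools count as "compactable"

The source has an explicit COMPACTABLE_TOOLS set: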

const COMPACTABLE_TOOLS = new Set<string>([
  FILE_READ_TOOL_NAME,     // Read
  ...SHELL_TOOL_NAMES,     // Bash
  GREP_TOOL_NAME,
  GLOB_TOOL_NAME,
  WEB_SEARCH_TOOL_NAME,
  WEB_FETCH_TOOL_NAME,
  FILE_EDIT_TOOL_NAME,     // Edit
  FILE_WRITE_TOOL_NAME,    // Write
])

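Note: TodoWrite / Task / sub-agent delegation results are not in this set. They are stateful tools, and compacting them would make the agent forget task progress. Only read-only or idempotent tools whose output can be re-derived go into the set.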
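
A minimal sketch of the clearing step, under stated assumptions: the message shape, the literal tool names, and the `keepRecent` parameter are all hypothetical and stand in for whatever the real implementation uses. Only the placeholder string and the "stateful tools are exempt" rule come from the text.

```typescript
// Hypothetical, simplified tool-result shape; the real types are richer.
type ToolResult = { toolName: string; content: string };

const PLACEHOLDER = "[Old tool result content cleared]";

// Assumed tool names standing in for the real *_TOOL_NAME constants.
const COMPACTABLE = new Set(["Read", "Bash", "Grep", "Glob", "WebSearch", "WebFetch", "Edit", "Write"]);

// Replace every compactable result except the newest `keepRecent` with the placeholder.
// `keepRecent` is an assumption of this sketch, not a documented parameter.
function clearOldToolResults(results: ToolResult[], keepRecent: number): ToolResult[] {
  const compactableIdx = results
    .map((r, i) => (COMPACTABLE.has(r.toolName) ? i : -1))
    .filter((i) => i >= 0);
  const toClear = new Set(compactableIdx.slice(0, Math.max(0, compactableIdx.length - keepRecent)));
  return results.map((r, i) => (toClear.has(i) ? { ...r, content: PLACEHOLDER } : r));
}

const history: ToolResult[] = [
  { toolName: "Read", content: "3000 lines..." },
  { toolName: "TodoWrite", content: "task list" }, // stateful, never cleared
  { toolName: "Grep", content: "500 matches..." },
  { toolName: "Bash", content: "latest log" },
];
const compacted = clearOldToolResults(history, 1);
```

In this example the Read and Grep results are replaced, the TodoWrite result is untouched because it is not in the set, and the newest compactable result (Bash) survives.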

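Two microcompact paths

Path A: time-based microcompact

Trigger: the gap since the last assistant message exceeds a threshold (meaning the server-side prompt cache has expired).

Logic: once the cache has expired the whole prefix must be rewritten anyway, so this is a good moment to swap old tool results for placeholders and make the rewritten prompt shorter.

Output: a microcompact_boundary system message is inserted, recording { trigger, preTokens, tokensSaved, compactedToolIds, clearedAttachmentUUIDs }.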
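
The trigger and the boundary record could look roughly like the sketch below. The 5-minute threshold, the function names, and the interface are assumptions for illustration; the text does not state the real threshold value.

```typescript
// Assumed value for illustration only; the real threshold is not given in the text.
const ASSUMED_CACHE_TTL_MS = 5 * 60 * 1000;

// True when the gap since the last assistant message suggests the prompt cache has expired.
function shouldTimeBasedMicrocompact(
  lastAssistantAtMs: number,
  nowMs: number,
  ttlMs: number = ASSUMED_CACHE_TTL_MS,
): boolean {
  return nowMs - lastAssistantAtMs > ttlMs;
}

// Hypothetical record mirroring the fields listed in the text.
interface MicrocompactBoundary {
  trigger: "auto";
  preTokens: number;
  tokensSaved: number;
  compactedToolIds: string[];
  clearedAttachmentUUIDs: string[];
}

function makeBoundary(preTokens: number, tokensSaved: number, compactedToolIds: string[]): MicrocompactBoundary {
  return { trigger: "auto", preTokens, tokensSaved, compactedToolIds, clearedAttachmentUUIDs: [] };
}
```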

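Path B: cached microcompact (feature-flagged `CACHED_MICROCOMPACT`)

This path is more subtle: instead of modifying the local message array, it uses the Anthropic API's cache editing mechanism to delete specific tool results from the server-side cache only. The local prefix is unchanged, so the cache still hits.

Path A costs a full prefix rewrite, while Path B costs close to nothing (it only tells the server to "delete these blocks"). But cache editing is currently main-thread only and model-specific (not every model supports it), so the two paths coexist.

Source: cachedMicrocompactPath and maybeTimeBasedMicrocompact in services/compact/microCompact.ts

Special handling for images

Images and PDFs are estimated at a fixed 2000 tokens each, regardless of original size. When cleared, they are replaced with a placeholder just like text tool results, and the tokens freed are counted as 2000.

Why this tier is the first line of defense

Microcompact runs every turn, earlier than any LLM summarization. If it works well, a session may go for hours without ever triggering a real auto-compact. The soft boundary "somewhere around 200k" that users report is largely microcompact working quietly in the background to keep window usage low.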

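Tier 2: Session Memory Compaction

This is an experimental mechanism (the source comments say "EXPERIMENT" outright): it compacts based on the session memory (long-term memory across conversations) that Claude Code already extracts continuously in the background.

Configuration (DEFAULT_SM_COMPACT_CONFIG):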

{
  minTokens: 10_000,           // keep at least 10k tokens of original text
  minTextBlockMessages: 5,     // keep at least 5 messages that contain text blocks
  maxTokens: 40_000,           // keep at most 40k tokens
}

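Logic: old conversation that has already been distilled into session memory is dropped outright, because its key points are stored in memory and the copy in the conversation history is redundant. What is kept is the most recent 10k-40k tokens that memory has not yet absorbed.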
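
One plausible reading of how the three config values interact is sketched below: walk backwards from the newest message, keep going until both minimums are met, and never exceed the maximum. The message shape and the function name are hypothetical, and the real selection logic may differ.

```typescript
interface SmCompactConfig { minTokens: number; minTextBlockMessages: number; maxTokens: number }

// Hypothetical message shape with a precomputed token count.
interface Msg { id: string; tokens: number; hasText: boolean }

// Walk backwards from the newest message, keeping messages until both minimums
// are met, and never letting the kept tail exceed maxTokens.
function selectTailToKeep(messages: Msg[], cfg: SmCompactConfig): Msg[] {
  const kept: Msg[] = [];
  let tokens = 0;
  let textMessages = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    const minimumsMet = tokens >= cfg.minTokens && textMessages >= cfg.minTextBlockMessages;
    if (minimumsMet || tokens + m.tokens > cfg.maxTokens) break;
    kept.unshift(m); // preserve chronological order
    tokens += m.tokens;
    if (m.hasText) textMessages++;
  }
  return kept;
}

const cfg: SmCompactConfig = { minTokens: 10_000, minTextBlockMessages: 5, maxTokens: 40_000 };
const tenMessages: Msg[] = Array.from({ length: 10 }, (_, i) => ({ id: `m${i}`, tokens: 4_000, hasText: true }));
const tail = selectTailToKeep(tenMessages, cfg); // keeps m5..m9 (20k tokens, 5 text messages)
```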

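This path does not support custom instructions: a user's /compact preserve X skips this tier entirely and goes to the next one. The reason is that session memory content is already extracted and structured, and layering user instructions on top would muddle the semantics.

Both /compact (with no arguments) and auto-compact try this tier first. If session memory is empty or not applicable, they fall back to Tier 3.

Tier 3: Auto-compact (LLM-written continuation summary)

Reaching this tier means neither microcompact nor session memory could free enough space. This is the compaction that actually calls the model to write a summary.

Trigger function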

// services/compact/autoCompact.ts
export async function shouldAutoCompact(messages, model, querySource, snipTokensFreed) {
  // Recursion guard: querySource === 'session_memory' | 'compact' is rejected immediately
  // (these forked agents would deadlock if they triggered compaction recursively)

  if (!isAutoCompactEnabled()) return false

  const tokenCount = tokenCountWithEstimation(messages) - snipTokensFreed
  const threshold = getAutoCompactThreshold(model)  // effectiveWindow - 13000
  return tokenCount >= threshold
}

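A few details to note:

snipTokensFreed: the REPL's "snip" operation has already removed some messages, but the API-side usage still reflects the pre-removal state; this parameter subtracts the freed amount
Recursion guard: compaction is itself a forked LLM call, and if its own context overflowed and triggered compaction again it would deadlock, so paths with a marked querySource are rejected outright

The summary prompt: 9 structured sections

The actual prompt (BASE_COMPACT_PROMPT in services/compact/prompt.ts) asks the model to produce 9 sections of structured content:

#	Section	Emphasis
1	Primary Request and Intent	Capture every explicit user request in detail
2	Key Technical Concepts	Tech stacks and frameworks discussed
3	Files and Code Sections	Files modified or read, with full code snippets; pay special attention to recent messages
4	Errors and fixes	Each error, how it was fixed, and the user's feedback on it
5	Problem Solving	Problems solved and troubleshooting still in progress
6	All user messages	Every user message that is not a tool result, used to spot feedback and shifts in intent
7	Pending Tasks	Things the user explicitly asked for that are not done yet
8	Current Work	A precise description of what was in progress when compaction interrupted
9	Optional Next Step	The next action, which must carry verbatim quotes from the recent conversation to prevent drift in understanding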

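Why these 9 sections (the logic implicit in the source):

Sections 1 / 6 / 7 answer "what is the task", so the goal is not lost
Section 3 answers "where is the work", keeping file paths and code snippets so the resuming agent can orient itself
Section 4 answers "which pitfalls have already been hit", so it does not retrace old mistakes
Sections 8 / 9 answer "what comes next", with the next step specific down to file and function

Section 9 is especially important: "include direct quotes from the most recent conversation showing exactly what task you were working on". Claude Code requires the summary to contain verbatim quotes from the recent conversation so the LLM cannot rewrite the user's intent while summarizing.

`<analysis>` + `<summary>` two-block output

The model's output is not the 9 sections directly, but two XML blocks: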

<analysis>
[Reasoning process: scratch paper that improves summary quality but has no long-term value]
</analysis>

<summary>
1. Primary Request and Intent: ...
2. Key Technical Concepts: ...
...
</summary>

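formatCompactSummary() deletes the <analysis> block outright; it is only scratch paper for the LLM's chain-of-thought. Only the 9 sections inside <summary> are kept.

This is a "rich while thinking, lean when persisted" prompt-engineering technique: it gets the quality gain of chain-of-thought without letting the draft pollute the context.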
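
A minimal sketch of the stripping step follows. `stripAnalysis` is a hypothetical stand-in whose behavior is inferred from the description here; it is not the actual formatCompactSummary() implementation.

```typescript
// Hypothetical stand-in for formatCompactSummary(); behavior inferred from the text.
function stripAnalysis(raw: string): string {
  // Drop every <analysis>...</analysis> block ([\s\S] matches across newlines).
  const withoutAnalysis = raw.replace(/<analysis>[\s\S]*?<\/analysis>/g, "");
  // Prefer the body of <summary>; fall back to whatever text remains.
  const match = withoutAnalysis.match(/<summary>([\s\S]*?)<\/summary>/);
  return (match ? match[1] : withoutAnalysis).trim();
}

const modelOutput = `<analysis>
draft reasoning
</analysis>

<summary>
1. Primary Request and Intent: fix the login bug
</summary>`;

console.log(stripAnalysis(modelOutput));
// prints "1. Primary Request and Intent: fix the login bug"
```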

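`NO_TOOLS_PREAMBLE`: hard-blocking tool calls

The summarization call is a single LLM call with maxTurns: 1. If the model calls a tool during that turn (even a legitimate one), no text summary comes back and the whole compaction fails.

The source carries a striking piece of telemetry as a comment: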

The cache-sharing fork path inherits the parent’s full tool set (required for cache-key match), and on Sonnet 4.6+ adaptive-thinking models the model sometimes attempts a tool call despite the weaker trailer instruction. With maxTurns: 1, a denied tool call means no text output → falls through to the streaming fallback (2.79% on 4.6 vs 0.01% on 4.5).

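On Sonnet 4.6, 2.79% of these calls fell through to the fallback after an attempted tool call (versus 0.01% on 4.5). To counter this, the prompt blocks tool calls at both the start and the end: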

CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.
- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.

[... main prompt ...]

REMINDER: Do NOT call any tools. Respond with plain text only — ...
Tool calls will be rejected and you will fail the task.

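Takeaway for your own agent: the prompt sandwich. Repeat critical constraints at both the start and the end of a long prompt, so the detailed rules in the middle (the requirements for the 9 sections) do not make the model forget the hard "do not call tools" constraint. Telemetry is what forced the prompt to be written this way; it is not a stylistic preference.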
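
The sandwich can be sketched as a small builder. The constraint strings are abbreviated from the prompt quoted in this section; `NO_TOOLS_TRAILER` and `buildSandwichedPrompt` are hypothetical names introduced only for this example.

```typescript
// Abbreviated constraint text; the fuller wording is quoted in this section.
const NO_TOOLS_PREAMBLE = "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.";
// Hypothetical name for the closing reminder.
const NO_TOOLS_TRAILER = "REMINDER: Do NOT call any tools. Respond with plain text only.";

// Wrap the detailed instructions between the two reminders.
function buildSandwichedPrompt(body: string, customInstructions?: string): string {
  const parts = [NO_TOOLS_PREAMBLE, body];
  if (customInstructions) parts.push(`Additional instructions:\n${customInstructions}`);
  parts.push(NO_TOOLS_TRAILER);
  return parts.join("\n\n");
}

const sandwiched = buildSandwichedPrompt("Summarize the conversation in 9 sections...", "Keep auth decisions.");
```

Whatever is added to the middle, including user-supplied instructions, the hard constraint stays the first and last thing the model reads.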

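Three prompt variants

The source actually has three different summary prompts, one for each compaction scenario:

Variant	Used for	Semantics
`BASE_COMPACT_PROMPT`	Full compaction	"summarize the conversation so far"
`PARTIAL_COMPACT_PROMPT`	Compacting the recent part, keeping the older part	"summarize the RECENT portion… The earlier messages are being kept intact and do NOT need to be summarized"
`PARTIAL_COMPACT_UP_TO_PROMPT`	Compacting the older part, keeping the newer part	"This summary will be placed at the start of a continuing session; newer messages that build on this context will follow after your summary"

All three variants share the same 9-section structure but differ in positioning: the model needs to know where this summary will sit in the final context before it can sensibly decide which background information must go into it. It is a good example of telling the model exactly where its output will be used.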

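Tier 4: Reactive Compact (413 fallback)

Reactive compaction (feature-flagged REACTIVE_COMPACT) is the last safety net, aimed specifically at the API's 413 "prompt_too_long" error.

Scenario: the token count that auto-compact checks against its threshold is an estimate (tokenCountWithEstimation), and estimates have error. If the actual API token count exceeds the model's hard limit, the API returns 413 and the whole call fails.

Reactive logic: on a 413, trigger a compaction automatically and then retry. The user does not see a failure; the turn just feels a little slower.

This is also why auto-compact has a circuit breaker, MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3: if the context is truly, irrecoverably over the limit (reactive cannot shrink it either), retrying again and again only wastes API calls.

The comments cite a BigQuery figure: "1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally". Before the circuit breaker was added, unbounded retries were wasting about 250,000 API calls per day.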
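
A consecutive-failure breaker of this kind can be sketched in a few lines. The class and method names are hypothetical; only the limit of 3 comes from the source constant.

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// Hypothetical breaker: counts consecutive failures and stops further attempts at the limit.
class CompactCircuitBreaker {
  private consecutiveFailures = 0;

  canAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0; // any success closes the breaker again
  }

  recordFailure(): void {
    this.consecutiveFailures++;
  }
}

const breaker = new CompactCircuitBreaker();
let attempts = 0;
// Simulate a session whose compaction always fails.
while (breaker.canAttempt()) {
  attempts++;
  breaker.recordFailure();
}
console.log(attempts); // 3
```

The point of counting consecutive failures rather than total failures is that an occasional transient error should not permanently disable compaction for a long session.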

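Tier 5: Context Collapse (in brief)

Context collapse (feature-flagged CONTEXT_COLLAPSE) is a separate, independent context management system that triggers at 90% (commit) and 95% (blocking) thresholds, which is finer-grained than auto-compact's single effectiveWindow - 13k trigger.

When context collapse is enabled, auto-compact actively yields inside shouldAutoCompact: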

if (feature('CONTEXT_COLLAPSE')) {
  if (isContextCollapseEnabled()) {
    return false  // collapse will handle it itself
  }
}

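A source comment explains why: "Autocompact firing at effective-13k (~93% of effective) sits right between collapse’s commit-start (90%) and blocking (95%), so it would race collapse and usually win, nuking granular context that collapse was about to save."

The two systems yield to each other so that what one wants to keep is not compacted away by the other. This is a key design principle when multiple systems coexist: an explicit ownership handoff rather than both running at once.

Boundary markers: how compacted history is organized

Every compaction inserts a special boundary message into the message stream: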

// utils/messages.ts
createCompactBoundaryMessage(
  trigger: 'manual' | 'auto',
  preTokens: number,
  lastPreCompactMessageUuid?: UUID,
  userContext?: string,
  messagesSummarized?: number,
)
// → { type: 'system', subtype: 'compact_boundary', compactMetadata: {...} }

Microcompact has its own variant:

createMicrocompactBoundaryMessage(
  trigger: 'auto',
  preTokens: number,
  tokensSaved: number,
  compactedToolIds: string[],
  clearedAttachmentUUIDs: string[],
)
// → { type: 'system', subtype: 'microcompact_boundary', microcompactMetadata: {...} }

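A boundary serves three purposes:

UI display: tells the user explicitly that compaction happened here ('Conversation compacted' / 'Context microcompacted')
A limit for re-compaction: getMessagesAfterCompactBoundary(messages) scans to the most recent boundary and compacts only the messages after it, so parts that were already compacted are not compacted again
Audit and telemetry: compactMetadata records trigger / preTokens / messagesSummarized for analyzing compaction effectiveness after the fact

Takeaway for your own agent: compaction is an event in the message stream, not a context reset. Record it with an explicit boundary, and all downstream code (UI, the next compaction, audit) can make decisions based on that boundary.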
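
The "scan to the most recent boundary" idea can be sketched as follows. The message type is a simplified assumption, and `messagesAfterLastBoundary` is a hypothetical helper, not the real getMessagesAfterCompactBoundary.

```typescript
// Hypothetical, simplified message type.
type Message =
  | { type: "user" | "assistant"; text: string }
  | { type: "system"; subtype: "compact_boundary"; preTokens: number };

// Return only the messages after the most recent compact boundary
// (or the whole array when no compaction has happened yet).
function messagesAfterLastBoundary(messages: Message[]): Message[] {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m.type === "system" && m.subtype === "compact_boundary") {
      return messages.slice(i + 1);
    }
  }
  return messages;
}

const stream: Message[] = [
  { type: "user", text: "old request" },
  { type: "system", subtype: "compact_boundary", preTokens: 170_000 },
  { type: "user", text: "summary of earlier work" },
  { type: "assistant", text: "continuing" },
];
const live = messagesAfterLastBoundary(stream); // the last two messages
```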

PreCompact hook: a programmable entry point for compaction

Claude Code has a PreCompact hook event that fires before the summarization LLM call:

// commands/compact/compact.ts
const [hookResult, cacheSafeParams] = await Promise.all([
  executePreCompactHooks(
    { trigger: 'manual', customInstructions: customInstructions || null },
    context.abortController.signal,
  ),
  getCacheSharingParams(context, messages),
])

const mergedInstructions = mergeHookInstructions(
  customInstructions,
  hookResult.newCustomInstructions,
)

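A hook can do two things:

Append custom instructions: hookResult.newCustomInstructions is merged with the user's /compact X argument and passed to the summarization LLM together
Return a user-visible message: hookResult.userDisplayMessage is appended to the UI notification shown when compaction completes

Typical use: injecting project-specific rules of the form "this compaction must preserve X", for example "preserve all decisions about the auth design" or "preserve code snippets related to the DB schema". Such rules do not fit well in CLAUDE.md (compaction runs repeatedly and CLAUDE.md is not dynamic), and typing them by hand each time is impractical, so a hook is the natural solution.

The hook and getCacheSharingParams run concurrently (Promise.all): the cache computation has to walk every tool to build the system prompt, while the hook spawns a child process. The two are independent, so there is no need to run them serially. It is a small optimization in the source.

Note: the source also has a PostCompact concept (markPostCompaction(), runPostCompactCleanup(), usePostCompactSurvey), so the full hook lifecycle covers before, during, and after compaction.

How the summary is injected back

The text produced by the summarization LLM is wrapped by getCompactUserSummaryMessage into a user message and injected back into the conversation: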

// services/compact/prompt.ts
let baseSummary = `This session is being continued from a previous conversation that ran out of context.
The summary below covers the earlier portion of the conversation.

${formattedSummary}`

if (transcriptPath) {
  baseSummary += `\n\nIf you need specific details from before compaction (like exact code snippets,
  error messages, or content you generated), read the full transcript at: ${transcriptPath}`
}

if (recentMessagesPreserved) {
  baseSummary += `\n\nRecent messages are preserved verbatim.`
}

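Three details are worth noting:

Stating explicitly that this is a continuation: the opening sentence puts "This session is being continued" up front so the model does not mistake it for a new conversation
A pointer back to the transcript: if the summary misses a detail, the model can Read the full transcript file to recover it. Compaction is not deletion; the original text is still on disk
"Recent messages are preserved verbatim": when a partial compaction keeps the most recent messages intact, this tells the model that they are original text; otherwise the model might treat the originals as part of the summary

For autonomous mode (the PROACTIVE / KAIROS feature flags), the following is also appended: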

You are running in autonomous/proactive mode. This is NOT a first wake-up — you were already
working autonomously before compaction. Continue your work loop: pick up where you left off
based on the summary above. Do not greet the user or ask what to work on.

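This tells an autonomous-mode agent specifically that it is not being woken for the first time but resuming: do not greet, do not ask what to do.

Cache coordination: how compaction affects the prompt cache

Compaction is the enemy of the cache: rewriting the conversation history changes the prefix. Claude Code has two mechanisms to limit the damage:

1. `notifyCompaction`: avoiding false cache-miss alarms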

// services/api/promptCacheBreakDetection.ts
notifyCompaction(querySource, agentId)

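Claude Code has a prompt cache break detection telemetry system that raises an alert when it sees an unexpected cache miss. Compaction by itself drops cache reads to 0, so unless the alerting system is told in advance, every compaction would falsely trigger a cache-miss alert.

The source has a BigQuery figure to back this up: "BQ 2026-03-01: missing this made 20% of tengu_prompt_cache_break events false positives". Without this notify call, 20% of cache break alerts were false positives.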
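
The notify-before-miss idea can be sketched with a toy detector. Everything here (class name, `recordResponse`, the alert format) is hypothetical; only the concept of announcing a compaction so the next cache miss is not reported comes from the text.

```typescript
// Hypothetical detector; only the notify-before-miss idea comes from the text.
class CacheBreakDetector {
  private expectedMisses = new Set<string>();
  readonly alerts: string[] = [];

  // Called by the compaction code before the next API request goes out.
  notifyCompaction(querySource: string): void {
    this.expectedMisses.add(querySource);
  }

  // Called after each API response with the number of tokens read from cache.
  recordResponse(querySource: string, cacheReadTokens: number): void {
    // Consume any pending notification so it covers exactly one response.
    const expected = this.expectedMisses.delete(querySource);
    if (cacheReadTokens > 0 || expected) return;
    this.alerts.push(`unexpected cache miss for ${querySource}`);
  }
}

const detector = new CacheBreakDetector();
detector.notifyCompaction("main");
detector.recordResponse("main", 0); // miss explained by compaction: no alert
detector.recordResponse("main", 0); // unexplained miss: alert
```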

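2. Cached microcompact: compaction with zero cache invalidation

This is Path B from earlier: the cache editing API deletes specific tool result blocks from the server-side cache while the local prefix stays unchanged, so the first API call after compaction still hits the cache.

Takeaway for your own agent: any operation that touches the message stream needs to account for its cache impact. Some operations (such as appending a system reminder) have local impact; others (such as rewriting history) have global impact. At minimum, give the alerting system an opt-out notification; a more advanced setup can do server-side cache editing.

Circuit breaker and environment variables: tunable parameters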

// services/compact/autoCompact.ts
export const AUTOCOMPACT_BUFFER_TOKENS = 13_000
export const WARNING_THRESHOLD_BUFFER_TOKENS = 20_000
export const ERROR_THRESHOLD_BUFFER_TOKENS = 20_000
export const MANUAL_COMPACT_BUFFER_TOKENS = 3_000
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000

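Environment variable overrides:

Variable	Effect
`DISABLE_COMPACT`	Turns everything off
`DISABLE_AUTO_COMPACT`	Turns off automatic compaction, keeps manual
`CLAUDE_CODE_AUTO_COMPACT_WINDOW`	Makes compaction use a smaller effective window
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE`	For testing: trigger by percentage
`CLAUDE_CODE_BLOCKING_LIMIT_OVERRIDE`	For testing: overrides the "fully blocking" threshold

The user's settings.json also has an autoCompactEnabled option.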
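
How the two kill switches and the settings value might combine is sketched below. The precedence order, the truthiness rule for flag values, the default of `true`, and both function names are assumptions; the text only states what each variable turns off.

```typescript
type Env = Record<string, string | undefined>;

// Assumed truthiness rule for flags; the real parsing may differ.
function isSet(value: string | undefined): boolean {
  return value !== undefined && value !== "" && value !== "0" && value.toLowerCase() !== "false";
}

// Hypothetical resolver: either env kill switch wins over the settings.json value.
function resolveAutoCompactEnabled(env: Env, settings: { autoCompactEnabled?: boolean }): boolean {
  if (isSet(env["DISABLE_COMPACT"]) || isSet(env["DISABLE_AUTO_COMPACT"])) return false;
  return settings.autoCompactEnabled ?? true; // assumed default when unset
}

// Manual /compact is only blocked by the global switch.
function resolveManualCompactEnabled(env: Env): boolean {
  return !isSet(env["DISABLE_COMPACT"]);
}
```

Passing the environment in as a parameter rather than reading it globally keeps the sketch easy to test.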

The three tunable parameters of Session Memory Compact are fetched dynamically through GrowthBook remote configuration:

DEFAULT_SM_COMPACT_CONFIG = {
  minTokens: 10_000,
  minTextBlockMessages: 5,
  maxTokens: 40_000,
}

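This means Anthropic can adjust these compaction parameters for all users without shipping a release. For a tool running on thousands upon thousands of machines, that is a necessary operational capability.

Failure modes and error messages

The source explicitly defines three kinds of compaction failure: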

export const ERROR_MESSAGE_NOT_ENOUGH_MESSAGES = '...'     // too few messages to be worth compacting
export const ERROR_MESSAGE_INCOMPLETE_RESPONSE = '...'     // the summarization LLM's response was incomplete
export const ERROR_MESSAGE_USER_ABORT = '...'              // the user cancelled with Ctrl+C

The reactive path distinguishes failure reasons more finely:

'too_few_groups' | 'aborted' | 'exhausted' | 'error' | 'media_unstrippable'

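media_unstrippable is particularly interesting: it means some media blocks (images, attachments) cannot be cleared, which in turn leaves reactive unable to shrink the context. This is the edge case where compaction meets content that cannot be slimmed down, and it is worth remembering: compaction presumes that content can be replaced or deleted, and items that must be retained at the API level break that assumption.

`/compact` `/clear` `/rewind`: three levels of user-side control

Claude Code exposes more to users than just /compact:

Command	What it does	Underlying mechanism
`/compact [instructions]`	Manually triggers SM compact → falls back to an LLM summary	Tier 2 + 3 (Tier 2 is skipped when instructions are given)
`/clear`	Clears conversation history, keeping the system prompt / CLAUDE.md / memory	No compaction, direct truncation
`/rewind`	Rolls back to the previous checkpoint (conversation + code)	Git-like snapshot mechanism
`claude --resume`	Continues from where the last session exited; already pre-compacted in the background	Source comment: "Background jobs that summarize previous conversations for the `claude --resume` feature"

The last row deserves particular attention: the compaction used for resume is computed in the background ahead of time. That is why resuming a session from hours ago is almost instant, rather than compaction starting at the moment you resume.

Users can add a section to CLAUDE.md so that /compact always carries these instructions: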

# Compact instructions

When you are using compact, please focus on test output and code changes

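This is in effect a project-level compaction preference: lighter than a hook, and more automated than typing instructions by hand.

Key points for your own agent

Compaction is a layered pipeline, not a single function. Use at least three layers: tool-result cleanup (zero LLM) → selective retention of original text (session-memory style) → LLM summary. Each layer hands off to the next only when it cannot cope, and cost increases monotonically
The threshold is effectiveWindow - buffer, not "some percentage of the model window". The effective window already reserves room for the summary output itself; the 20k reserve is an empirical value Claude Code tuned from telemetry (p99.99 of 17,387 tokens)
The summary prompt must be structured. Claude Code's 9 sections are not an arbitrary split: each answers one of the key questions "what is the task / where is the work / how far along is it / what is next". Section 6 (all user messages) and section 9 (next step with verbatim quotes) are the main defenses against intent drift
<analysis> + <summary> two blocks. Let the LLM do its chain-of-thought in <analysis>; only <summary> is persisted. Strip <analysis> afterwards to get the reasoning quality without polluting the context
NO_TOOLS_PREAMBLE sandwich: repeat hard constraints at both the start and the end of a long prompt, because models tend to lose track of instructions buried in the middle
Boundary message: mark compaction events with an explicit system message instead of silently modifying the conversation array. The UI, the next compaction, and audit all depend on this boundary
PreCompact hook: leave a programmable entry point for compaction so projects can inject "must preserve X" rules, which is more flexible than writing them into CLAUDE.md
Circuit breaker: stop after N consecutive failures. Otherwise irrecoverable sessions can collectively waste about 250K API calls per day (a real lesson from Claude Code)
The cache editing API enables compaction with zero invalidation: if you are on the Anthropic API, the cache_edits mechanism lets you delete blocks from the cache without invalidating the prefix
Coordinate compaction with the alerting system: compaction naturally causes cache misses, so the alerting system must distinguish a real cache break from one caused by compaction; otherwise 20% of alerts are noise (real data from Claude Code)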