Compaction
The operation that replaces window content with a more compact form — spectrum, triggers, preservation, custom-instruction tuning, design extensions, measurement, and a cross-framework reference
Why Compaction Matters
Long-running agents fill their context windows. Without compaction they either crash at the hard limit or degrade gradually (context rot) as the window grows past the model’s effective working range. Compaction is the operation that replaces some of the window’s content with a more compact representation so the agent can keep working.
Every other technique in this section — caching, just-in-time loading, sub-agent isolation — reduces how much context enters the window. Compaction is what you do when it enters anyway.
Design Decisions at a Glance
Seven decisions shape a compaction design, roughly in this order:
- Do you need compaction? — usage profile. Short tasks often don’t.
- Which tier(s) of the spectrum? → § The Compression Spectrum
- Trigger strategy — fixed threshold, autonomous, task-boundary, or hybrid? → § Trigger Strategies
- Preservation policy — what survives verbatim, what gets compacted, what drops? → § Preservation Policy
- Custom instructions — how to encode the policy as prompt text. → § Tuning With Custom Instructions
- Situation-specific extensions — multi-agent, caching, recovery? → § Design Extensions
- Measurement — how do you verify it works? → § Measuring Compaction
Read the rest of this page in order; each section builds on the previous.
The Compression Spectrum
Compaction is not one operation — it is a spectrum of techniques with sharply different cost-fidelity tradeoffs. Pick the lightest one that works; escalate only when it doesn’t.
| Technique | What it does | Fidelity loss | Compute cost |
|---|---|---|---|
| Tool result clearing | Drop content of already-consumed tool outputs | Lossless if the agent moved past it | None |
| Tool result truncation | Cap tool output at N chars; keep head + tail | Lossy — middle detail gone | None |
| Round-level replacement | Replace older turns with a pre-computed summary | Lossy — summary captures gist | Paid once, at write |
| Full-conversation summarization | LLM re-reads whole conversation, writes a new starting point | Most lossy — narrative end-to-end compressed | Full LLM call |
Which Tier When
Which tiers you enable depends on how long and how tool-heavy your agent is:
- Short conversations, modest tool use — tier 1 (clearing) alone is often enough.
- Long conversations, moderate tool use — tiers 1 + 2 (clearing + truncation).
- Long conversations, dense tool output — tiers 1 + 2 + 3 (add round-level replacement).
- Very long sessions that repeatedly approach the context limit — all four tiers, with full summarization as last resort.
Starting with the lightest tiers and escalating when needed is cheaper than always running full summarization. But don’t wait until the last moment: heavier tiers need room to work, and triggering at 98% leaves little slack for the summarizer itself.
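A minimal sketch of this escalation, with toy tier implementations over a list of message dicts. The `role`/`content` shape and the character-based budget are illustrative assumptions, not any framework's API:

```python
def clear_consumed(history):
    # Tier 1: drop bodies of tool results the agent has already moved past
    # (here: every tool result except the most recent one).
    tool_indices = [i for i, m in enumerate(history) if m["role"] == "tool"]
    for i in tool_indices[:-1]:
        history[i] = {**history[i], "content": "[cleared]"}
    return history

def truncate(history, cap=200):
    # Tier 2: cap remaining tool output, keeping head and tail.
    out = []
    for m in history:
        content = m["content"]
        if m["role"] == "tool" and len(content) > cap:
            m = {**m, "content": content[: cap // 2] + " ... " + content[-cap // 2:]}
        out.append(m)
    return out

def compact(history, window_chars, target=0.6):
    # Escalate lightest-first; stop as soon as usage is back under target.
    for tier in (clear_consumed, truncate):
        if sum(len(m["content"]) for m in history) / window_chars <= target:
            break
        history = tier(history)
    return history
```

Tiers 3 and 4 would slot into the same loop as further entries, each paying more compute for more reclaimed space.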
Selective vs Uniform Truncation
Truncating every tool output at the same character count is cheap but wasteful. A 50-line file read can safely be kept verbatim; a 50,000-line database dump needs aggressive trimming. Tune per tool: let each tool declare its own truncation policy (keep head, keep tail, keep summary).
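One way to express per-tool policies; the tool names and caps below are illustrative, not taken from any real framework:

```python
# Per-tool truncation policies instead of one uniform cap.
POLICIES = {
    "read_file":  {"cap": 8000, "keep": "head"},  # short reads usually fit whole
    "run_query":  {"cap": 1000, "keep": "tail"},  # errors surface at the end
    "web_search": {"cap": 2000, "keep": "head"},
}
DEFAULT_POLICY = {"cap": 2000, "keep": "head"}

def truncate_output(tool_name, text):
    policy = POLICIES.get(tool_name, DEFAULT_POLICY)
    cap = policy["cap"]
    if len(text) <= cap:
        return text  # small outputs pass through verbatim
    if policy["keep"] == "tail":
        return "[truncated] ... " + text[-cap:]
    return text[:cap] + " ... [truncated]"
```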
The Recall-First, Precision-Second Recipe
> Start by maximizing recall — ensure the compaction prompt captures every relevant piece of information. Then iterate to improve precision — eliminate redundant tool outputs and messages.
>
> — Effective Context Engineering for AI Agents, Anthropic, 2026
Build your compaction so it errs on the side of preserving too much. Once you see what the agent actually re-reads post-compaction, trim. Doing this in reverse — starting tight, loosening only when you notice loss — fails because you won’t notice the loss until a user complains weeks later.
Trigger Strategies
The hardest question is not how to compress but when. Four strategies cover the space.
Fixed Threshold
Compaction fires when total tokens cross a fixed ratio of the context window (commonly 60–85%) or an absolute value (Anthropic’s API defaults to 150,000 tokens, minimum 50,000).
- Strengths: predictable, easy to reason about, no model judgment involved.
- Weaknesses: may trigger at a bad moment — mid-reasoning, mid-tool-chain, or right when the agent was about to finish. The agent loses continuity because the trigger was token-count-based, not task-aware.
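As a sketch, combining a ratio with an absolute floor (the 75% ratio here is a stand-in from the common range above, not any API's default):

```python
def should_compact(total_tokens, window_size, ratio=0.75, floor=50_000):
    # Fire when usage crosses the ratio, but never below an absolute floor,
    # so small contexts are not compacted prematurely.
    return total_tokens >= max(floor, int(window_size * ratio))
```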
Autonomous Triggering
The agent decides when to compress. LangChain’s Deep Agents and Claude Code both offer this shape. The agent typically fires compaction at:
- Task transitions — after completing a sub-goal
- Post-extraction — right after extracting results from a long document or tool output
- Pre-ingestion — before pulling in a large new context
- Multi-step boundaries — before starting a refactor, migration, or analysis
- Strengths: compression happens at moments where little state is lost. The agent picks the gap between logical units of work.
- Weaknesses: model judgment varies. Agents can under-compress (wait too long, hit the ceiling) or over-compress (compress so often the summary-of-summary degrades). Needs tuning.
Task-Boundary Compaction
The harness, not the agent or a ratio, declares compaction points. Each workflow stage ends with compaction; the summary becomes the input to the next stage. Pipelines, multi-agent handoffs, and structured workflows use this.
- Strengths: compaction is part of the architecture, not an emergency response. No surprise triggers, no judgment calls, clean seams between stages.
- Weaknesses: only works if the workflow has natural seams. Open-ended agent loops don’t.
None (Crash at the Limit)
Worth naming because many early agents shipped this way — no compaction; the context fills until the model errors out or silently truncates. The only mitigation is “keep conversations short”. Acceptable for short-lived agents; unacceptable for anything that runs past a few turns.
Hybrid Is the Production Default
Most production systems combine fixed threshold as safety net + autonomous or task-boundary as primary. The threshold catches cases where the primary didn’t fire in time; the primary avoids the worst-moment triggers that fixed thresholds are famous for.
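A sketch of the hybrid: a soft threshold that fires only at good moments, backed by a hard threshold that always fires. The boundary signals are hypothetical inputs your harness would supply:

```python
def compaction_due(total_tokens, window_size,
                   at_task_boundary=False, agent_requested=False,
                   soft=0.60, hard=0.85):
    usage = total_tokens / window_size
    if usage >= hard:
        return True                # safety net: fire regardless of timing
    if usage >= soft:
        # primary: fire only at a clean seam the agent or harness identified
        return at_task_boundary or agent_requested
    return False
```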
Preservation Policy
Once you know which tier and when it fires, the substantive design question is: what survives? This decision affects agent coherence post-compaction more than any other choice.
Three tiers of preservation:
Always Preserve Verbatim
- The user’s current-task turn (most recent N rounds, typically 3–10)
- Currently-open files with their latest state
- In-flight tool calls that haven’t returned yet
- The system prompt and memory index — these aren’t part of the compacted region, but naming them reminds you not to accidentally compact them
Preserve as Compact References
- Decisions that constrain future behavior — keep the decision, drop the deliberation
- Files that were read but not modified — keep the paths, not the content
- User intent — the original ask and key clarifications, compressed but prominent
- Architectural commitments — technology choices, interface contracts
- Unresolved issues — bugs found but not fixed, open questions
Drop Freely
- Tool output that was read and synthesized into a decision
- Agent’s own intermediate reasoning that led to a committed decision
- Exploration branches that were abandoned
- Routine acknowledgments — “OK, let me check…”
- Superseded decisions — older versions of a plan overwritten by newer ones
Where the Line Sits
The boundary between “always preserve” and “compact reference” is where most compaction bugs live. A safe heuristic:
- Anything the user would visibly notice the loss of → verbatim tier
- Anything the agent needs to remember but can reconstruct if needed → compact reference
- Everything else → droppable
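A sketch of the three tiers as one sorting pass; the `kind` and `summary` fields are hypothetical annotations your harness would need to maintain on each message:

```python
def apply_policy(history, keep_recent=6):
    # Split history into the verbatim tier (recent rounds, kept unchanged)
    # and the compact-reference tier (one-liners); everything else drops.
    cutoff = max(0, len(history) - keep_recent)
    verbatim, references = [], []
    for i, msg in enumerate(history):
        if i >= cutoff:
            verbatim.append(msg)
        elif msg.get("kind") in {"decision", "user_intent", "open_issue"}:
            references.append(msg["summary"])
        # droppable tier: synthesized tool output, abandoned branches, etc.
    return references, verbatim
```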
Tuning With Custom Instructions
Your preservation policy is what survives. Custom instructions are the how — the actual prompt text that encodes the policy for the summarizer LLM. Default compaction prompts optimize for a generic conversation and often get your use case wrong: dropping architectural decisions, keeping routine tool exchanges, forgetting user-stated preferences.
Every serious compaction system exposes a way to replace or augment the default prompt:
- Anthropic API — `instructions` parameter completely replaces the default prompt
- Claude Code — `/compact "focus on the recent database migration"` appends user guidance
- LangChain Deep Agents — middleware-level configuration of the summarization prompt
- OpenAI Agents SDK — custom summarizer function in the session configuration
Writing Good Custom Instructions
Translate each preservation tier from the previous section into prompt text:
- Verbatim tier — list explicitly: “Preserve the original user request verbatim. Keep the last N rounds unchanged. Preserve current file-edit state.”
- Compact-reference tier — describe the compression: “Summarize architectural decisions as one sentence each. List unresolved issues with one line per issue. Keep file paths even when content is compressed away.”
- Drop tier — state what to discard: “Omit routine acknowledgments, abandoned exploration branches, and tool outputs that have been synthesized into committed decisions.”
Keep the prompt short and prescriptive. Long summarization prompts with many nested rules tend to confuse smaller summarizer models; a clean bullet list of what-to-keep and what-to-drop performs better than paragraphs of guidance.
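Put together, a policy encoded this way might read as follows (illustrative wording, not any framework's default prompt):

```python
COMPACTION_INSTRUCTIONS = """\
Summarize the conversation so the agent can continue the current task.

KEEP VERBATIM:
- The original user request and key clarifications
- The last 5 rounds, unchanged
- Current file-edit state

COMPRESS TO ONE LINE EACH:
- Architectural decisions (the decision, not the deliberation)
- Unresolved issues and open questions
- Paths of files read but not modified

OMIT:
- Routine acknowledgments
- Abandoned exploration branches
- Tool outputs already synthesized into committed decisions
"""
```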
Verify With Replay
Before trusting a custom compaction prompt in production, test it by replay: take a known-good long task, force compaction at varying points, continue execution from the compacted state, and check whether the continued work matches the uncompacted baseline. Failures here are the signal that the prompt is dropping something it shouldn’t.
This is the same loop formalized in § Measuring Compaction — mentioned here because it is the verification step for every change you make to custom instructions, preservation policy, or trigger strategy.
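The replay loop can be sketched as below; every hook (`run_agent`, `compact_at`, `task_succeeded`) is a hypothetical harness function you would supply:

```python
def replay_check(task, compaction_points, run_agent, compact_at, task_succeeded):
    # Force compaction at several points in a known-good task and flag any
    # point where the continued run no longer succeeds.
    baseline = run_agent(task, resume_from=None)   # uncompacted reference run
    failures = []
    for point in compaction_points:
        compacted = compact_at(baseline, point)    # force compaction here
        resumed = run_agent(task, resume_from=compacted)
        if not task_succeeded(resumed):
            failures.append(point)  # the prompt dropped something needed
    return failures
```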
Design Extensions
The basic design (spectrum + triggers + preservation + custom instructions) handles single-agent setups without caching or long-horizon recovery requirements. The three extensions below apply when your situation has the specific conditions named — not every agent needs them.
Multi-Agent Coordination
Applies when: multiple agents share context (common conversation, shared state) or their compaction decisions affect each other.
Each agent has its own context window. Compaction design has to decide: does each agent compress independently, or is there a coordinator?
Distributed compaction (common default) — each agent compacts independently. Simple, no coordinator needed. Works well when agents have mostly independent contexts (main delegates to sub-agent, sub-agent returns summary).
- Downside: if agents share substantial context, each duplicates the compression work and may compress to slightly different summaries, causing drift.
Centralized compaction (AutoGen’s pattern) — one coordinator compresses the shared conversation and broadcasts the result. AutoGen’s CompressibleGroupManager is the published example.
- Upside: single source of truth. All agents agree on what was said and what it means.
- Downside: requires a coordinator role; becomes a bottleneck under load.
Guidance: independent agents → distributed. Shared conversation (collaborative editing, shared thread) → centralized. Parent-child with clean handoff → distributed (parent sees only the child’s summary anyway).
Prompt Cache Integration
Applies when: you’re using prompt caching (you probably are — caching is typically 10× cheaper on hits).
Compaction and caching interact subtly. Caches hit only on a stable prefix. Every compaction event replaces content, which can invalidate prefixes.
Three patterns to respect:
- Protect the cached head. The most stable content (tools, system prompt, durable examples) lives before the compacted region. Compaction replaces history after the head, not the head itself.
- Cache the summary itself. Anthropic’s API lets you place `cache_control` on the compaction block — the summary text gets cached on write, so the next call reads it cheaply.
- Don’t write compaction output before the cached region. If the implementation prepends compaction output as “new context”, every compaction event breaks caching. Compaction must write after the cached prefix.
Resumability and Re-Compaction
Applies when: your agent runs long enough that failure recovery matters, or a single conversation may be compacted multiple times.
Compaction is lossy. If execution fails after compaction but before task completion, can you recover?
Three mechanisms:
- Pre-compaction logging — before compaction fires, log the full pre-compaction state to durable storage. On failure, reload and try a different strategy.
- Compaction checkpoints — the compaction output includes a reference to the pre-compacted state. Resumption loads the summary but keeps the reference for re-expansion if needed.
- Parallel recovery channels — an independent, always-appended artifact (memory, notes, auto-memory) captures key decisions outside the compacted region. Claude Code’s Auto Memory is an instance.
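The first two mechanisms can be sketched together — full state to durable storage, a reference kept in the compacted output. Local JSON files stand in for real durable storage here:

```python
import json
import pathlib
import uuid

CHECKPOINT_DIR = pathlib.Path("checkpoints")

def checkpoint_then_compact(history, summarize):
    # Pre-compaction logging: persist the full state before it is lost.
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    ref = str(uuid.uuid4())
    (CHECKPOINT_DIR / f"{ref}.json").write_text(json.dumps(history))
    # Compaction checkpoint: the output carries a pointer for re-expansion.
    return {"summary": summarize(history), "checkpoint_ref": ref}

def recover(ref):
    # On failure, reload the pre-compaction state and retry differently.
    return json.loads((CHECKPOINT_DIR / f"{ref}.json").read_text())
```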
Designing for re-compaction: at some point the compacted summary itself will be re-compacted (summary-of-summary). Each pass loses fidelity. Anticipate:
- Cap the compaction count on any given conversation; beyond N, start a new session with explicit hand-off.
- Preserve a “core identity” region that never gets re-summarized — user intent, architectural decisions.
- Weight quality metrics differently for older, multi-compressed summaries.
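These precautions can be sketched as a guarded re-compaction step; the state structure and field names are assumptions for illustration:

```python
MAX_PASSES = 3  # beyond this, start a new session with explicit hand-off

def recompact(state, summarize):
    if state["passes"] >= MAX_PASSES:
        raise RuntimeError("re-compaction cap reached: hand off to a new session")
    return {
        "core": state["core"],           # never re-summarized: user intent,
                                         # architectural decisions
        "summary": summarize(state["summary"]),
        "passes": state["passes"] + 1,   # lets metrics down-weight old summaries
    }
```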
Measuring Compaction
Compaction is high-risk: lossy operations on long-running state. It rewards careful measurement more than most context-engineering topics.
Five signals to track:
| Signal | What it tells you |
|---|---|
| Compression ratio | Input tokens / summary tokens. Too high = over-compression. Too low = wasted work. |
| Trigger precision | Fraction of compactions that were actually needed. Low = triggering too early. |
| Post-compaction regression | Force compaction mid-task and replay. Tasks that fail only post-compaction identify what the compactor should preserve. |
| Summary-of-summary degradation | How does fidelity drop on the Nth re-compaction? If steep, cap the re-compaction count. |
| Compaction cost amortization | Summary call cost ÷ turns-until-next-compaction. Helps decide whether each tier is worth its compute. |
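The first and last rows of the table reduce to two one-line computations:

```python
def compression_ratio(input_tokens, summary_tokens):
    # e.g. 40,000 tokens compressed to 2,000 gives a ratio of 20
    return input_tokens / summary_tokens

def amortized_cost(summary_call_cost, turns_until_next_compaction):
    # cost of the summarization call spread over the turns it bought
    return summary_call_cost / turns_until_next_compaction
```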
Synthetic Regression Suite
The single most valuable test: keep 10–20 long tasks that historically succeeded. Force compaction at fixed and variable points. Replay them. Flag any regression.
Run this suite whenever you change compaction logic — custom instructions, trigger strategy, preservation policy. It catches more problems than any other measurement approach.
Cross-Framework Reference
A survey of how the major frameworks expose compaction. Useful for calibration against industry practice; not a prescription.
| Framework | Trigger style | Preservation | Customization | Multi-agent | Recovery |
|---|---|---|---|---|---|
| Anthropic API | Fixed threshold (configurable) | Automatic or manual pause_after | instructions parameter | Per-conversation | compaction block + cache_control |
| Claude Code | Auto + /compact command | Recent turns + auto memory | /compact "focus on ..." | Per-session | Auto Memory parallel channel |
| LangChain Deep Agents | Autonomous (model decides) | Recent 10% + middleware rules | Summarization middleware | Per-agent | Virtual filesystem of history |
| OpenAI Agents SDK | Trim (drop) or summarize | Last N turns verbatim | Custom summarizer function | Per-session | Session store |
| CrewAI | respect_context_window | summarize_messages() chunks | Limited | Per-agent | Shared memory class |
| AutoGen | Per-manager | Shared conversation compression | Group manager configuration | Centralized (unique) | Delegated to coordinator |
Design Bets Each Framework Represents
No framework is “best” — they reflect different design choices:
- AutoGen bets on coordinated multi-agent — worth it when agents share context heavily
- LangChain bets on agent judgment — worth it for autonomous agents with variable workloads
- OpenAI bets on simplicity — two options (trim / summarize), clear mental model
- Anthropic bets on configurable server-side primitives — infrastructure, not policy
- CrewAI bets on one-setting simplicity — opinionated defaults, no choice paralysis
- Claude Code bets on developer-in-the-loop — `/compact` with user instructions, less autonomy
If you’re choosing a framework with compaction in mind, name which bet matches your agent shape first; the rest follows.
Related Reading
- ← Overview — Return to the section hub.
- Context Management — The rest of the runtime discipline: attention budgets, just-in-time context, sub-agent isolation, checkpointing. Compaction fires when those techniques hit their limits.
- Memory Design — Structured notes and memory complement compaction: what persists outside the window doesn’t need to be compressed, only retrieved.
- From Case to Paradigm → Part 2 — “Where the Method Stops Scaling” names when compaction becomes a design-essential rather than an emergency response.
Sources
- Compaction — Anthropic, Claude API docs (beta `compact-2026-01-12`)
- Automatic context compaction — Anthropic, Claude Cookbook
- Autonomous Context Compression — LangChain, 2026
- Context Engineering for Deep Agents — LangChain Docs
- Context Engineering — Short-Term Memory Management with Sessions — OpenAI Agents SDK Cookbook
- Memory — CrewAI Concepts
- Memory and RAG — AutoGen
- Effective Context Engineering for AI Agents — Anthropic, 2026