Compaction

The operation that replaces window content with a more compact form — spectrum, triggers, preservation, custom-instruction tuning, design extensions, measurement, and a cross-framework reference

Why Compaction Matters

Long-running agents fill their context windows. Without compaction they either crash at the hard limit or degrade gradually (context rot) as the window grows past the model’s effective working range. Compaction is the operation that replaces some of the window’s content with a more compact representation so the agent can keep working.

Every other technique in this section — caching, just-in-time loading, sub-agent isolation — reduces how much context enters the window. Compaction is what you do when it enters anyway.

Design Decisions at a Glance

Seven decisions shape a compaction design, roughly in this order:

  1. Do you need compaction? — usage profile. Short tasks often don’t.
  2. Which tier(s) of the spectrum? → § The Compression Spectrum
  3. Trigger strategy — fixed threshold, autonomous, task-boundary, or hybrid? → § Trigger Strategies
  4. Preservation policy — what survives verbatim, what gets compacted, what drops? → § Preservation Policy
  5. Custom instructions — how to encode the policy as prompt text. → § Tuning With Custom Instructions
  6. Situation-specific extensions — multi-agent, caching, recovery? → § Design Extensions
  7. Measurement — how do you verify it works? → § Measuring Compaction

Read the rest of this page in order; each section builds on the previous.


The Compression Spectrum

Compaction is not one operation — it is a spectrum of techniques with sharply different cost-fidelity tradeoffs. Pick the lightest one that works; escalate only when it doesn’t.

[Figure: The Compaction Spectrum — four techniques (① tool result clearing, ② tool result truncation, ③ round-level replacement, ④ full-conversation summarization) ordered by increasing fidelity loss and compute cost; use the lightest one that works, treating each next tier as a last resort. The table below repeats the figure's content.]

| Technique | What it does | Fidelity loss | Compute cost |
| --- | --- | --- | --- |
| Tool result clearing | Drop content of already-consumed tool outputs | Lossless if the agent moved past it | None |
| Tool result truncation | Cap tool output at N chars; keep head + tail | Lossy — middle detail gone | None |
| Round-level replacement | Replace older turns with a pre-computed summary | Lossy — summary captures gist | Paid once, at write |
| Full-conversation summarization | LLM re-reads whole conversation, writes a new starting point | Most lossy — narrative end-to-end compressed | Full LLM call |

Which Tier When

Which tiers you enable depends on how long and how tool-heavy your agent is:

  • Short conversations, modest tool use — tier 1 (clearing) alone is often enough.
  • Long conversations, moderate tool use — tiers 1 + 2 (clearing + truncation).
  • Long conversations, dense tool output — tiers 1 + 2 + 3 (add round-level replacement).
  • Very long sessions that repeatedly approach the context limit — all four tiers, with full summarization as last resort.

Starting with the lightest tiers and escalating when needed is cheaper than always running full summarization. But don’t wait until the last moment: heavier tiers need room to work, and triggering at 98% leaves little slack for the summarizer itself.
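The escalation order can be sketched as a dispatcher that tries the lightest tier first and stops as soon as the window fits. This is an illustrative sketch, not any framework's real API — the `Message` type, the tier functions, and the 4-chars-per-token estimate are all assumptions.

```python
# Hypothetical sketch: apply compaction tiers from lightest to heaviest
# until the window fits the budget. Names and heuristics are illustrative.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Message:
    role: str
    content: str
    consumed: bool = False   # True once the agent has acted on this output

def tokens(messages):
    # Crude stand-in for a real tokenizer: ~4 characters per token.
    return sum(len(m.content) for m in messages) // 4

def clear_consumed_tool_results(messages):          # tier 1: lossless-ish
    return [replace(m, content="[cleared]") if m.role == "tool" and m.consumed
            else m for m in messages]

def truncate_tool_results(messages, cap=200):       # tier 2: keep head + tail
    out = []
    for m in messages:
        if m.role == "tool" and len(m.content) > cap:
            half = cap // 2
            m = replace(m, content=m.content[:half] + " …[truncated]… " + m.content[-half:])
        out.append(m)
    return out

def compact(messages, budget, tiers):
    """Apply tiers in order; stop as soon as the window fits the budget."""
    for tier in tiers:
        if tokens(messages) <= budget:
            break
        messages = tier(messages)
    return messages
```

Tiers 3 and 4 would slot into the same `tiers` list as summarization callables, so the dispatcher stays unchanged as you enable heavier tiers.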

Selective vs Uniform Truncation

Truncating every tool output at the same character count is cheap but wasteful. A 50-line file read can safely be kept verbatim; a 50,000-line database dump needs aggressive trimming. Tune per tool: let each tool declare its own truncation policy (keep head, keep tail, keep summary).
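A per-tool policy table is one way to implement this. The tool names, caps, and strategies below are illustrative assumptions, not defaults any framework ships:

```python
# Hypothetical sketch of per-tool truncation policies.
TRUNCATION_POLICIES = {
    # tool name -> (max_chars, strategy)
    "read_file":  (4000, "keep_head"),   # early lines usually matter most
    "run_query":  (1000, "keep_edges"),  # schema header + final rows
    "web_search": (2000, "keep_head"),
}
DEFAULT_POLICY = (2000, "keep_edges")

def truncate_output(tool_name: str, output: str) -> str:
    cap, strategy = TRUNCATION_POLICIES.get(tool_name, DEFAULT_POLICY)
    if len(output) <= cap:
        return output
    if strategy == "keep_head":
        return output[:cap] + "\n…[truncated]"
    # keep_edges: preserve the start and the end, drop the middle
    half = cap // 2
    return output[:half] + "\n…[middle truncated]…\n" + output[-half:]
```

In practice the cleanest home for each policy is the tool definition itself, so adding a tool means declaring its truncation behavior alongside its schema.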

The Recall-First, Precision-Second Recipe

“Start by maximizing recall — ensure the compaction prompt captures every relevant piece of information. Then iterate to improve precision — eliminate redundant tool outputs and messages.” — Effective Context Engineering for AI Agents, Anthropic, 2026

Build your compaction so it errs on the side of preserving too much. Once you see what the agent actually re-reads post-compaction, trim. Doing this in reverse — starting tight, loosening only when you notice loss — fails because you won’t notice the loss until a user complains weeks later.


Trigger Strategies

The hardest question is not how to compress but when. Four strategies cover the space.

Fixed Threshold

Compaction fires when total tokens cross a fixed ratio of the context window (commonly 60–85%) or an absolute value (Anthropic’s API defaults to 150,000 tokens, minimum 50,000).

  • Strengths: predictable, easy to reason about, no model judgment involved.
  • Weaknesses: may trigger at a bad moment — mid-reasoning, mid-tool-chain, or right when the agent was about to finish. The agent loses continuity because the trigger was token-count-based, not task-aware.

Autonomous Triggering

The agent decides when to compress. LangChain’s Deep Agents and Claude Code both offer this shape. The agent typically fires compaction at:

  • Task transitions — after completing a sub-goal
  • Post-extraction — right after extracting results from a long document or tool output
  • Pre-ingestion — before pulling in a large new context
  • Multi-step boundaries — before starting a refactor, migration, or analysis

  • Strengths: compression happens at moments where little state is lost. The agent picks the gap between logical units of work.
  • Weaknesses: model judgment varies. Agents can under-compress (wait too long, hit the ceiling) or over-compress (compress so often the summary-of-summary degrades). Needs tuning.

Task-Boundary Compaction

The harness, not the agent or a ratio, declares compaction points. Each workflow stage ends with compaction; the summary becomes the input to the next stage. Pipelines, multi-agent handoffs, and structured workflows use this.

  • Strengths: compaction is part of the architecture, not an emergency response. No surprise triggers, no judgment calls, clean seams between stages.
  • Weaknesses: only works if the workflow has natural seams. Open-ended agent loops don’t.
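A staged pipeline with compaction at the seams can be sketched in a few lines. Everything here is illustrative: `summarize` stands in for an LLM call, and each stage is any callable that takes the previous stage's summary and returns its full transcript.

```python
# Hypothetical sketch: each workflow stage ends with a compaction step
# whose summary becomes the input to the next stage.
def summarize(transcript: list[str]) -> str:
    # Stand-in for an LLM summarization call.
    return f"[summary of {len(transcript)} messages]"

def run_pipeline(stages, initial_context: str) -> str:
    context = initial_context
    for stage in stages:
        transcript = stage(context)        # stage returns its full transcript
        context = summarize(transcript)    # compaction at the stage seam
    return context
```

Because compaction is structural here, no stage ever sees another stage's raw transcript — only the summary handed across the seam.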

None (Crash at the Limit)

Worth naming because many early agents worked this way — no compaction: the context fills, and the model errors out or silently truncates. The only mitigation is “keep conversations short”. Acceptable for short-lived agents; unacceptable for anything that runs past a few turns.

Hybrid Is the Production Default

Most production systems combine fixed threshold as safety net + autonomous or task-boundary as primary. The threshold catches cases where the primary didn’t fire in time; the primary avoids the worst-moment triggers that fixed thresholds are famous for.
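The hybrid can be expressed as a single predicate. The ratios below are illustrative, not recommendations, and the task-boundary signal is assumed to come from your harness or the agent:

```python
# Hypothetical sketch of the hybrid trigger: a primary signal (task
# boundaries here) plus a fixed-threshold safety net.
def should_compact(used_tokens: int,
                   window: int,
                   at_task_boundary: bool,
                   primary_ratio: float = 0.60,
                   safety_ratio: float = 0.85) -> bool:
    usage = used_tokens / window
    if usage >= safety_ratio:          # safety net: fire regardless of timing
        return True
    # primary: only fire at clean seams, once usage justifies the cost
    return at_task_boundary and usage >= primary_ratio
```

The `primary_ratio` floor stops the agent from compacting at every seam; the `safety_ratio` ceiling stops it from riding a long seam-less stretch into the hard limit.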


Preservation Policy

Once you know which tier and when it fires, the substantive design question is: what survives? This decision affects agent coherence post-compaction more than any other choice.

Three tiers of preservation:

Always Preserve Verbatim

  • The user’s current-task turn (most recent N rounds, typically 3–10)
  • Currently-open files with their latest state
  • In-flight tool calls that haven’t returned yet
  • The system prompt and memory index — these aren’t part of the compacted region, but naming them reminds you not to accidentally compact them

Preserve as Compact References

  • Decisions that constrain future behavior — keep the decision, drop the deliberation
  • Files that were read but not modified — keep the paths, not the content
  • User intent — the original ask and key clarifications, compressed but prominent
  • Architectural commitments — technology choices, interface contracts
  • Unresolved issues — bugs found but not fixed, open questions

Drop Freely

  • Tool output that was read and synthesized into a decision
  • Agent’s own intermediate reasoning that led to a committed decision
  • Exploration branches that were abandoned
  • Routine acknowledgments — “OK, let me check…”
  • Superseded decisions — older versions of a plan overwritten by newer ones

Where the Line Sits

The boundary between “always preserve” and “compact reference” is where most compaction bugs live. A safe heuristic:

  • Anything the user would visibly notice the loss of → verbatim tier
  • Anything the agent needs to remember but can reconstruct if needed → compact reference
  • Everything else → droppable
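The three-tier policy can be sketched as a partition over the transcript. The classification rules below are illustrative assumptions — a real policy would be richer (decisions, file state, in-flight calls) — but the shape is the point: every message lands in exactly one tier.

```python
# Hypothetical sketch: partition a transcript into the three preservation
# tiers. The rules are illustrative, not a policy any framework ships.
ROUTINE = ("ok, let me check", "sure,", "got it")

def classify(msg: dict, recent: bool) -> str:
    """Return 'verbatim', 'reference', or 'drop' for one message."""
    if recent:
        return "verbatim"                      # current-task turns survive intact
    text = msg["content"].lower()
    if msg["role"] == "tool" and msg.get("consumed"):
        return "drop"                          # synthesized into a decision already
    if any(text.startswith(p) for p in ROUTINE):
        return "drop"                          # routine acknowledgments
    return "reference"                         # compress, keep the gist

def partition(messages: list[dict], keep_last: int = 3):
    tiers = {"verbatim": [], "reference": [], "drop": []}
    cutoff = len(messages) - keep_last
    for i, msg in enumerate(messages):
        tiers[classify(msg, recent=i >= cutoff)].append(msg)
    return tiers
```

The "reference" bucket is then what the summarizer LLM actually receives; "verbatim" bypasses it, and "drop" never reaches it.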

Tuning With Custom Instructions

Your preservation policy is what survives. Custom instructions are the how — the actual prompt text that encodes the policy for the summarizer LLM. Default compaction prompts optimize for a generic conversation and often get your use case wrong: dropping architectural decisions, keeping routine tool exchanges, forgetting user-stated preferences.

Every serious compaction system exposes a way to replace or augment the default prompt:

  • Anthropic API — instructions parameter completely replaces the default prompt
  • Claude Code — /compact "focus on the recent database migration" appends user guidance
  • LangChain Deep Agents — middleware-level configuration of the summarization prompt
  • OpenAI Agents SDK — custom summarizer function in the session configuration

Writing Good Custom Instructions

Translate each preservation tier from the previous section into prompt text:

  • Verbatim tier — list explicitly: “Preserve the original user request verbatim. Keep the last N rounds unchanged. Preserve current file-edit state.”
  • Compact-reference tier — describe the compression: “Summarize architectural decisions as one sentence each. List unresolved issues with one line per issue. Keep file paths even when content is compressed away.”
  • Drop tier — state what to discard: “Omit routine acknowledgments, abandoned exploration branches, and tool outputs that have been synthesized into committed decisions.”

Keep the prompt short and prescriptive. Long summarization prompts with many nested rules tend to confuse smaller summarizer models; a clean bullet list of what-to-keep and what-to-drop performs better than paragraphs of guidance.
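One way to keep the prompt short and prescriptive is to generate it from the three tiers directly, so the prompt and the policy can't drift apart. The helper and wording below are illustrative assumptions:

```python
# Hypothetical sketch: build a compaction prompt from the three
# preservation tiers as short, prescriptive bullets.
def build_compaction_prompt(verbatim, references, drops) -> str:
    lines = ["Summarize the conversation. Follow these rules exactly:", ""]
    lines += [f"- KEEP VERBATIM: {item}" for item in verbatim]
    lines += [f"- KEEP AS ONE LINE: {item}" for item in references]
    lines += [f"- OMIT: {item}" for item in drops]
    return "\n".join(lines)

prompt = build_compaction_prompt(
    verbatim=["the original user request", "the last 3 rounds"],
    references=["each architectural decision", "each unresolved issue",
                "file paths for files read but not modified"],
    drops=["routine acknowledgments", "abandoned exploration branches"],
)
```

The resulting bullet list is the whole prompt — no nested rules for a smaller summarizer model to misread.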

Verify With Replay

Before trusting a custom compaction prompt in production, test it by replay: take a known-good long task, force compaction at varying points, continue execution from the compacted state, and check whether the continued work matches the uncompacted baseline. Failures here are the signal that the prompt is dropping something it shouldn’t.

This is the same loop formalized in § Measuring Compaction — mentioned here because it is the verification step for every change you make to custom instructions, preservation policy, or trigger strategy.


Design Extensions

The basic design (spectrum + triggers + preservation + custom instructions) handles single-agent setups without caching or long-horizon recovery requirements. The three extensions below apply when your situation has the specific conditions named — not every agent needs them.

Multi-Agent Coordination

Applies when: multiple agents share context (common conversation, shared state) or their compaction decisions affect each other.

Each agent has its own context window. Compaction design has to decide: does each agent compress independently, or is there a coordinator?

Distributed compaction (common default) — each agent compacts independently. Simple, no coordinator needed. Works well when agents have mostly independent contexts (main delegates to sub-agent, sub-agent returns summary).

  • Downside: if agents share substantial context, each duplicates the compression work and may compress to slightly different summaries, causing drift.

Centralized compaction (AutoGen’s pattern) — one coordinator compresses the shared conversation and broadcasts. AutoGen’s CompressibleGroupManager is the published example.

  • Upside: single source of truth. All agents agree on what was said and what it means.
  • Downside: requires a coordinator role; becomes a bottleneck under load.

Guidance: independent agents → distributed. Shared conversation (collaborative editing, shared thread) → centralized. Parent-child with clean handoff → distributed (parent sees only the child’s summary anyway).

Prompt Cache Integration

Applies when: you’re using prompt caching (you probably are — cached reads are typically ~10× cheaper than uncached input).

Compaction and caching interact subtly. Caches hit only on a stable prefix. Every compaction event replaces content, which can invalidate prefixes.

Three patterns to respect:

  1. Protect the cached head. The most stable content (tools, system prompt, durable examples) lives before the compacted region. Compaction replaces history after the head, not the head itself.
  2. Cache the summary itself. Anthropic’s API lets you place cache_control on the compaction block — the summary text gets cached on write, so the next call reads it cheaply.
  3. Don’t write compaction output before the cached region. If the implementation prepends compaction output as “new context”, every compaction event breaks caching. Compaction must write after the cached prefix.
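Patterns 1 and 3 reduce to an ordering invariant: the stable head comes first and is never rewritten, and the summary is always written after it. A minimal sketch, with all names assumed:

```python
# Hypothetical sketch: assemble the window so compaction never touches
# the cached prefix. `window` is {'head': [...], 'summary': str | None,
# 'recent': [...]}; `summarize` stands in for an LLM call.
def assemble(window: dict) -> list[dict]:
    messages = list(window["head"])            # cached prefix, byte-stable
    if window["summary"]:                      # compaction output goes AFTER it
        messages.append({"role": "user",
                         "content": f"[Conversation so far]\n{window['summary']}"})
    return messages + list(window["recent"])

def compact_window(window: dict, summarize) -> dict:
    """Replace history with a summary without touching the cached head."""
    window["summary"] = summarize(window["recent"])
    window["recent"] = window["recent"][-3:]   # keep the last few rounds verbatim
    return window
```

Because `window["head"]` is reproduced byte-for-byte on every call, cache hits on the prefix survive every compaction event.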

Resumability and Re-Compaction

Applies when: your agent runs long enough that failure recovery matters, or a single conversation may be compacted multiple times.

Compaction is lossy. If execution fails after compaction but before task completion, can you recover?

Three mechanisms:

  • Pre-compaction logging — before compaction fires, log the full pre-compaction state to durable storage. On failure, reload and try a different strategy.
  • Compaction checkpoints — the compaction output includes a reference to the pre-compacted state. Resumption loads the summary but keeps the reference for re-expansion if needed.
  • Parallel recovery channels — an independent, always-appended artifact (memory, notes, auto-memory) captures key decisions outside the compacted region. Claude Code’s Auto Memory is an instance.

Designing for re-compaction: at some point the compacted summary itself will be re-compacted (summary-of-summary). Each pass loses fidelity. Anticipate:

  • Cap the compaction count on any given conversation; beyond N, start a new session with explicit hand-off.
  • Preserve a “core identity” region that never gets re-summarized — user intent, architectural decisions.
  • Weight quality metrics differently for older, multi-compressed summaries.
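The first two safeguards can be sketched together: a capped compaction counter plus a pinned core region that is never re-summarized. The `Session` shape and the cap of 3 are illustrative assumptions:

```python
# Hypothetical sketch: cap re-compaction and pin a never-resummarized core.
from dataclasses import dataclass

@dataclass
class Session:
    core: str                      # user intent + architectural decisions
    summary: str = ""
    compaction_count: int = 0
    max_compactions: int = 3       # illustrative cap

def recompact(session: Session, new_summary: str) -> Session:
    if session.compaction_count >= session.max_compactions:
        # Beyond the cap: start a fresh session with an explicit hand-off
        # instead of producing yet another summary-of-summary.
        return Session(core=session.core, summary=f"[handoff] {new_summary}")
    session.summary = new_summary
    session.compaction_count += 1
    return session
```

Note that `core` is carried across the hand-off untouched — it lives outside the region the summarizer is ever allowed to rewrite.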

Measuring Compaction

Compaction is high-risk: lossy operations on long-running state. It rewards careful measurement more than most context-engineering topics.

Five signals to track:

| Signal | What it tells you |
| --- | --- |
| Compression ratio | Input tokens ÷ summary tokens. Too high = over-compression. Too low = wasted work. |
| Trigger precision | Fraction of compactions that were actually needed. Low = triggering too early. |
| Post-compaction regression | Force compaction mid-task and replay. Tasks that fail only post-compaction identify what the compactor should preserve. |
| Summary-of-summary degradation | How fidelity drops on the Nth re-compaction. If steep, cap the re-compaction count. |
| Compaction cost amortization | Summary call cost ÷ turns-until-next-compaction. Helps decide whether each tier is worth its compute. |

Synthetic Regression Suite

The single most valuable test: keep 10–20 long tasks that historically succeeded. Force compaction at fixed and variable points. Replay them. Flag any regression.

Run this suite whenever you change compaction logic — custom instructions, trigger strategy, preservation policy. It catches more problems than any other measurement approach.
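The suite's loop is simple enough to sketch. Here `run_task` and `passes` are stand-ins for your own replay harness and task-specific success check; `compact_at` is the window fraction at which compaction is forced, or `None` for the uncompacted baseline:

```python
# Hypothetical sketch of the synthetic regression suite: replay known-good
# tasks with compaction forced at several points and flag regressions.
def regression_suite(tasks, run_task, passes, points=(0.5, 0.7, 0.9)):
    """Return the (task, point) pairs where compaction caused a regression."""
    failures = []
    for task in tasks:
        baseline = run_task(task, compact_at=None)
        if not passes(task, baseline):
            continue                       # not a known-good task; skip it
        for point in points:
            result = run_task(task, compact_at=point)
            if not passes(task, result):
                failures.append((task, point))
    return failures
```

Each failure pair tells you both which task broke and how early the trigger has to fire for the breakage to appear — useful when deciding what the preservation policy is missing.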


Cross-Framework Reference

A survey of how the major frameworks expose compaction. Useful for calibration against industry practice; not a prescription.

| Framework | Trigger style | Preservation | Customization | Multi-agent | Recovery |
| --- | --- | --- | --- | --- | --- |
| Anthropic API | Fixed threshold (configurable) | Automatic, or manual pause_after | instructions parameter | Per-conversation | compaction block + cache_control |
| Claude Code | Auto + /compact command | Recent turns + auto memory | /compact "focus on ..." | Per-session | Auto Memory parallel channel |
| LangChain Deep Agents | Autonomous (model decides) | Recent 10% + middleware rules | Summarization middleware | Per-agent | Virtual filesystem of history |
| OpenAI Agents SDK | Trim (drop) or summarize | Last N turns verbatim | Custom summarizer function | Per-session | Session store |
| CrewAI | respect_context_window | summarize_messages() chunks | Limited | Per-agent | Shared memory class |
| AutoGen | Per-manager | Shared conversation compression | Group manager configuration | Centralized (unique) | Delegated to coordinator |

Design Bets Each Framework Represents

No framework is “best” — they reflect different design choices:

  • AutoGen bets on coordinated multi-agent — worth it when agents share context heavily
  • LangChain bets on agent judgment — worth it for autonomous agents with variable workloads
  • OpenAI bets on simplicity — two options (trim / summarize), clear mental model
  • Anthropic bets on configurable server-side primitives — infrastructure, not policy
  • CrewAI bets on one-setting simplicity — opinionated defaults, no choice paralysis
  • Claude Code bets on developer-in-the-loop — /compact with user instructions, less autonomy

If you’re choosing a framework with compaction in mind, name which bet matches your agent shape first; the rest follows.


  • Overview — Return to the section hub.
  • Context Management — The rest of the runtime discipline: attention budgets, just-in-time context, sub-agent isolation, checkpointing. Compaction fires when those techniques hit their limits.
  • Memory Design — Structured notes and memory complement compaction: what persists outside the window doesn’t need to be compressed, only retrieved.
  • From Case to Paradigm → Part 2 — “Where the Method Stops Scaling” names when compaction becomes a design-essential rather than an emergency response.
