Context Management
The runtime discipline — attention budgets, the compaction spectrum, just-in-time context loading, progressive disclosure, and sub-agents as a context-engineering tool
The Runtime Problem
Prompt design fixes a static skeleton. Memory design decides what lives outside the window. Compaction handles the case when the window overflows anyway. Context management is everything that happens around those — the runtime discipline of deciding, turn after turn, what should be in the window right now: which inputs to load, which to prune, when to delegate, how to recover, and how to notice when something has gone wrong.
The hard part is that the right answer changes constantly:
- A tool result that was relevant at turn 3 is noise at turn 30.
- A file the agent just edited is critical at turn 10 and forgettable at turn 20.
- A sub-agent’s full trace matters while it runs and becomes clutter the moment it returns a summary.
A good context manager solves a sequential decision problem: given the current state of the window, what should be kept, pruned, loaded, or summarized before the next model call? Getting this right is what separates an agent that stays coherent over 100 turns from one that loses the thread at turn 15.
The Attention Budget
Treat the context window as a budget with a soft ceiling, not a hard limit. The ceiling is set not by the model’s documented context size, but by the point where measurable quality starts to degrade — typically well below the advertised maximum.
A workable rule:
Define a watermark — roughly 40-60% of the usable context size. When the window exceeds the watermark, the context manager acts. Below the watermark, leave everything alone.
Why a watermark instead of compacting every turn:
- Compaction costs tokens (if it uses an LLM) or information (if it is a simple truncation). Cheap when rare, wasteful when constant.
- The cheapest context operation is no operation. Don’t pay for management the agent doesn’t need yet.
- Some tasks finish before reaching the watermark — their “optimal compaction” is zero.
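The watermark rule can be sketched in a few lines. This is a minimal illustration, not a production manager; `manage_context` and the injected `compact` callable are hypothetical names.

```python
# Hypothetical sketch: trigger context management only past a watermark.
WATERMARK = 0.5  # fraction of usable (not advertised) context size

def manage_context(messages, token_count, usable_window, compact):
    """Run the compactor only when occupancy crosses the watermark."""
    occupancy = token_count / usable_window
    if occupancy <= WATERMARK:
        return messages           # below the watermark: the cheapest op is no op
    return compact(messages)      # above it: hand off to the compactor

# Usage: a truncating compactor and a 100k-token usable window
msgs = ["m"] * 10
assert manage_context(msgs, 30_000, 100_000, lambda m: m[-3:]) == msgs
assert manage_context(msgs, 60_000, 100_000, lambda m: m[-3:]) == msgs[-3:]
```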
The budget isn’t just the window. It’s also:
- Token cost — cache hits are 10× cheaper than writes, so keeping the stable prefix intact has real dollar value.
- Latency — every compaction that calls an LLM is a round trip the user waits on.
- Quality — any lossy operation is a bet that what you drop is worth less than what you keep.
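The dollar value of a stable prefix is easy to estimate. A toy two-rate price model, assuming the 10× cached/uncached ratio above; the function name and default price are illustrative, not any provider's actual schedule:

```python
# Hypothetical price model: cached input tokens bill at 1/10 the write rate.
def turn_cost_usd(prefix_tokens, new_tokens, price_per_mtok=3.0, cached=True):
    """Dollar cost of one model call under a simple two-rate scheme."""
    prefix_rate = price_per_mtok / 10 if cached else price_per_mtok
    return (prefix_tokens * prefix_rate + new_tokens * price_per_mtok) / 1e6
```

A context operation that reorders the prefix flips `cached` to False and multiplies the prefix portion of every subsequent turn by 10.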
The Role of Compaction Here
When the watermark is crossed, the runtime’s answer is compaction — the operation covered in detail on the previous page. This page doesn’t re-teach it; it covers everything else the runtime does to keep the agent inside the budget so compaction doesn’t have to fire as often: loading lazily (JIT), isolating sub-agent work, checkpointing durable state, and catching failure modes early.
Compaction is the last-resort operation. The techniques that follow are about not needing the last resort.
Just-in-Time Context
Rather than pre-loading everything potentially relevant, a well-designed agent maintains lightweight identifiers and loads content only when needed.
Agents maintain lightweight identifiers (file paths, stored queries, web links) and dynamically load data into context at runtime using tools. — Effective Context Engineering for AI Agents
This mirrors how humans work. You don’t memorize the entire codebase; you remember where to look. A good agent holds pointers, not payloads.
Examples:
- Instead of `read_file("large_config.json")` at task start, the agent discovers the path through `glob` and reads only if needed.
- Instead of dumping all previous chat history, the agent’s memory index lists titles; it fetches content only for relevant entries.
- Instead of loading every file in a module, the agent `grep`s for the symbol and reads only the matches.
Tool design for JIT
For just-in-time loading to work, tools must return metadata before content:
- `glob` returns paths, sizes, timestamps — not contents
- `grep` streams matching lines — not whole files
- `read_file` accepts offsets and limits — not just “read everything”
- `search` returns ranked snippets — not full documents
Tools that can’t stage their output (they always return the full thing) force the agent into either pre-loading or expensive compaction. When you design a tool, ask: “can the agent act on metadata alone for the common case?” If yes, return metadata first and let the agent drill down.
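A metadata-first tool can be sketched like this; `glob_metadata` is a hypothetical name, standing in for whatever listing tool the agent exposes:

```python
import os

def glob_metadata(root, suffix):
    """Hypothetical JIT-friendly tool: return paths, sizes, and timestamps,
    never contents; the agent drills down with a read tool only if needed."""
    entries = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                entries.append({"path": path, "size": st.st_size,
                                "mtime": st.st_mtime})
    return entries
```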
Progressive disclosure
The loading pattern that falls out of good JIT tool design is progressive disclosure: the agent’s understanding builds layer by layer, pulling in detail only when a decision requires it.
- File sizes suggest complexity
- Naming conventions suggest purpose
- Timestamps suggest relevance
- Directory structure suggests architecture
At each layer, the agent decides whether to descend further or pivot. The context stays small because most branches never get explored.
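The descend-or-pivot loop can be sketched as a function over a metadata listing; `explore`, `looks_relevant`, and `read_file` are illustrative stand-ins for the agent's actual tools:

```python
def explore(listing, looks_relevant, read_file):
    """Descend a layer only when metadata alone cannot settle the decision."""
    for entry in listing:                    # layer 1: metadata only
        if looks_relevant(entry):            # decide before paying for content
            return read_file(entry["path"])  # layer 2: load just this branch
    return None                              # most branches are never explored
```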
Hybrid strategy
Pure JIT is impractical because some context is always worth loading. A pure-JIT agent would re-discover the project’s conventions on every task. The usual sweet spot is hybrid:
- Pre-computed retrieval for known-important context — a `CLAUDE.md` or project-brief file loaded at startup. Small, stable, always relevant.
- Autonomous exploration via tools — everything else, loaded on demand.
The pre-loaded portion should be small and genuinely universal. If a document is only relevant to 20% of tasks, it belongs in the JIT layer, not the always-on layer.
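A minimal sketch of the hybrid split, assuming a hypothetical `build_startup_context` helper: universal docs are read at startup, everything else enters the window only as identifiers.

```python
def build_startup_context(always_on_paths, memory_index):
    """Hybrid loading: read only the small, universal docs at startup;
    everything else is listed as identifiers for just-in-time retrieval."""
    blocks = [open(p, encoding="utf-8").read() for p in always_on_paths]
    blocks.append("Memory index: " + ", ".join(memory_index))  # pointers, not payloads
    return blocks
```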
Sub-Agents as Context Tool
Delegation is often framed as an orchestration pattern. It is also — perhaps primarily — a context engineering technique.
A sub-agent runs with a clean context window. It receives a delegation prompt, performs its work (possibly dozens of its own tool calls), and returns a single condensed result — typically 1-2k tokens. The parent’s window sees only the delegation and the summary; the sub-agent’s intermediate trace never enters the parent’s context.
This matters because agents that run long internal loops — search, extract, click, search, extract — generate huge amounts of intermediate context that is meaningful to them but noise to the parent. Keeping that trace out of the parent’s window preserves the parent’s attention for the task that actually needs its judgment.
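The isolation can be made concrete in a few lines; `delegate` and `run_subagent` are hypothetical names for the harness-level plumbing:

```python
def delegate(task_prompt, run_subagent):
    """The parent's window records only the delegation and the condensed
    result; the sub-agent's intermediate trace never enters it."""
    summary = run_subagent(task_prompt)  # sub-agent runs its own loop elsewhere
    return [{"role": "user", "content": task_prompt},
            {"role": "assistant", "content": summary}]
```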
When to reach for delegation:
- Deep, focused exploration — “find all references to X across the codebase” generates many file reads the parent will never re-read
- Multi-step loops with local state — browser navigation, database exploration, iterative search
- Parallelizable work — fanning out across items, with each result independently useful
Designing the delegation
The quality of a sub-agent result depends almost entirely on the quality of the delegation prompt. Good delegation prompts:
- State the goal, not the procedure (“find the function that handles X” — not “grep for Y then read Z”)
- Specify the return contract — shape, length, and what to include
- List known constraints — “don’t modify files”, “budget: 10 tool calls”, “stop at first match”
- Include any context the sub-agent can’t derive — relevant file paths, hypotheses to try
Bad delegation prompts offload the thinking without framing the problem. The sub-agent thrashes, returns a dump, and the parent has to re-read it to extract the answer — erasing the savings.
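The four bullets above can be folded into a template. The field names and the filled-in values here are illustrative, not a prescribed schema:

```python
# Hypothetical delegation template covering goal, contract, constraints, context.
DELEGATION_TEMPLATE = """Goal: {goal}
Return: {contract}
Constraints: {constraints}
Context: {context}"""

prompt = DELEGATION_TEMPLATE.format(
    goal="Find the function that handles retry backoff",
    contract="path + function name + a 3-sentence summary, under 200 tokens",
    constraints="read-only; budget: 10 tool calls; stop at first match",
    context="likely under src/net/; hypothesis: a name ending in _backoff",
)
```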
Cost of delegation
Sub-agents are not free:
- The delegation prompt is paid for on top of the parent’s turn
- The sub-agent’s own tool calls are billed
- Coordination (did the sub-agent actually understand the task?) adds latency
Use delegation when the saved context is worth more than the extra tokens. As a rough heuristic: for a task of a handful of steps, direct execution is cheaper; for a task of dozens of steps with a narrowly defined output, delegation is almost always cheaper. The break-even point depends on model prices, cache hit rates, and how much of the sub-agent’s work would have polluted the parent’s window — measure before committing to a default.
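A toy break-even model, under a strong simplifying assumption: the one-off cost of doing the work is paid either way, so only the context carried forward on later parent turns differs. A trace kept in the parent's window is re-billed every remaining turn; a delegation re-bills only its prompt plus the summary.

```python
def delegation_pays_off(trace_tokens, summary_tokens, delegation_tokens,
                        remaining_parent_turns):
    """Toy model: compare the carried-forward context cost of keeping the
    full trace vs. keeping only the delegation prompt and summary."""
    direct = trace_tokens * remaining_parent_turns
    delegated = (delegation_tokens + summary_tokens) * remaining_parent_turns
    return delegated < direct
```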
Checkpointing and Resumability
Long-running agents span more than one continuous session. A user may close the tab, a connection may drop, a process may crash. The context manager must make resumption cheap — ideally O(summary size), not O(full history).
The key insight: the full conversation history is never what you actually need to resume. You need the state at the last consistent checkpoint: compaction summaries, the working set of files, pending todos. That state is far smaller than the raw history.
A workable checkpoint shape:
| Element | Why it’s needed |
|---|---|
| Compaction summary | Replaces the summarized portion of history |
| Uncompressed tail | The most recent N rounds, kept verbatim |
| Open todos | What the agent was working on |
| Working file set | Files in scope for the current task (paths, not contents) |
| Memory index | Titles of persisted memory available for recall |
Resume means: load the checkpoint, reconstruct the context from it, and continue. The user perceives no discontinuity because everything essential is there — even though the raw message history was never replayed.
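The checkpoint shape in the table maps directly onto a small record type; the class and field names here are one possible rendering, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """Durable description of 'where the agent is'."""
    compaction_summary: str   # replaces the summarized portion of history
    tail: list                # most recent N rounds, kept verbatim
    open_todos: list          # what the agent was working on
    working_files: list       # paths in scope, not contents
    memory_index: list        # titles of persisted memory available for recall

def resume(cp):
    """Reconstruct a working context in O(summary size), not O(full history)."""
    return [{"role": "system", "content": cp.compaction_summary}, *cp.tail]
```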
This mechanism interacts with compaction: when compaction runs, it should also update the checkpoint, so resumption never needs to reconstruct compaction state from scratch. Treat checkpoint + compaction state as the durable description of “where the agent is”, and the message history as an append-only log on top.
Failure Modes
Runtime problems — ways the context-management loop breaks even when prompt and memory are well-designed. For design-time mistakes in each pillar, see the “Anti-Patterns” section on the Prompt Design and Memory Design pages.
| Failure mode | Signal | Fix |
|---|---|---|
| Context bloat | Window past 70% full, agent starts missing earlier decisions | Earlier watermark; lighter compaction technique for late rounds |
| Over-compaction | Agent asks “what was I doing?” after a compaction event | Preserve more recent turns; preserve file-edit state verbatim |
| Skeleton drift | System prompt has grown to 15k tokens over time as rules accumulated | Audit skeleton; move infrequently-needed content to JIT |
| Sub-agent context leak | Sub-agent returns a full trace instead of a summary | Tighten return contract; shorter delegation prompt |
| JIT thrashing | Same file re-read 5 times in one turn | Keep a fetched-this-turn cache; refer to it by identifier |
| Stale checkpoint | Resume lands the agent in a state that no longer matches reality (files changed) | Verify file hashes on resume; re-read if stale |
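The stale-checkpoint fix from the last row can be sketched as a hash check on resume; `stale_entries` is a hypothetical helper, assuming the checkpoint stores a content hash per working-set path:

```python
import hashlib
import os

def stale_entries(working_files):
    """Re-hash each (path, saved_hash) pair in the working set; anything
    whose hash changed, or that disappeared, must be re-read before use."""
    stale = []
    for path, saved_hash in working_files:
        if not os.path.exists(path):
            stale.append(path)
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != saved_hash:
                stale.append(path)
    return stale
```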
Strategy Composition
No single technique handles long-running context alone. A well-engineered system uses each one where it’s most effective:
| Strategy | Best at | Pairs with |
|---|---|---|
| Tool clearing | Routine large tool outputs the agent has already processed | Every other technique — it’s the baseline |
| Compaction | Long back-and-forth conversations with natural round boundaries | Structured notes, so nothing critical is lost |
| Structured notes | Iterative development with clear milestones, cross-session resumability | Compaction, so the notes survive |
| Sub-agents | Deep focused work whose intermediate trace is noise to the parent | Memory, so sub-agent learnings persist |
| JIT loading | Large reference material the agent needs rarely | Good tool design |
These strategies compose. Compaction for conversation history, structured notes for cross-session state, sub-agents for deep research, JIT loading for most file reads, tool clearing as the always-on hygiene — each applied where it is most effective, with the attention budget as the common currency.
The Underlying Discipline
Every technique in this page — compaction, JIT, sub-agents, checkpointing — is an instance of one discipline:
Keep the window filled, and only filled, with what the agent needs to be working on right now.
This is the runtime translation of the overview’s guiding principle: the overview asks “what tokens”; this page asks “which tokens, this turn, and what leaves to make room”. Same idea, different axis. A prompt you wrote once gets re-interpreted by the runtime on every turn — context management is that re-interpretation discipline.
Not what the window might need. Not what it used to need. Not a safety net of “just in case” context. The closer the window is to that target, the better the agent performs. The further it drifts, the sooner it degrades.
Context management is the pillar that decides whether the static skeleton (prompt design) and the persistent store (memory design) actually pay off. Good prompts and good memory are necessary but not sufficient. Until the runtime consistently serves the right tokens at the right time, the agent’s measured performance will be a pale shadow of what the same prompts and memory could achieve.
Measuring It
Context management benefits from both per-turn metrics and per-task metrics:
- Context occupancy — track window fill over the lifetime of a task. A flat line near the watermark is healthy; a sawtooth climbing past it and crashing via compaction is a signal your watermark is set too high.
- Compaction frequency — how often does each tier fire per 100 turns? Heavy summarization firing often means earlier tiers aren’t doing enough.
- Cache hit rate — direct readout on whether caching-aware ordering is paying off. Aim for 80%+ on stable tasks.
- Recovery-from-checkpoint success — for resumable agents, what fraction of resumes land in a usable state? Below ~95% means the checkpoint shape is missing something.
- Post-compaction regression — synthetic test: take a successful long task, force compaction at varying points, replay. Tasks that fail only after compaction indicate what the compactor dropped that shouldn’t have.
One alarm worth wiring: delta between actual and model-reported tokens. Large drift means your token accounting is wrong and your watermark is meaningless.
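That alarm is a one-liner; the name and the 5% tolerance here are illustrative defaults, not a recommendation:

```python
def token_drift_alarm(local_count, reported_count, tolerance=0.05):
    """Fire when locally tracked tokens drift from the model-reported count;
    past the tolerance, the watermark is being computed on bad data."""
    drift = abs(local_count - reported_count) / max(reported_count, 1)
    return drift > tolerance
```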
Related Reading
- Prompt Design — Every compaction technique has to preserve the prompt’s stable prefix to keep caching intact. Compaction that invalidates the cache pays twice.
- Memory Design — The externalization strategy for information the context manager would otherwise have to preserve across compactions. Memory and compaction compose: durable things go to memory, ephemeral things get compacted.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- Prompt caching — Anthropic, Claude API docs
- Effective Harnesses for Long-Running Agents — Anthropic, 2025
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023, the empirical basis for “context rot” and position-sensitive retrieval
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection — Asai et al., 2023, related framing of just-in-time retrieval