Context Management
The runtime discipline — attention budgets, the compaction spectrum, just-in-time context loading, progressive disclosure, and sub-agents as a context-engineering tool
The Runtime Problem
Prompt design fixes a static skeleton. Memory design decides what lives outside the window. Compaction handles the case when the window overflows anyway. Context management is everything that happens around those — the runtime discipline of deciding, turn after turn, what should be in the window right now: which inputs to load, which to prune, when to delegate, how to recover, and how to notice when something has gone wrong.
The hard part is that the right answer changes constantly:
- A tool result that was relevant at turn 3 is noise at turn 30.
- A file the agent just edited is critical at turn 10 and forgettable at turn 20.
- A sub-agent’s full trace matters while it runs and becomes clutter the moment it returns a summary.
A good context manager solves a sequential decision problem: given the current state of the window, what should be kept, pruned, loaded, or summarized before the next model call? Getting this right is what separates an agent that stays coherent over 100 turns from one that loses the thread at turn 15.
The Attention Budget
Treat the context window as a budget with a soft ceiling, not a hard limit. The ceiling is set not by the model’s documented context size, but by the point where measurable quality starts to degrade — typically well below the advertised maximum.
A workable rule:
Define a watermark — roughly 40-60% of the usable context size. When the window exceeds the watermark, the context manager acts. Below the watermark, leave everything alone.
Why a watermark instead of compacting every turn:
- Compaction costs tokens (if it uses an LLM) or information (if it is a simple truncation). Cheap when rare, wasteful when constant.
- The cheapest context operation is no operation. Don’t pay for management the agent doesn’t need yet.
- Some tasks finish before reaching the watermark — their “optimal compaction” is zero.
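The watermark rule can be sketched in a few lines. This is a minimal illustration, not a production manager; `manage_context` and the injected `compact` callable are hypothetical names.

```python
# Hypothetical sketch: trigger context management only past a watermark.
WATERMARK = 0.5  # fraction of usable (not advertised) context size

def manage_context(messages, token_count, usable_window, compact):
    """Run the compactor only when occupancy crosses the watermark."""
    occupancy = token_count / usable_window
    if occupancy <= WATERMARK:
        return messages           # below the watermark: the cheapest op is no op
    return compact(messages)      # above it: hand off to the compactor

# Usage: a truncating compactor and a 100k-token usable window
msgs = ["m"] * 10
assert manage_context(msgs, 30_000, 100_000, lambda m: m[-3:]) == msgs
assert manage_context(msgs, 60_000, 100_000, lambda m: m[-3:]) == msgs[-3:]
```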
The budget isn’t just the window. It’s also:
- Token cost — cache hits are 10× cheaper than writes, so keeping the stable prefix intact has real dollar value.
- Latency — every compaction that calls an LLM is a round trip the user waits on.
- Quality — any lossy operation is a bet that what you drop is worth less than what you keep.
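The dollar value of a stable prefix is easy to estimate. A toy two-rate price model, assuming the 10× cached/uncached ratio above; the function name and default price are illustrative, not any provider's actual schedule:

```python
# Hypothetical price model: cached input tokens bill at 1/10 the write rate.
def turn_cost_usd(prefix_tokens, new_tokens, price_per_mtok=3.0, cached=True):
    """Dollar cost of one model call under a simple two-rate scheme."""
    prefix_rate = price_per_mtok / 10 if cached else price_per_mtok
    return (prefix_tokens * prefix_rate + new_tokens * price_per_mtok) / 1e6
```

A context operation that reorders the prefix flips `cached` to False and multiplies the prefix portion of every subsequent turn by 10.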
The Role of Compaction Here
When the watermark is crossed, the runtime’s answer is compaction — the operation covered in detail on the previous page. This page doesn’t re-teach it; it covers everything else the runtime does to keep the agent inside the budget so compaction doesn’t have to fire as often: loading lazily (JIT), isolating sub-agent work, checkpointing durable state, and catching failure modes early.
Compaction is the last-resort operation. The techniques that follow are about not needing the last resort.
Just-in-Time Context
Rather than pre-loading everything potentially relevant, a well-designed agent maintains lightweight identifiers and loads content only when needed.
Agents maintain lightweight identifiers (file paths, stored queries, web links) and dynamically load data into context at runtime using tools. — Effective Context Engineering for AI Agents
This mirrors how humans work. You don’t memorize the entire codebase; you remember where to look. A good agent holds pointers, not payloads.
Examples:
- Instead of `read_file("large_config.json")` at task start, the agent discovers the path through `glob` and reads only if needed.
- Instead of dumping all previous chat history, the agent’s memory index lists titles; it fetches content only for relevant entries.
- Instead of loading every file in a module, the agent `grep`s for the symbol and reads only the matches.
Tool design for JIT
For just-in-time loading to work, tools must return metadata before content:
- `glob` returns paths, sizes, timestamps — not contents
- `grep` streams matching lines — not whole files
- `read_file` accepts offsets and limits — not just “read everything”
- `search` returns ranked snippets — not full documents
Tools that can’t stage their output (they always return the full thing) force the agent into either pre-loading or expensive compaction. When you design a tool, ask: “can the agent act on metadata alone for the common case?” If yes, return metadata first and let the agent drill down.
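A metadata-first tool can be sketched like this; `glob_metadata` is a hypothetical name, standing in for whatever listing tool the agent exposes:

```python
import os

def glob_metadata(root, suffix):
    """Hypothetical JIT-friendly tool: return paths, sizes, and timestamps,
    never contents; the agent drills down with a read tool only if needed."""
    entries = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                entries.append({"path": path, "size": st.st_size,
                                "mtime": st.st_mtime})
    return entries
```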
Progressive disclosure
The loading pattern that falls out of good JIT tool design is progressive disclosure: the agent’s understanding builds layer by layer, pulling in detail only when a decision requires it.
- File sizes suggest complexity
- Naming conventions suggest purpose
- Timestamps suggest relevance
- Directory structure suggests architecture
At each layer, the agent decides whether to descend further or pivot. The context stays small because most branches never get explored.
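The descend-or-pivot loop can be sketched as a function over a metadata listing; `explore`, `looks_relevant`, and `read_file` are illustrative stand-ins for the agent's actual tools:

```python
def explore(listing, looks_relevant, read_file):
    """Descend a layer only when metadata alone cannot settle the decision."""
    for entry in listing:                    # layer 1: metadata only
        if looks_relevant(entry):            # decide before paying for content
            return read_file(entry["path"])  # layer 2: load just this branch
    return None                              # most branches are never explored
```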
Hybrid strategy
Pure JIT is impractical because some context is always worth loading. A pure-JIT agent would re-discover the project’s conventions on every task. The usual sweet spot is hybrid:
- Pre-computed retrieval for known-important context — a `CLAUDE.md` or project-brief file loaded at startup. Small, stable, always relevant.
- Autonomous exploration via tools — everything else, loaded on demand.
The pre-loaded portion should be small and genuinely universal. If a document is only relevant to 20% of tasks, it belongs in the JIT layer, not the always-on layer.
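A minimal sketch of the hybrid split, assuming a hypothetical `build_startup_context` helper: universal docs are read at startup, everything else enters the window only as identifiers.

```python
def build_startup_context(always_on_paths, memory_index):
    """Hybrid loading: read only the small, universal docs at startup;
    everything else is listed as identifiers for just-in-time retrieval."""
    blocks = [open(p, encoding="utf-8").read() for p in always_on_paths]
    blocks.append("Memory index: " + ", ".join(memory_index))  # pointers, not payloads
    return blocks
```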
Sub-Agents as Context Tool
Delegation is often framed as an orchestration pattern. It is also — perhaps primarily — a context engineering technique.
A sub-agent runs with a clean context window. It receives a delegation prompt, performs its work (possibly dozens of its own tool calls), and returns a single condensed result — typically 1-2k tokens. The parent’s window sees only the delegation and the summary; the sub-agent’s intermediate trace never enters the parent’s context.
This matters because agents that run long internal loops — search, extract, click, search, extract — generate huge amounts of intermediate context that is meaningful to them but noise to the parent. Keeping that trace out of the parent’s window preserves the parent’s attention for the task that actually needs its judgment.
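The isolation can be made concrete in a few lines; `delegate` and `run_subagent` are hypothetical names for the harness-level plumbing:

```python
def delegate(task_prompt, run_subagent):
    """The parent's window records only the delegation and the condensed
    result; the sub-agent's intermediate trace never enters it."""
    summary = run_subagent(task_prompt)  # sub-agent runs its own loop elsewhere
    return [{"role": "user", "content": task_prompt},
            {"role": "assistant", "content": summary}]
```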
When to reach for delegation:
- Deep, focused exploration — “find all references to X across the codebase” generates many file reads the parent will never re-read
- Multi-step loops with local state — browser navigation, database exploration, iterative search
- Parallelizable work — fanning out across items, with each result independently useful
Designing the delegation
The quality of a sub-agent result depends almost entirely on the quality of the delegation prompt. Good delegation prompts:
- State the goal, not the procedure (“find the function that handles X” — not “grep for Y then read Z”)
- Specify the return contract — shape, length, and what to include
- List known constraints — “don’t modify files”, “budget: 10 tool calls”, “stop at first match”
- Include any context the sub-agent can’t derive — relevant file paths, hypotheses to try
Bad delegation prompts offload the thinking without framing the problem. The sub-agent thrashes, returns a dump, and the parent has to re-read it to extract the answer — erasing the savings.
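The four bullets above can be folded into a template. The field names and the filled-in values here are illustrative, not a prescribed schema:

```python
# Hypothetical delegation template covering goal, contract, constraints, context.
DELEGATION_TEMPLATE = """Goal: {goal}
Return: {contract}
Constraints: {constraints}
Context: {context}"""

prompt = DELEGATION_TEMPLATE.format(
    goal="Find the function that handles retry backoff",
    contract="path + function name + a 3-sentence summary, under 200 tokens",
    constraints="read-only; budget: 10 tool calls; stop at first match",
    context="likely under src/net/; hypothesis: a name ending in _backoff",
)
```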
Cost of delegation
Sub-agents are not free:
- The delegation prompt is paid for on top of the parent’s turn
- The sub-agent’s own tool calls are billed
- Coordination (did the sub-agent actually understand the task?) adds latency
Use delegation when the saved context is worth more than the extra tokens. As a rough heuristic: for a task of a handful of steps, direct execution is cheaper; for a task of dozens of steps with a narrowly defined output, delegation is almost always cheaper. The break-even point depends on model prices, cache hit rates, and how much of the sub-agent’s work would have polluted the parent’s window — measure before committing to a default.
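A toy break-even model, under a strong simplifying assumption: the one-off cost of doing the work is paid either way, so only the context carried forward on later parent turns differs. A trace kept in the parent's window is re-billed every remaining turn; a delegation re-bills only its prompt plus the summary.

```python
def delegation_pays_off(trace_tokens, summary_tokens, delegation_tokens,
                        remaining_parent_turns):
    """Toy model: compare the carried-forward context cost of keeping the
    full trace vs. keeping only the delegation prompt and summary."""
    direct = trace_tokens * remaining_parent_turns
    delegated = (delegation_tokens + summary_tokens) * remaining_parent_turns
    return delegated < direct
```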
Checkpointing and Resumability
Long-running agents span more than one continuous session. A user may close the tab, a connection may drop, a process may crash. The context manager must make resumption cheap — ideally O(summary size), not O(full history).
The key insight: the full conversation history is never what you actually need to resume. You need the state at the last consistent checkpoint: compaction summaries, the working set of files, pending todos. That state is far smaller than the raw history.
A workable checkpoint shape:
| Element | Why it’s needed |
|---|---|
| Compaction summary | Replaces the summarized portion of history |
| Uncompressed tail | The most recent N rounds, kept verbatim |
| Open todos | What the agent was working on |
| Working file set | Files in scope for the current task (paths, not contents) |
| Memory index | Titles of persisted memory available for recall |
Resume means: load the checkpoint, reconstruct the context from it, and continue. The user perceives no discontinuity because everything essential is there — even though the raw message history was never replayed.
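The checkpoint shape in the table maps directly onto a small record type; the class and field names here are one possible rendering, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """Durable description of 'where the agent is'."""
    compaction_summary: str   # replaces the summarized portion of history
    tail: list                # most recent N rounds, kept verbatim
    open_todos: list          # what the agent was working on
    working_files: list       # paths in scope, not contents
    memory_index: list        # titles of persisted memory available for recall

def resume(cp):
    """Reconstruct a working context in O(summary size), not O(full history)."""
    return [{"role": "system", "content": cp.compaction_summary}, *cp.tail]
```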
This mechanism interacts with compaction: when compaction runs, it should also update the checkpoint, so resumption never needs to reconstruct compaction state from scratch. Treat checkpoint + compaction state as the durable description of “where the agent is”, and the message history as an append-only log on top.
Failure Modes
Runtime problems — ways the context-management loop breaks even when prompt and memory are well-designed. For design-time mistakes in each pillar, see the “Anti-Patterns” section on the Prompt Design and Memory Design pages.
| Failure mode | Signal | Fix |
|---|---|---|
| Context bloat | Window past 70% full, agent starts missing earlier decisions | Earlier watermark; lighter compaction technique for late rounds |
| Over-compaction | Agent asks “what was I doing?” after a compaction event | Preserve more recent turns; preserve file-edit state verbatim |
| Skeleton drift | System prompt has grown to 15k tokens over time as rules accumulated | Audit skeleton; move infrequently-needed content to JIT |
| Sub-agent context leak | Sub-agent returns a full trace instead of a summary | Tighten return contract; shorter delegation prompt |
| JIT thrashing | Same file re-read 5 times in one turn | Keep a fetched-this-turn cache; refer to it by identifier |
| Stale checkpoint | Resume lands the agent in a state that no longer matches reality (files changed) | Verify file hashes on resume; re-read if stale |
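The stale-checkpoint fix from the last row can be sketched as a hash check on resume; `stale_entries` is a hypothetical helper, assuming the checkpoint stores a content hash per working-set path:

```python
import hashlib
import os

def stale_entries(working_files):
    """Re-hash each (path, saved_hash) pair in the working set; anything
    whose hash changed, or that disappeared, must be re-read before use."""
    stale = []
    for path, saved_hash in working_files:
        if not os.path.exists(path):
            stale.append(path)
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != saved_hash:
                stale.append(path)
    return stale
```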
Strategy Composition
No single technique handles long-running context alone. A well-engineered system uses each one where it’s most effective:
| Strategy | Best at | Pairs with |
|---|---|---|
| Tool clearing | Routine large tool outputs the agent has already processed | Every other technique — it’s the baseline |
| Compaction | Long back-and-forth conversations with natural round boundaries | Structured notes, so nothing critical is lost |
| Structured notes | Iterative development with clear milestones, cross-session resumability | Compaction, so the notes survive |
| Sub-agents | Deep focused work whose intermediate trace is noise to the parent | Memory, so sub-agent learnings persist |
| JIT loading | Large reference material the agent needs rarely | Good tool design |
These strategies compose. Compaction for conversation history, structured notes for cross-session state, sub-agents for deep research, JIT loading for most file reads, tool clearing as the always-on hygiene — each applied where it is most effective, with the attention budget as the common currency.
The Underlying Discipline
Every technique in this page — compaction, JIT, sub-agents, checkpointing — is an instance of one discipline:
Keep the window filled, and only filled, with what the agent needs to be working on right now.
This is the runtime translation of the overview’s guiding principle: the overview asks “what tokens”; this page asks “which tokens, this turn, and what leaves to make room”. Same idea, different axis. A prompt you wrote once gets re-interpreted by the runtime on every turn — context management is that re-interpretation discipline.
Not what the window might need. Not what it used to need. Not a safety net of “just in case” context. The closer the window is to that target, the better the agent performs. The further it drifts, the sooner it degrades.
Context management is the pillar that decides whether the static skeleton (prompt design) and the persistent store (memory design) actually pay off. Good prompts and good memory are necessary but not sufficient. Until the runtime consistently serves the right tokens at the right time, the agent’s measured performance will be a pale shadow of what the same prompts and memory could achieve.
Measuring It
Context management benefits from both per-turn metrics and per-task metrics:
- Context occupancy — track window fill over the lifetime of a task. A flat line near the watermark is healthy; a sawtooth climbing past it and crashing via compaction is a signal your watermark is set too high.
- Compaction frequency — how often does each tier fire per 100 turns? Heavy summarization firing often means earlier tiers aren’t doing enough.
- Cache hit rate — direct readout on whether caching-aware ordering is paying off. Aim for 80%+ on stable tasks.
- Recovery-from-checkpoint success — for resumable agents, what fraction of resumes land in a usable state? Below ~95% means the checkpoint shape is missing something.
- Post-compaction regression — synthetic test: take a successful long task, force compaction at varying points, replay. Tasks that fail only after compaction indicate what the compactor dropped that shouldn’t have.
One alarm worth wiring: delta between actual and model-reported tokens. Large drift means your token accounting is wrong and your watermark is meaningless.
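That alarm is a one-liner; the name and the 5% tolerance here are illustrative defaults, not a recommendation:

```python
def token_drift_alarm(local_count, reported_count, tolerance=0.05):
    """Fire when locally tracked tokens drift from the model-reported count;
    past the tolerance, the watermark is being computed on bad data."""
    drift = abs(local_count - reported_count) / max(reported_count, 1)
    return drift > tolerance
```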
Related Reading
- Prompt Design — Every compaction technique has to preserve the prompt’s stable prefix to keep caching intact. Compaction that invalidates the cache pays twice.
- Memory Design — The externalization strategy for information the context manager would otherwise have to preserve across compactions. Memory and compaction compose: durable things go to memory, ephemeral things get compacted.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- Prompt caching — Anthropic, Claude API docs
- Effective Harnesses for Long-Running Agents — Anthropic, 2025
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023, the empirical basis for “context rot” and position-sensitive retrieval
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection — Asai et al., 2023, related framing of just-in-time retrieval