Cache Point Design
For agent system authors: how to design an agent for a high cache hit rate. The block model comes first, then eight design moves for placing the 4 breakpoints, and finally the cache consequences of compaction design. Companion to the compaction page.
The Goal Stated Plainly
One number decides how much you pay for a long-running agent: the cache hit ratio. At Anthropic’s pricing, a cache hit costs 10% of an uncached read, while a cache write costs 125% of an uncached read (5-minute TTL) or 200% (1-hour TTL). Every token that flows through the agent lands in one of three buckets:
cost per token = 0.10 × p(hit)
+ 1.25 × p(write, 5m) // or 2.00 for 1h TTL
+ 1.00 × p(uncached)
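A 30-step ReAct task with hit ratio 0.90 costs roughly one-fifth of the same task run at hit ratio 0.0: by the formula above, about 0.19 to 0.22 per token (depending on whether the remaining 10% is uncached or 5-minute writes) against 1.00 uncached. The entire engineering goal of this page is moving p(hit) as close to 1 as possible, and keeping it there as the conversation grows and compaction fires.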
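The three-bucket formula can be evaluated directly. The sketch below is illustrative only: the function name and the example bucket shares are assumptions, and the multipliers are the ones quoted in the text.

```python
# Price multipliers relative to an uncached input token (taken from the text).
HIT = 0.10
WRITE = {"5m": 1.25, "1h": 2.00}
UNCACHED = 1.00


def cost_per_token(p_hit: float, p_write: float, p_uncached: float, ttl: str = "5m") -> float:
    """Blended cost multiplier for one input token."""
    total = p_hit + p_write + p_uncached
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the three bucket shares must sum to 1")
    return HIT * p_hit + WRITE[ttl] * p_write + UNCACHED * p_uncached


cold = cost_per_token(0.0, 0.0, 1.0)  # nothing cached: every token is uncached
warm = cost_per_token(0.9, 0.1, 0.0)  # 90% hits, remaining 10% are 5m writes
```

Here `cold` evaluates to 1.0 and `warm` to about 0.215, so the blended price is set almost entirely by how small the non-hit share can be kept.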
This page is the design playbook. Section 1 is the mental model (the facts, aligned with Anthropic’s docs). Section 2 is the eight-move playbook for placing the 4 breakpoints. Section 3 explains why three of those moves (4, 5, 6) are compaction decisions — the cache consequences of compaction design. Section 4 shows Zapvol’s implementation. Section 5 covers scenarios beyond the single-agent happy path (sub-agents, model switching, dev iteration, when not to cache). Sections 6–8 are diagnostics (failure modes, measurement, checklist).
1. The Mental Model in One Page
A cache hit requires two continuities to hold for the same breakpoint:
- Byte continuity — every byte of the prefix (from the start of the request through the breakpoint) is byte-identical to a prior cached entry.
- Spatial continuity — that prior entry was written within 20 block positions of the current breakpoint (the window counts the breakpoint itself as the first of the 20). Beyond that, the search terminates and earlier writes are invisible.
Break either and you pay a fresh write. Picture the 4 breakpoints as 4 anchors on a chain: byte stability is how firmly each anchor grips the rock; the 20-position reach is the maximum chain length between adjacent anchors. Every move in the playbook below maintains one or both continuities.
What gets hashed — the block model
Anthropic serializes a request into a stream of blocks:
tool[0], tool[1], … │ sys[0], sys[1], … │ msg[0].content[0], …, msg[N].content[L]
A block is one tool definition, one system text block, or one content block inside a message (text, tool_use,
tool_result, image, document, thinking). Each block has a position in this linear stream.
cache_control on a block means: “at this position, hash every byte up to and including this block; on a matching
hash read from cache; otherwise walk back up to 19 earlier positions looking for a prior write; in either case, ensure a
new entry exists at this breakpoint for future requests (silently skipped when the prefix is below the model minimum).”
You get at most 4 breakpoints per request. You cannot cache partial blocks and cannot force writes at intermediate
positions. The hash is byte-sensitive — JSON with identical semantic content but different key ordering hashes
differently, which is why tool-builder runtimes that randomize map iteration order (Swift, Go) silently break caches.
Anthropic does not canonicalize JSON for you.
How matching works
- At each breakpoint, hash the full prefix (from request start through the breakpoint). Look it up.
- If miss, walk backward up to 20 positions (counting the breakpoint itself as the first), checking at each whether a prior request’s cache_control successfully wrote an entry at that exact position. If no match is found in the window, the search stops and the entire prefix is processed fresh.
- First match wins. Blocks after the matched position are processed fresh; the current breakpoint writes a new entry under its own prefix hash.
The cache indexes prefixes, not individual blocks. A cache entry covers every byte from the start of the request up to and including the marked block — not the marked block alone. That is why any byte change in any earlier block invalidates every later breakpoint: they are all hashes of overlapping prefixes.
Anthropic’s exact phrasing: “It is looking for prior writes, not for stable content.” Translation: the lookback
only discovers positions where a prior request placed cache_control and succeeded in writing. Blocks that have merely
been present in past requests but were never marked are invisible — they simply do not exist in the cache’s index.
Concrete example (mirrors Anthropic’s docs):
- Turn 1: 10 blocks, breakpoint on block 10 → writes entry E1 = hash(prefix 0..10).
- Turn 2: 15 blocks, breakpoint on block 15 → hash(0..15) misses; the lookback walks backward and finds E1 at position 10 (within the window). Blocks 11–15 are processed fresh; writes E2 = hash(0..15).
- Turn 3: 35 blocks, breakpoint on block 35 → hash(0..35) misses; the lookback checks 20 positions (blocks 35 through 16) and finds nothing. E2 at position 15 sits one position outside the window → full MISS, full rewrite.
The turn-3 MISS is the motivating failure the compaction boundary (slot 2) is designed to prevent: a second explicit breakpoint placed closer to the tail keeps the lookback reaching a cached entry on every turn.
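The three-turn example can be replayed with a toy model of the lookback. This is a sketch of the spatial rule only, not Anthropic's implementation: it assumes every earlier byte is unchanged, tracks just the positions where an entry was written, and the names `lookup` and `LOOKBACK` are illustrative.

```python
from typing import Optional

LOOKBACK = 20  # positions checked, counting the breakpoint itself as the first


def lookup(breakpoint: int, written: set) -> Optional[int]:
    """Return the nearest position with a prior write inside the window, else None."""
    lowest = max(1, breakpoint - LOOKBACK + 1)
    for pos in range(breakpoint, lowest - 1, -1):
        if pos in written:
            return pos
    return None


written = set()
results = []
for tail in (10, 15, 35):      # breakpoint position on turns 1, 2 and 3
    results.append(lookup(tail, written))
    written.add(tail)          # each turn writes an entry at its own breakpoint
```

`results` ends up as `[None, 10, None]`: a cold write on turn 1, a hit on the turn-1 entry on turn 2, and a full miss on turn 3 because position 15 lies just below the lowest checked position, 16.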
What invalidates what
Because hashes cover all bytes up to the breakpoint, any byte change in any earlier block invalidates every later breakpoint. The cascade:
| Change | Invalidates |
|---|---|
| Tool definition bytes (any) | tools + system + messages |
| System text bytes (any) | system + messages |
| Web search toggle | system + messages |
| Citations toggle | system + messages |
| Speed setting (speed: "fast" vs standard) | system + messages |
| tool_choice parameter | messages only |
| Extended-thinking parameters (enable/budget) | messages only |
| Images added / removed anywhere | messages only |
| Non-tool-result user content (with thinking) | strips prior thinking blocks |
Minimum cacheable prefix
If the prefix at your breakpoint is below the model’s minimum, the cache write is silently skipped:
| Model | Minimum (tokens) |
|---|---|
| Claude Opus 4.7 / 4.6 / 4.5 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5 / 4 | 1,024 |
| Claude Opus 4.1 / 4 | 1,024 |
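Because a skipped write produces no error, a pre-flight check can make it visible. The sketch below copies the minimums from the table; the dictionary keys are illustrative labels rather than API model IDs, and the token count is assumed to come from whatever counter the harness already uses.

```python
# Minimums copied from the table above; keys are illustrative labels, not API model IDs.
MIN_CACHEABLE_TOKENS = {
    "opus-4.7": 4096, "opus-4.6": 4096, "opus-4.5": 4096,
    "haiku-4.5": 4096,
    "sonnet-4.6": 2048,
    "sonnet-4.5": 1024, "sonnet-4": 1024,
    "opus-4.1": 1024, "opus-4": 1024,
}


def write_will_be_skipped(model: str, prefix_tokens: int) -> bool:
    """True when the prefix is too short for the model to write a cache entry."""
    return prefix_tokens < MIN_CACHEABLE_TOKENS[model]
```

A 3,000-token head, for example, would be cacheable under the 1,024 and 2,048 minimums but silently skipped under the 4,096 minimum.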
That’s the whole model. Everything below is about engineering around it.
2. The Playbook — Eight Moves, Ranked by Cost Impact
Each move is expressed as “what to do” + “why it raises hit rate” + “when it misfires.” Apply top-down; downstream moves assume upstream ones are in place.
Move 1 — Cache the tools + system head with one breakpoint (slot 1)
Do: put cache_control on the last system text block.
Because cache covers everything up to and including the breakpoint block, this single mark captures the entire tool list and the full system prompt. On most agents this prefix is 30–80% of every request’s token count and is byte-identical from the second turn onward. Converting it from 1.0× to 0.1× on every turn after the first is the single largest cost win available.
Misfires if: the system prompt or tool list changes between turns. See Move 2 / 3.
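A minimal sketch of Move 1, assuming the system prompt is sent as a list of text blocks and that cache_control takes the `{"type": "ephemeral"}` shape (with an optional `"ttl": "1h"`) described in Anthropic's prompt caching docs. The helper name and sample prompt text are hypothetical; this is not Zapvol's createCachedInstructions.

```python
import copy


def mark_last_system_block(system_blocks: list, ttl: str = "5m") -> list:
    """Return a copy of the system blocks with cache_control on the last one."""
    if not system_blocks:
        return []
    marked = copy.deepcopy(system_blocks)  # never mutate the caller's blocks
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        cache_control["ttl"] = "1h"
    marked[-1]["cache_control"] = cache_control
    return marked


system = [
    {"type": "text", "text": "You are a coding agent."},
    {"type": "text", "text": "Project conventions go here."},
]
request_system = mark_last_system_block(system)
```

Only the final block carries the mark; the tool definitions and earlier system blocks are covered because the cached prefix runs from the start of the request through that block.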
Move 2 — Pin tool definitions byte-identical across turns
Do: compute the filtered tool list once per session; serialize tools with deterministic JSON key order; avoid runtime-varying text inside tool descriptions.
Tools are the first cache level. Any byte change cascades into every cache level. A tool list sorted differently between turns (even alphabetically-by-name vs definition-order) kills all four breakpoints. A tool description that includes “current time is 14:32” blows the cache every turn.
Common hidden non-determinism: iteration order of Map or Set in the tool builder, runtime permission filtering,
localization with Intl.DateTimeFormat.
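One way to catch tool drift is to fingerprint a canonical serialization once per session and compare it on every turn. The sketch below assumes tools are plain dictionaries with a `name` key; the function names are illustrative, and the fingerprint is only a drift detector, since the bytes that matter are whatever the HTTP client actually sends.

```python
import hashlib
import json


def canonical_tools_json(tools: list) -> str:
    """Serialize tool definitions with one pinned list order and sorted keys."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def tools_fingerprint(tools: list) -> str:
    """SHA-256 of the canonical form; compare across turns to detect drift."""
    return hashlib.sha256(canonical_tools_json(tools).encode("utf-8")).hexdigest()
```

Storing the fingerprint at session start and asserting equality before each request turns a silent cache miss into a loud failure during development.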
Move 3 — Deterministic block serialization everywhere
Do: scrub timestamps, request IDs, random nonces, and unstable map-iteration-order from tool definitions, system text, and tool_result output.
Two blocks with identical semantic content but different bytes are different blocks to the cache. Deterministic serialization is the fence around Move 1 that keeps it working session after session.
Common places mutable bytes leak in: stringified error objects, JSON-ified dates, tool wrappers that log the current
timestamp into their output payload, and summariser prompts that include {{now}}.
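Scrubbing can be done at the point where a tool payload becomes a tool_result string. The sketch below is one possible approach; the volatile key names are hypothetical and would need to match whatever your own tool wrappers emit.

```python
import json

# Hypothetical key names; list whatever your own tool wrappers actually emit.
VOLATILE_KEYS = {"timestamp", "request_id", "nonce", "elapsed_ms"}


def scrub(value):
    """Recursively drop volatile keys from a JSON-like value."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def stable_tool_result(payload: dict) -> str:
    """Serialize a tool payload so identical content always yields identical bytes."""
    return json.dumps(scrub(payload), sort_keys=True, separators=(",", ":"))
```

Two runs of the same tool that differ only in their timestamp or request ID then serialize to the same string, so the block hashes identically.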
Move 4 — Anchor the mid-prefix with the compaction boundary (slot 2)
Do: place a second breakpoint at a position that stays byte-identical across turns within the compaction epoch. Use the compaction boundary — the index of the last summarised message, published by the compactor.
Slot 1 alone is not enough on long conversations. Once the window grows past 20 block positions since slot 1 was written, the 20-block lookback can no longer reach slot 1 and turn N+1 starts paying a fresh write for the entire head. Slot 2 closes that gap by giving the lookback a nearer landing pad.
A naive mid-point (Math.floor(length / 2)) drifts by one every turn — each drift is a miss. The compaction boundary,
by contrast, is byte-identical across every step of the same epoch. In Zapvol this is threaded as extraBreakpointAt.
Why this specific position is the uniquely stable mid-prefix anchor — not just “a convenient one” — is spelled out
in §3.
Misfires if: compaction rewrites content before the boundary. See Move 5.
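The contrast between a drifting midpoint and a stable boundary is easy to see numerically. The sketch below assumes each turn appends two messages and that the compactor has published a boundary at index 12; `slot2_index` is an illustrative helper, not Zapvol's applyCacheControl.

```python
from typing import Optional


def slot2_index(message_count: int, compaction_boundary: Optional[int]) -> Optional[int]:
    """Pick the mid-prefix anchor: the compaction boundary when one exists."""
    if compaction_boundary is not None and 0 <= compaction_boundary < message_count:
        return compaction_boundary
    return None  # no epoch-stable position is available yet


counts = range(30, 42, 2)                        # message count over six turns
naive = [n // 2 for n in counts]                 # midpoint anchor
anchored = [slot2_index(n, 12) for n in counts]  # compaction-boundary anchor
```

`naive` comes out as `[15, 16, 17, 18, 19, 20]`, a new position (and a new prefix) every turn, while `anchored` stays at `12` for the whole epoch.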
Move 5 — Compaction is append-only past the immutable head
Do: design the compactor so it never mutates blocks before slot 1’s breakpoint, and only replaces whole-round segments with summary blocks after slot 2’s position.
Compaction that rewrites any earlier block (even a trivial “add a note” update to the system prompt) is a 100% cache-miss operation — no amount of clever breakpoint placement can recover it. The invariant is: head stays; tail may be rewritten. Every such rewrite is one cache epoch (one expensive write, many cheap reads).
Move 6 — Compact at whole-round, whole-block granularity
Do: replace whole rounds with one summary block; truncate whole tool_result blocks to a fixed length every time the same block is serialized; never edit fields inside a block.
Sub-block mutations defeat cache granularity. A compactor that trims one tool_result to drop a noisy middle section
changes that block’s bytes — which changes every later prefix hash — while only saving a few kilobytes. Prefer:
- Replace, not edit. A “cleared” tool result becomes a new stub block; the original block is gone.
- Fixed truncation lengths. 4 KB is a stable number; “truncate when convenient” is not.
- Round boundaries, not message boundaries. A compacted round is one summary block; a half-compacted round is a cache hazard.
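The first two rules can be sketched as pure functions over tool_result blocks. The assumptions here are a string `content` field, a cut point measured in characters rather than bytes, and a hypothetical stub text; none of these are taken from Zapvol's compactor.

```python
TRUNCATE_AT = 4096  # fixed cut point in characters (assumption; the text suggests 4 KB)
CLEARED_STUB = "[tool result cleared by compaction]"  # hypothetical stub text


def truncate_block(block: dict) -> dict:
    """Return a tool_result block cut at a fixed length; the input is never edited."""
    content = block["content"]
    if len(content) <= TRUNCATE_AT:
        return block
    return {**block, "content": content[:TRUNCATE_AT]}


def clear_block(block: dict) -> dict:
    """Replace the whole block with a constant stub that keeps its tool_use_id."""
    return {"type": "tool_result", "tool_use_id": block["tool_use_id"], "content": CLEARED_STUB}
```

Both functions are deterministic and idempotent, so serializing the same block on a later step of the epoch yields the same bytes.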
Move 7 — Mark the last user and last message (slots 3 and 4)
Do: place a breakpoint on the last user message (if distinct from the tail) and another on the last message.
During a tool-heavy inner loop the user question stays fixed while the assistant accumulates tool_use and
tool_result blocks. Without slot 3, a long tool chain exceeds 20 blocks and the lookback loses slot 2. Slot 4 is the
forward-looking slot: writing at the tail now lets the next turn hit this exact prefix.
Misfires if: the last message contains a timestamp or per-request ID. See Move 3.
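A sketch of slot 3 and slot 4 selection, assuming messages carry a `role` of `user`, `assistant`, or `tool` (so tool results are not counted as user messages). The helper name is illustrative, and it returns message indices only; attaching cache_control to the last content block of each chosen message is left to the caller.

```python
def tail_breakpoints(messages: list) -> list:
    """Return message indices for slot 3 (last user) and slot 4 (last message)."""
    if not messages:
        return []
    last = len(messages) - 1
    last_user = next(
        (i for i in range(last, -1, -1) if messages[i]["role"] == "user"), None
    )
    indices = [last]  # slot 4 always sits on the tail
    if last_user is not None and last_user != last:
        indices.insert(0, last_user)  # slot 3 only when distinct from the tail
    return indices
```

In a tool loop of user, assistant, tool, assistant, tool, this yields `[0, 4]`: the fixed user question and the moving tail.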
Move 8 — Pick TTL to match compaction cadence
| TTL | Write multiplier | Use when |
|---|---|---|
| 5 minutes | 1.25× | Inner-loop tool iteration, interactive chat, dev/test |
| 1 hour | 2.00× | Long-running agent jobs, HITL sessions with gaps, batch |
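Decision rule: expected_reuses_within_ttl × 0.9 > write_premium, where 0.9 is the saving per reuse (1.00 uncached minus 0.10 for a hit) and write_premium is the write multiplier minus 1 (0.25 for 5m, 1.00 for 1h). Breakeven: ~0.28 reuses for 5m, ~1.11 for 1h.
Mixed TTLs are allowed, but longer TTLs must appear before shorter ones in the request order. Typical mixed pattern: 1h on slots 1–2 (slow-moving head and mid), 5m on slots 3–4 (fast-moving tail).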
Misfires if: 5m TTL + compaction every 20 minutes → cache always expires before post-compaction reuse → cold write on every post-compaction step.
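The decision rule reduces to a few lines of arithmetic. In this sketch the write premium is taken as the write multiplier minus 1 and the per-reuse saving as 1.00 minus 0.10; the function names are illustrative.

```python
READ_SAVING = 0.9                         # 1.00 uncached minus 0.10 for a hit
WRITE_PREMIUM = {"5m": 0.25, "1h": 1.00}  # write multiplier minus 1.00


def breakeven_reuses(ttl: str) -> float:
    """Reuses within the TTL needed for a write to pay for itself."""
    return WRITE_PREMIUM[ttl] / READ_SAVING


def worth_caching(expected_reuses_within_ttl: float, ttl: str) -> bool:
    """Apply the rule: expected reuses x 0.9 must exceed the write premium."""
    return expected_reuses_within_ttl * READ_SAVING > WRITE_PREMIUM[ttl]
```

A single expected reuse clears the 5-minute bar but not the 1-hour one, which is why the 1-hour TTL only makes sense for prefixes that are read repeatedly across gaps.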
3. Compaction’s Impact on Cache
Of the eight moves above, three (Moves 4, 5, 6) are purely compaction decisions and a fourth (Move 8, TTL) is tied to compaction cadence. This is not accidental: compaction is the only context-engineering operation that rewrites the block stream itself. Every other technique — memory, JIT loading, sub-agent isolation, prompt design — keeps existing blocks byte-stable. When you rewrite blocks, you either extend the cache epoch forward or blow it up.
Four dimensions of compaction design each have a direct cache consequence:
| Compaction dimension | Cache effect |
|---|---|
| Boundary | The one byte-stable position a growing conversation can anchor slot 2 on |
| Scope (where it rewrites) | If it touches the head, all 4 slots invalidate simultaneously on the next turn |
| Granularity (whole- vs sub-block) | Sub-block edits break byte continuity; whole-block replacement preserves it |
| Cadence (how often it fires) | Determines how many cache reads amortise each compaction’s cache write |
The boundary is the anchor nothing else provides
The compactor publishes “the index of the last summarised message” every step of the epoch. That index is the only position in a growing conversation whose prefix is byte-identical across every step of the epoch. It is the only place slot 2 can sit and keep hitting.
Without compaction, slot 2 has nowhere to land. Math.floor(length / 2) drifts every turn. “Every 15th block” drifts
every turn. There is no other epoch-stable position inside the message tail. Compaction is what creates the anchor;
caching is what consumes it.
Touching the head is catastrophic
If compaction ever rewrites blocks before slot 1’s breakpoint — e.g., a feature that “refreshes the system prompt from memory mid-session” — slot 1’s prefix hash changes, and every single later breakpoint invalidates simultaneously on the next turn. There is no partial recovery. The invariant is simple and absolute: head is immutable; tail may be rewritten. Violations cost the full prefix as a cold write on every subsequent turn until the next session boundary.
Granularity decides byte continuity
Within the tail, compaction has two options: replace whole rounds with a summary block, or edit fields inside existing blocks to trim noise. The second is tempting because it is cheaper in summariser tokens, but it destroys the block’s byte hash, which cascades into every later prefix hash. The retained blocks must stay byte-identical across every step of the epoch; the only way to guarantee that is to never edit in place: always replace whole blocks, always truncate to a fixed length, never mutate fields.
Cadence decides amortization
Every compaction event is a new cache epoch: one expensive write against the new prefix, then many cheap reads. Compacting every 3 turns means paying the write tax frequently; compacting at task-scope boundaries means amortizing the same write across 20+ subsequent steps. The ratio of reads to writes within the TTL window is what the cadence decision actually controls. Pick cadences that keep this ratio above ~10 and compaction becomes nearly free; pick them badly and compaction becomes the dominant cost line.
Financial breakeven (from Move 8) is lower than the ~10 operational target: ~0.28 reuses per write for 5m TTL, ~1.11 for 1h. The gap between “breakeven” and “~10” is your margin against variance — writes that never get reused (session ends, user abandons), TTL expiring before the next compaction, prefixes churning from non-determinism. Design for the operational target, not for breakeven; breakeven is a cliff, not a plateau.
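The amortization argument can be made concrete with a small model. The sketch assumes a fixed-size compacted prefix that is written once per epoch and then read on every later step within the TTL; the function name is illustrative.

```python
def epoch_cost_multiplier(reads_per_write: int, write_multiplier: float = 1.25) -> float:
    """Average price of the compacted prefix per request across one epoch."""
    requests = 1 + reads_per_write  # one write, then the reads
    return (write_multiplier + 0.10 * reads_per_write) / requests
```

With 2 reads per write the prefix averages about 0.48× per request; with 10 reads it drops to about 0.20×, which is the region the ~10 operational target aims for.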
The concrete impact
The difference between “compaction designed for cache” and “compaction designed without thinking about cache” is not incremental — it is binary and large. Representative numbers for a 30-step tool-heavy agent on Opus 4.7:
- Compactor rewrites the system prompt mid-session, or edits tool_result bytes in place → hit rate ≈ 0.2
- Compactor is append-only, whole-round, boundary published as slot 2 → hit rate ≈ 0.9
Same task, same tools, same conversation length: roughly 5× total cost difference, driven entirely by how the compactor touches the block stream.
That is why three of the eight moves above (4, 5, 6) are compaction decisions and a fourth (Move 8, TTL) is tied to compaction cadence: there is no other single piece of the agent harness with as much cache leverage per line of code changed. Every compaction decision is also a cache decision, inseparable at the design level.
4. Zapvol’s Implementation
packages/backend/src/agent/model.ts implements the playbook in two functions:
- createCachedInstructions(instructions, model): wraps the system prompt with cache_control, implicitly capturing all preceding tool definitions. This is Move 1.
- applyCacheControl(messages, model, { extraBreakpointAt }): places breakpoints at the last message (slot 4), the last user message if distinct (slot 3), and a mid-prefix anchor (slot 2). Takes extraBreakpointAt from the compactor (Move 4); falls back to Math.floor(length / 2) only when compaction hasn’t fired yet and the conversation exceeds 20 messages.
Unit note: Anthropic’s lookback window is measured in blocks (20 positions per breakpoint), while the Zapvol fallback
threshold above is measured in messages (20 messages). One message typically contains 2–5 blocks (text +
tool_use + tool_result), so the 20-message threshold is a late-firing proxy: by the time it triggers, the tail
has usually grown well past 20 blocks. A tighter proxy would kick in earlier, at the cost of burning a breakpoint on
short conversations where it isn’t yet needed.
The caller (agent-round.ts) threads the compaction epoch’s compactedPrefixEnd into the cache layer so Moves 4, 5,
and 6 compose: the compactor produces a stable boundary, the cache layer anchors slot 2 on that exact position, and the
compacted prefix stays byte-identical across every step of the epoch.
Telemetry: applyCacheControl emits cache.breakpoints_placed with messagesCount, compactedPrefixEnd,
extraBreakpointUsed, placedAt, lastRole. The operations dashboard queries this shape as the primary hit-ratio
source — do not change it without updating the dashboards.
Automatic caching — and why Zapvol does not use it
Anthropic offers an automatic caching mode: setting cache_control at the top level of the request (not on any
specific block) makes the system auto-place a breakpoint on the last cacheable block of every turn, advancing it forward
as the conversation grows. Each new turn writes the new tail; previous turns read from cache through the 20-position
lookback.
This is essentially slot 4 done by the API. For chat-style agents (1–2 new blocks per turn) it is sufficient on its own — the lookback can always reach the previous turn’s write, and the head prefix (tools + system) is small enough that full re-reads remain cheap.
For tool-heavy agents it is not sufficient. A single ReAct step can append 5–10 blocks (tool_use + tool_result + assistant text). After two or three such steps the 20-position window can no longer reach the head, and every request pays a cold write for the entire tools + system prefix. The fix requires explicit slot 1 (head) and slot 2 (mid-prefix anchor at the compaction boundary) — and at that point slot 4 is also trivially explicit alongside them.
Zapvol therefore uses all-explicit four-slot caching via createCachedInstructions + applyCacheControl and does
not set top-level cache_control. All four slots are budgeted by us, not the API.
Edge cases to keep in mind if top-level automatic caching is ever enabled:
- Automatic caching consumes one of the 4 slots.
- If the last block already has an explicit cache_control with the same TTL, automatic is a no-op.
- If the last block has an explicit cache_control with a different TTL, the API returns a 400.
- If 4 explicit breakpoints are already present, automatic has no slot and the API returns a 400.
- If the last block is not an eligible cache target, automatic walks backward to the nearest eligible block; if none is found, automatic is silently skipped.
5. Beyond the Single-Agent Happy Path
The playbook and Zapvol implementation above assume a single-agent, single-model, prompt-stable main path. Real deployments hit four branches where the cache story shifts.
Sub-agents each own their cache chain
Registered sub-agents (browser, write-todos, task children) run as independent API request sequences with their
own applyCacheControl stack. A sub-agent’s conversation does not inherit the parent’s message history (CLAUDE.md:
“sub-agent isolation”), so parent and child have separate prefix hashes — cache never flows between them.
Consequences:
- Each sub-agent pays its own slot-1 cold write on first invocation in a session; subsequent invocations of the same sub-agent within TTL can read.
- Segment the hit-ratio dashboard by agent type — mixing parent and sub-agent reads in one average hides which chain is actually healthy.
- Don’t try to design shared cache across parent↔child. Separate requests, separate hashes, period.
Model switching mid-session resets the cache
Anthropic’s cache is keyed by (model, prefix). Switching the model mid-session (BYOK swap, user-requested downgrade, A/B test) means:
- First request after switch: cache_creation_input_tokens > 0, cache_read_input_tokens = 0. Expected, not a regression.
- Model minimums differ. Switching from Opus (4,096) to Sonnet 4.6 (2,048) can make some previously uncacheable prompts newly cacheable; the reverse can happen too.
- Cold-start assertions (§7) need to fire on every model boundary, not just session start. If your test suite switches models, wire the “turn 1 behavior” expectation accordingly.
Dev iteration churns the cache
During active development of system prompts, tool descriptions, or compaction formats, every edit = new prefix hash = cold write on every request. Costs:
- 1.25× on the churned prefix, hit ratio ≈ 0 until the prompt stabilizes.
- Recommendation: during heavy iteration, either accept the dev-time cost as cheap signal, or temporarily stop wrapping with createCachedInstructions under a dev flag. Re-enable once prompts stabilize and verify turn-2 reads appear (§7 cold-start check).
- Do not leave iteration-time caching disabled in prod by accident; gate it on NODE_ENV or an explicit flag.
When not to cache
Some workloads cost more with caching than without. Skip caching for:
- One-shot or unique-per-request prompts (classification over unique user text, one-off analyses): every request is a fresh write, reads never come — pure 1.25× tax.
- High-variance per-request content baked into the head (timestamps, per-request IDs, user-specific data in tool definitions): write is cold even though intent is stable. Fix the variance first (Move 3) before enabling cache, not after.
- Sub-minimum prompts (below the model’s token floor, see the §1 table): writes silently skipped; cache_control is a no-op. Pad only if the prompt is reused across requests.
- Fan-out batch processing (10k independent requests in parallel, none reuses): cache is dead weight.
Gate rule (from Move 8): expected_hits_per_write × 0.9 > write_premium. Below breakeven (≈0.28 reuses for 5m, ≈1.11
for 1h), don’t cache.
6. Failure Modes (Reverse Playbook)
Every cache-miss regression traces back to violating one move. Use this as a diagnostic index:
| Symptom | Violated move | Fix |
|---|---|---|
| Zero cache_read_input_tokens on turn 2 | Move 1 or 3 | Verify breakpoint present; audit tool + system bytes |
| Hit rate collapses once conversation grows past 20 blocks | Move 4 | Anchor slot 2 on the compaction boundary so lookback can still reach the head |
| Hit rate collapses immediately after compaction | Move 5 or 6 | Check that compaction is append-only + whole-block |
| Hit rate 0.2-0.4 on every turn | Move 2 or 3 | Hunt non-determinism in tool / message serialization |
| Turn 1 cache_creation_input_tokens = 0 | Below-min | Pad system prompt or accept the cost |
| Post-compaction step always a full miss | Move 8 | TTL too short for compaction cadence |
| Hit rate collapses after user sends mid-task text | Thinking | Route mid-task input through tool_result if possible |
| Hit rate varies sharply by tool_choice | Move 2 | Pin tool_choice for the session |
7. Measurement
Cache hit ratio — the headline metric
hit_ratio = cache_read_input_tokens
/ (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)
Targets on a well-behaved long task: > 0.85 after turn 3, > 0.90 on sustained multi-turn sessions. Below 0.5 is a bug, not a tradeoff.
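The headline metric can be computed straight from the usage object of each response. The sketch below assumes usage has been converted to a plain dictionary with the three token counters named in the formula; the function name is illustrative.

```python
def hit_ratio(usage: dict) -> float:
    """Share of input tokens served from cache for one response."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0
```

For a session-level figure, sum the three counters across all responses first and apply the same ratio once, rather than averaging per-turn ratios.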
The cache.breakpoints_placed event
Dashboards worth building:
- Breakpoint count histogram: should cluster at 3 or 4. A mode at 1 means only the system prompt was cached.
- extraBreakpointUsed ratio by task age: should rise toward 1 as tasks cross the compaction threshold.
- Adjacent placedAt gap distribution: adjacent marks should stay under 20 blocks.
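The gap check in the last dashboard item can also run as an assertion over a single event. This sketch assumes placedAt has already been converted to block positions (the event may record message indices, in which case it needs mapping first); the function name is illustrative.

```python
def oversized_gaps(placed_at: list, limit: int = 20) -> list:
    """Return adjacent breakpoint pairs whose distance is not under the limit."""
    ordered = sorted(placed_at)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= limit]
```

An empty result means every adjacent pair of marks is within lookback reach; any returned pair identifies the stretch where an extra anchor is needed.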
Cold-start verification
# Turn 1
assert response.usage.cache_creation_input_tokens > 0 # Move 1 works
assert response.usage.cache_read_input_tokens == 0
# Turn 2
assert response.usage.cache_read_input_tokens > 0 # Move 2/3 holding
Turn 2 with zero reads means the prefix changed between turns — walk the failure-mode table above.
8. Pre-Ship Checklist
- Move 1: cache_control on last system block; slot 1 caches tools + system.
- Move 2: tool list computed once per session; JSON key order pinned.
- Move 3: no timestamps, request IDs, or unstable map iteration in any cached block.
- Move 4: slot 2 anchored on the compaction boundary (or length-midpoint fallback).
- Move 5: compactor does not mutate blocks before slot 1.
- Move 6: compaction replaces whole rounds / whole blocks; truncation length is fixed.
- Move 7: last user message and last message each carry a breakpoint.
- Move 8: TTL matches compaction cadence (5m for tight loops, 1h for long-running).
- Telemetry: cache.breakpoints_placed emitted; cache_read_input_tokens captured; hit ratio dashboarded.
- Turns 1 and 2 verified: writes on turn 1, reads on turn 2.
Related Reading
- Compaction: The summarisation layer whose stable round boundary is what Move 4 depends on.
- Context Management: The attention budget; cache hit ratio is its cost dimension.
- Memory Design: What lives outside the window does not enter the cache epoch and cannot invalidate it.
- Operations / Dashboards: Where cache.breakpoints_placed lands.
Sources
- Prompt caching — Anthropic, Claude API docs. Primary reference for the block model, 20-block lookback, invalidation rules, and pricing.
- Effective Context Engineering for AI Agents. Anthropic, 2025. Cache-aware compaction guidance.
- packages/backend/src/agent/model.ts: Zapvol’s applyCacheControl / createCachedInstructions.
- packages/backend/src/agent/agent-round.ts: the caller threading the compaction boundary into the cache layer.