Cache Point Design

The Goal Stated Plainly

One number decides how much you pay for a long-running agent: the cache hit ratio. At Anthropic’s pricing a cache hit costs 10% of an uncached read, a cache write costs 125% of an uncached read (5-minute TTL) or 200% (1-hour TTL). Every token that flows through the agent lives in one of three buckets:

cost per token  =  0.10 × p(hit)
                +  1.25 × p(write, 5m)     // or 2.00 for 1h TTL
                +  1.00 × p(uncached)

A 30-step ReAct task with hit ratio 0.90 costs roughly one-seventh of the same task run at hit ratio 0.0. The entire engineering goal of this page is moving p(hit) as close to 1 as possible — and staying there as the conversation grows and compaction fires.

This page is the design playbook. Section 1 is the mental model (the facts, aligned with Anthropic’s docs). Section 2 is the eight-move playbook for placing the 4 breakpoints. Section 3 explains why three of those moves (4, 5, 6) are compaction decisions — the cache consequences of compaction design. Section 4 shows Zapvol’s implementation. Section 5 covers scenarios beyond the single-agent happy path (sub-agents, model switching, dev iteration, when not to cache). Sections 6–8 are diagnostics (failure modes, measurement, checklist).

1. The Mental Model in One Page

A cache hit requires two continuities to hold for the same breakpoint:

Byte continuity — every byte of the prefix (from the start of the request through the breakpoint) is byte-identical to a prior cached entry.
Spatial continuity — that prior entry was written within 20 block positions of the current breakpoint (the window counts the breakpoint itself as the first of the 20). Beyond that, the search terminates and earlier writes are invisible.

Break either and you pay a fresh write. Picture the 4 breakpoints as 4 anchors on a chain: byte stability is how firmly each anchor grips the rock; the 20-position reach is the maximum chain length between adjacent anchors. Every move in the playbook below maintains one or both continuities.

What gets hashed — the block model

Anthropic serializes a request into a stream of blocks:

tool[0], tool[1], …  │  sys[0], sys[1], …  │  msg[0].content[0], …, msg[N].content[L]

A block is one tool definition, one system text block, or one content block inside a message (text, tool_use, tool_result, image, document, thinking). Each block has a position in this linear stream.

cache_control on a block means: “at this position, hash every byte up to and including this block; on a matching hash read from cache; otherwise walk back up to 19 earlier positions looking for a prior write; in either case, ensure a new entry exists at this breakpoint for future requests (silently skipped when the prefix is below the model minimum).” You get at most 4 breakpoints per request. You cannot cache partial blocks and cannot force writes at intermediate positions. The hash is byte-sensitive — JSON with identical semantic content but different key ordering hashes differently, which is why tool-builder runtimes that randomize map iteration order (Swift, Go) silently break caches. Anthropic does not canonicalize JSON for you.

How matching works

At each breakpoint, hash the full prefix (from request start through the breakpoint). Look it up.
If miss, walk backward up to 20 positions (counting the breakpoint itself as the first), checking at each whether a prior request’s cache_control successfully wrote an entry at that exact position. If no match is found in the window, the search stops and the entire prefix is processed fresh.
First match wins. Blocks after the matched position are processed fresh; the current breakpoint writes a new entry under its own prefix hash.

The cache indexes prefixes, not individual blocks. A cache entry covers every byte from the start of the request up to and including the marked block — not the marked block alone. That is why any byte change in any earlier block invalidates every later breakpoint: they are all hashes of overlapping prefixes.

Anthropic’s exact phrasing: “It is looking for prior writes, not for stable content.” Translation: the lookback only discovers positions where a prior request placed cache_control and succeeded in writing. Blocks that have merely been present in past requests but were never marked are invisible — they simply do not exist in the cache’s index.

Concrete example (mirrors Anthropic’s docs):

Turn 1 — 10 blocks, breakpoint on block 10 → writes entry E1 = hash(prefix 0..10).
Turn 2 — 15 blocks, breakpoint on block 15 → hash(0..15) misses; the lookback walks backward and finds E1 at position 10 (within the window). Blocks 11–15 are processed fresh; writes E2 = hash(0..15).
Turn 3 — 35 blocks, breakpoint on block 35 → hash(0..35) misses; the lookback checks 20 positions (blocks 35 through 16) and finds nothing. E2 at position 15 sits one position outside the window → full MISS, full rewrite.

The turn-3 MISS is the motivating failure the compaction boundary (slot 2) is designed to prevent: a second explicit breakpoint placed closer to the tail keeps the lookback reaching a cached entry on every turn.

What invalidates what

Because hashes cover all bytes up to the breakpoint, any byte change in any earlier block invalidates every later breakpoint. The cascade:

Change	Invalidates
Tool definition bytes (any)	tools + system + messages
System text bytes (any)	system + messages
Web search toggle	system + messages
Citations toggle	system + messages
Speed setting (`speed: "fast"` vs standard)	system + messages
`tool_choice` parameter	messages only
Extended-thinking parameters (enable/budget)	messages only
Images added / removed anywhere	messages only
Non-tool-result user content (with thinking)	strips prior thinking blocks

Minimum cacheable prefix

If the prefix at your breakpoint is below the model’s minimum, the cache write is silently skipped:

Model	Minimum (tokens)
Claude Opus 4.7 / 4.6 / 4.5	4,096
Claude Haiku 4.5	4,096
Claude Sonnet 4.6	2,048
Claude Sonnet 4.5 / 4	1,024
Claude Opus 4.1 / 4	1,024

That’s the whole model. Everything below is about engineering around it.

2. The Playbook — Eight Moves, Ranked by Cost Impact

Each move is expressed as “what to do” + “why it raises hit rate” + “when it misfires.” Apply top-down; downstream moves assume upstream ones are in place.

Move 1 — Cache the tools + system head with one breakpoint (slot 1)

Do: put cache_control on the last system text block.

Because cache covers everything up to and including the breakpoint block, this single mark captures the entire tool list and the full system prompt. On most agents this prefix is 30–80% of every request’s token count and is byte-identical from the second turn onward. Converting it from 1.0× to 0.1× on every turn after the first is the single largest cost win available.

Misfires if: the system prompt or tool list changes between turns. See Move 2 / 3.

Move 2 — Pin tool definitions byte-identical across turns

Do: compute the filtered tool list once per session; serialize tools with deterministic JSON key order; avoid runtime-varying text inside tool descriptions.

Tools are the first cache level. Any byte change cascades into every cache level. A tool list sorted differently between turns (even alphabetically-by-name vs definition-order) kills all four breakpoints. A tool description that includes “current time is 14:32” blows the cache every turn.

Common hidden non-determinism: iteration order of Map or Set in the tool builder, runtime permission filtering, localization with Intl.DateTimeFormat.

Move 3 — Deterministic block serialization everywhere

Do: scrub timestamps, request IDs, random nonces, and unstable map-iteration-order from tool definitions, system text, and tool_result output.

Two blocks with identical semantic content but different bytes are different blocks to the cache. Deterministic serialization is the fence around Move 1 that keeps it working session after session.

Common places mutable bytes leak in: stringified error objects, JSON-ified dates, tool wrappers that log the current timestamp into their output payload, and summariser prompts that include {{now}}.

Move 4 — Anchor the mid-prefix with the compaction boundary (slot 2)

Do: place a second breakpoint at a position that stays byte-identical across turns within the compaction epoch. Use the compaction boundary — the index of the last summarised message, published by the compactor.

Slot 1 alone is not enough on long conversations. Once the window grows past 20 block positions since slot 1 was written, the 20-block lookback can no longer reach slot 1 and turn N+1 starts paying a fresh write for the entire head. Slot 2 closes that gap by giving the lookback a nearer landing pad.

A naive mid-point (Math.floor(length / 2)) drifts by one every turn — each drift is a miss. The compaction boundary, by contrast, is byte-identical across every step of the same epoch. In Zapvol this is threaded as extraBreakpointAt. Why this specific position is the uniquely stable mid-prefix anchor — not just “a convenient one” — is spelled out in §3.

Misfires if: compaction rewrites content before the boundary. See Move 5.

Move 5 — Compaction is append-only past the immutable head

Do: design the compactor so it never mutates blocks before slot 1’s breakpoint, and only replaces whole-round segments with summary blocks after slot 2’s position.

Compaction that rewrites any earlier block (even a trivial “add a note” update to the system prompt) is a 100% cache-miss operation — no amount of clever breakpoint placement can recover it. The invariant is: head stays; tail may be rewritten. Every such rewrite is one cache epoch (one expensive write, many cheap reads).

Move 6 — Compact at whole-round, whole-block granularity

Do: replace whole rounds with one summary block; truncate whole toolresult blocks to a _fixed length every time the same block is serialized; never edit fields inside a block.

Sub-block mutations defeat cache granularity. A compactor that trims one tool_result to drop a noisy middle section changes that block’s bytes — which changes every later prefix hash — while only saving a few kilobytes. Prefer:

Replace, not edit. A “cleared” tool result becomes a new stub block; the original block is gone.
Fixed truncation lengths. 4 KB is a stable number; “truncate when convenient” is not.
Round boundaries, not message boundaries. A compacted round is one summary block; a half-compacted round is a cache hazard.

Move 7 — Mark the last user and last message (slots 3 and 4)

Do: place a breakpoint on the last user message (if distinct from the tail) and another on the last message.

During a tool-heavy inner loop the user question stays fixed while the assistant accumulates tool_use and tool_result blocks. Without slot 3, a long tool chain exceeds 20 blocks and the lookback loses slot 2. Slot 4 is the forward-looking slot: writing at the tail now lets the next turn hit this exact prefix.

Misfires if: the last message contains a timestamp or per-request ID. See Move 3.

Move 8 — Pick TTL to match compaction cadence

TTL	Write multiplier	Use when
5 minutes	1.25×	Inner-loop tool iteration, interactive chat, dev/test
1 hour	2.00×	Long-running agent jobs, HITL sessions with gaps, batch

Decision rule: expected_reuses_within_ttl × 0.9 > write_premium. Breakeven: ~0.28 reuses for 5m, ~1.11 for 1h.

Mixed TTLs allowed, but longer TTLs must appear before shorter ones in the request order. Typical mixed pattern: 1h on slots 1–2 (slow-moving head and mid), 5m on slots 3–4 (fast-moving tail).

Misfires if: 5m TTL + compaction every 20 minutes → cache always expires before post-compaction reuse → cold write on every post-compaction step.

3. Compaction’s Impact on Cache

Of the eight moves above, three (Moves 4, 5, 6) are purely compaction decisions and a fourth (Move 8, TTL) is tied to compaction cadence. This is not accidental: compaction is the only context-engineering operation that rewrites the block stream itself. Every other technique — memory, JIT loading, sub-agent isolation, prompt design — keeps existing blocks byte-stable. When you rewrite blocks, you either extend the cache epoch forward or blow it up.

Four dimensions of compaction design each have a direct cache consequence:

Compaction dimension	Cache effect
Boundary	The one byte-stable position a growing conversation can anchor slot 2 on
Scope (where it rewrites)	If it touches the head, all 4 slots invalidate simultaneously on the next turn
Granularity (whole- vs sub-block)	Sub-block edits break byte continuity; whole-block replacement preserves it
Cadence (how often it fires)	Determines how many cache reads amortise each compaction’s cache write

The boundary is the anchor nothing else provides

The compactor publishes “the index of the last summarised message” every step of the epoch. That index is the only position in a growing conversation whose prefix is byte-identical across every step of the epoch. It is the only place slot 2 can sit and keep hitting.

Without compaction, slot 2 has nowhere to land. Math.floor(length / 2) drifts every turn. “Every 15th block” drifts every turn. There is no other epoch-stable position inside the message tail. Compaction is what creates the anchor; caching is what consumes it.

Touching the head is catastrophic

If compaction ever rewrites blocks before slot 1’s breakpoint — e.g., a feature that “refreshes the system prompt from memory mid-session” — slot 1’s prefix hash changes, and every single later breakpoint invalidates simultaneously on the next turn. There is no partial recovery. The invariant is simple and absolute: head is immutable; tail may be rewritten. Violations cost the full prefix as a cold write on every subsequent turn until the next session boundary.

Granularity decides byte continuity

Within the tail, compaction has two temptations — replace whole rounds with a summary block, or edit fields inside existing blocks to trim noise. The second is cheaper in summariser tokens but destroys the block’s byte hash, which cascades into every later prefix hash. The retained blocks must stay byte-identical across every step of the epoch; the only way to guarantee that is to never edit in place — always replace whole blocks, always truncate to a fixed length, never mutate fields.

Cadence decides amortization

Every compaction event is a new cache epoch: one expensive write against the new prefix, then many cheap reads. Compacting every 3 turns means paying the write tax frequently; compacting at task-scope boundaries means amortizing the same write across 20+ subsequent steps. The ratio of reads to writes within the TTL window is what the cadence decision actually controls. Pick cadences that keep this ratio above ~10 and compaction becomes nearly free; pick them badly and compaction becomes the dominant cost line.

Financial breakeven (from Move 8) is lower than the ~10 operational target: ~0.28 reuses per write for 5m TTL, ~1.11 for 1h. The gap between “breakeven” and “~10” is your margin against variance — writes that never get reused (session ends, user abandons), TTL expiring before the next compaction, prefixes churning from non-determinism. Design for the operational target, not for breakeven; breakeven is a cliff, not a plateau.

The concrete impact

The difference between “compaction designed for cache” and “compaction designed without thinking about cache” is not incremental — it is binary and large. Representative numbers for a 30-step tool-heavy agent on Opus 4.7:

Compactor rewrites the system prompt mid-session, or edits tool_result bytes in place → hit rate ≈ 0.2
Compactor is append-only, whole-round, boundary published as slot 2 → hit rate ≈ 0.9
Same task, same tools, same conversation length — roughly 5× total cost difference, driven entirely by how the compactor touches the block stream.

That is why three of the eight moves above (4, 5, 6) are compaction decisions and a fourth (Move 8, TTL) is tied to compaction cadence: there is no other single piece of the agent harness with as much cache leverage per line of code changed. Every compaction decision is also a cache decision, inseparable at the design level.

4. Zapvol’s Implementation

packages/backend/src/agent/model.ts implements the playbook in two functions:

createCachedInstructions(instructions, model) — wraps the system prompt with cache_control, implicitly capturing all preceding tool definitions. This is Move 1.
applyCacheControl(messages, model, { extraBreakpointAt }) — places breakpoints at the last message (slot 4), the last user message if distinct (slot 3), and a mid-prefix anchor (slot 2). Takes extraBreakpointAt from the compactor (Move 4); falls back to Math.floor(length / 2) only when compaction hasn’t fired yet and the conversation exceeds 20 messages.

Unit note: Anthropic’s lookback window is measured in blocks (20 blocks per breakpoint); the Zapvol fallback threshold above is measured in messages (20 messages). One message typically contains 2–5 blocks (text + tool_use + tool_result), so the 20-message threshold is a conservative proxy — by the time it fires, the tail has usually grown well past 20 blocks. A tighter proxy would kick in earlier at the cost of burning a breakpoint on short conversations where it isn’t yet needed.

The caller (agent-round.ts) threads the compaction epoch’s compactedPrefixEnd into the cache layer so Moves 4, 5, and 6 compose: the compactor produces a stable boundary, the cache layer anchors slot 2 on that exact position, and the compacted prefix stays byte-identical across every step of the epoch.

Telemetry: applyCacheControl emits cache.breakpoints_placed with messagesCount, compactedPrefixEnd, extraBreakpointUsed, placedAt, lastRole. The operations dashboard queries this shape as the primary hit-ratio source — do not change it without updating the dashboards.

Automatic caching — and why Zapvol does not use it

Anthropic offers an automatic caching mode: setting cache_control at the top level of the request (not on any specific block) makes the system auto-place a breakpoint on the last cacheable block of every turn, advancing it forward as the conversation grows. Each new turn writes the new tail; previous turns read from cache through the 20-position lookback.

This is essentially slot 4 done by the API. For chat-style agents (1–2 new blocks per turn) it is sufficient on its own — the lookback can always reach the previous turn’s write, and the head prefix (tools + system) is small enough that full re-reads remain cheap.

For tool-heavy agents it is not sufficient. A single ReAct step can append 5–10 blocks (tool_use + tool_result + assistant text). After two or three such steps the 20-position window can no longer reach the head, and every request pays a cold write for the entire tools + system prefix. The fix requires explicit slot 1 (head) and slot 2 (mid-prefix anchor at the compaction boundary) — and at that point slot 4 is also trivially explicit alongside them.

Zapvol therefore uses all-explicit four-slot caching via createCachedInstructions + applyCacheControl and does not set top-level cache_control. All four slots are budgeted by us, not the API.

Edge cases to keep in mind if top-level automatic caching is ever enabled:

Automatic caching consumes one of the 4 slots.
If the last block already has an explicit cache_control with the same TTL, automatic is a no-op.
If the last block has an explicit cache_control with a different TTL, the API returns a 400.
If 4 explicit breakpoints are already present, automatic has no slot and the API returns a 400.
If the last block is not an eligible cache target, automatic walks backward to the nearest eligible block; if none is found, automatic is silently skipped.

5. Beyond the Single-Agent Happy Path

The playbook and Zapvol implementation above assume a single-agent, single-model, prompt-stable main path. Real deployments hit four branches where the cache story shifts.

Sub-agents each own their cache chain

Registered sub-agents (browser, write-todos, task children) run as independent API request sequences with their own applyCacheControl stack. A sub-agent’s conversation does not inherit the parent’s message history (CLAUDE.md: “sub-agent isolation”), so parent and child have separate prefix hashes — cache never flows between them. Consequences:

Each sub-agent pays its own slot-1 cold write on first invocation in a session; subsequent invocations of the same sub-agent within TTL can read.
Segment the hit-ratio dashboard by agent type — mixing parent and sub-agent reads in one average hides which chain is actually healthy.
Don’t try to design shared cache across parent↔child. Separate requests, separate hashes, period.

Model switching mid-session resets the cache

Anthropic’s cache is keyed by (model, prefix). Switching the model mid-session (BYOK swap, user-requested downgrade, A/B test) means:

First request after switch: cache_creation_input_tokens > 0, cache_read_input_tokens = 0. Expected — not a regression.
Model minimums differ. Switching from Opus (4,096) to Sonnet 4.6 (2,048) can make some previously uncacheable prompts newly cacheable; the reverse can happen too.
Cold-start assertions (§7) need to fire on every model boundary, not just session start. If your test suite switches models, wire the “turn 1 behavior” expectation accordingly.

Dev iteration churns the cache

During active development of system prompts, tool descriptions, or compaction formats, every edit = new prefix hash = cold write on every request. Costs:

1.25× on the churned prefix, hit ratio ≈ 0 until the prompt stabilizes.
Recommendation: during heavy iteration, either accept the dev-time cost as cheap signal, or temporarily stop wrapping with createCachedInstructions under a dev flag. Re-enable once prompts stabilize and verify turn-2 reads appear (§7 cold-start check).
Do not leave iteration-time caching disabled in prod by accident — gate it on NODE_ENV or an explicit flag.

When not to cache

Some workloads cost more with caching than without. Skip caching for:

One-shot or unique-per-request prompts (classification over unique user text, one-off analyses): every request is a fresh write, reads never come — pure 1.25× tax.
High-variance per-request content baked into the head (timestamps, per-request IDs, user-specific data in tool definitions): write is cold even though intent is stable. Fix the variance first (Move 3) before enabling cache, not after.
Sub-minimum prompts (below the model’s token floor — §1 table): writes silently skipped; cache_control is a no-op. Pad only if the prompt is reused across requests.
Fan-out batch processing (10k independent requests in parallel, none reuses): cache is dead weight.

Gate rule (from Move 8): expected_hits_per_write × 0.9 > write_premium. Below breakeven (≈0.28 reuses for 5m, ≈1.11 for 1h), don’t cache.

6. Failure Modes (Reverse Playbook)

Every cache-miss regression traces back to violating one move. Use this as a diagnostic index:

Symptom	Violated move	Fix
Zero `cache_read_input_tokens` on turn 2	Move 1 or 3	Verify breakpoint present; audit tool + system bytes
Hit rate collapses once conversation grows past 20 blocks	Move 4	Anchor slot 2 on the compaction boundary so lookback can still reach the head
Hit rate collapses immediately after compaction	Move 5 or 6	Check that compaction is append-only + whole-block
Hit rate 0.2-0.4 on every turn	Move 2 or 3	Hunt non-determinism in tool / message serialization
Turn 1 `cache_creation_input_tokens = 0`	Below-min	Pad system prompt or accept the cost
Post-compaction step always a full miss	Move 8	TTL too short for compaction cadence
Hit rate collapses after user sends mid-task text	Thinking	Route mid-task input through `tool_result` if possible
Hit rate varies sharply by `tool_choice`	Move 2	Pin `tool_choice` for the session

7. Measurement

Cache hit ratio — the headline metric

hit_ratio = cache_read_input_tokens
          / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)

Targets on a well-behaved long task: > 0.85 after turn 3, > 0.90 on sustained multi-turn sessions. Below 0.5 is a bug, not a tradeoff.

The `cache.breakpoints_placed` event

Dashboards worth building:

Breakpoint count histogram — should cluster at 3 or 4. A mode at 1 means only the system prompt was cached.
extraBreakpointUsed ratio by task age — should rise toward 1 as tasks cross the compaction threshold.
Adjacent placedAt gap distribution — adjacent marks should stay under 20 blocks.

Cold-start verification

# Turn 1
assert response.usage.cache_creation_input_tokens > 0    # Move 1 works
assert response.usage.cache_read_input_tokens == 0
# Turn 2
assert response.usage.cache_read_input_tokens > 0        # Move 2/3 holding

Turn 2 with zero reads means the prefix changed between turns — walk the failure-mode table above.

8. Pre-Ship Checklist

Move 1 — cache_control on last system block; slot 1 caches tools + system.
Move 2 — tool list computed once per session; JSON key order pinned.
Move 3 — no timestamps, request IDs, or unstable map iteration in any cached block.
Move 4 — slot 2 anchored on the compaction boundary (or length-midpoint fallback).
Move 5 — compactor does not mutate blocks before slot 1.
Move 6 — compaction replaces whole rounds / whole blocks; truncation length is fixed.
Move 7 — last user message and last message each carry a breakpoint.
Move 8 — TTL matches compaction cadence (5m for tight loops, 1h for long-running).
Telemetry — cache.breakpoints_placed emitted; cache_read_input_tokens captured; hit ratio dashboarded.
Turns 1 and 2 verified — writes on turn 1, reads on turn 2.

← Compaction — The summarisation layer whose stable round boundary is what Move 4 depends on.
Context Management — The attention budget; cache hit ratio is its cost dimension.
Memory Design — What lives outside the window does not enter the cache epoch and cannot invalidate it.
Operations / Dashboards — Where cache.breakpoints_placed lands.

Sources

Prompt caching — Anthropic, Claude API docs. Primary reference for the block model, 20-block lookback, invalidation rules, and pricing.
Effective Context Engineering for AI Agents — Anthropic, 2026. Cache-aware compaction guidance.
packages/backend/src/agent/model.ts — Zapvol’s applyCacheControl / createCachedInstructions.
packages/backend/src/agent/agent-round.ts — the caller threading the compaction boundary into the cache layer.