Context Engineering
The shift from prompt engineering to context engineering — why an agent's attention budget is a finite resource and how the three pillars (prompt design, memory design, context management) compose
From Prompt Engineering to Context Engineering
Prompt engineering is the craft of writing a single high-quality instruction: picking the right words, the right structure, the right examples to steer a single model call. It remains useful — but it is no longer sufficient.
Context engineering is the broader discipline. It asks a different question:
“What configuration of context is most likely to generate the model’s desired behavior?” — Anthropic, Effective Context Engineering for AI Agents
An agent is not a single call. It is a loop: the model reads context, calls tools, receives results, reasons, and continues — often for dozens or hundreds of turns. Every turn, the context window is re-read in full. What lives in that window, how it got there, when it leaves, and what replaces it — these choices compound across the entire run. Prompt engineering optimizes one call; context engineering optimizes the entire information lifecycle.
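The loop can be sketched in a few lines. This is a minimal illustration, not any particular SDK: `call_model` and `run_tool` are hypothetical stand-ins for a real model API and tool executor.

```python
# Minimal agent-loop sketch. `call_model` and `run_tool` are hypothetical
# placeholders for a real model API and tool executor.

def call_model(context):
    # Placeholder: a real implementation would send the full context to an
    # LLM and get back either a tool request or a final answer.
    if any(msg["role"] == "tool" for msg in context):
        return {"type": "answer", "text": "done"}
    return {"type": "tool_call", "name": "search", "args": {"q": "docs"}}

def run_tool(name, args):
    # Placeholder tool executor.
    return f"results for {args['q']}"

def agent_loop(user_goal, max_turns=10):
    context = [{"role": "user", "content": user_goal}]
    for _ in range(max_turns):
        step = call_model(context)  # the model re-reads the *entire* window
        if step["type"] == "answer":
            return step["text"]
        result = run_tool(step["name"], step["args"])
        # Every tool result lands in the window and competes for attention
        # on all subsequent turns -- this is where context strategy compounds.
        context.append({"role": "tool", "content": result})
    return None
```

The point of the sketch is the shape, not the internals: whatever `context` accumulates on one turn is re-read on every later turn, so each append is a decision with compounding cost.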
The shift matters because the failure modes are different. A bad prompt produces a bad answer. A bad context strategy produces an agent that starts coherent and decays — losing track of earlier decisions, contradicting itself, forgetting the user’s goal ten turns ago. The second failure is invisible until the agent crosses some threshold. By then, the conversation is unrecoverable.
The Attention Budget
Language models have finite working memory. Research on long-context benchmarks — “needle in a haystack” retrieval, multi-document QA, and agentic coding evaluations — reveals a phenomenon Anthropic calls context rot:
As the number of tokens in the context window increases, the model’s ability to accurately recall specific information decreases.
This is not a hard cliff at some specific token count. It is a gradient: the model’s effective precision degrades gradually long before the context is technically “full”. The implication is not “use smaller contexts” but:
Find the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome.
Every token in the window competes for attention with every other token. Low-value tokens — stale tool output, repeated boilerplate, speculation that went nowhere — do not merely occupy space. They actively dilute the model’s focus on high-value tokens. Context engineering is the discipline of maximizing signal density.
Two consequences follow:
- More context is not always better. Pre-loading every “potentially useful” document can measurably hurt performance. A 100k-token context stuffed with weak signals often loses to a 10k-token context of carefully chosen ones.
- Context has a cost. Tokens are billed. Caching amortizes that cost, but only for the stable prefix (see prompt design for why this shapes prompt layout).
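The "smallest set of high-signal tokens" framing can be sketched as a greedy fill under a token budget. Everything here is illustrative: the relevance scores are assumed to come from an upstream retriever, and token counting is a crude whitespace approximation.

```python
def assemble_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Pick the highest-signal chunks that fit within the budget.

    `chunks` is a list of (score, text) pairs; scoring (embedding similarity,
    recency, etc.) is assumed to happen upstream. Token counting here is a
    crude whitespace approximation -- a real system would use a tokenizer.
    """
    chosen, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen

chunks = [
    (0.9, "user goal: migrate the billing service"),
    (0.2, "boilerplate header repeated in every file"),
    (0.7, "error log from the failing deploy"),
]
# Under a tight budget, only the two high-signal chunks survive;
# the boilerplate is excluded rather than diluting attention.
picked = assemble_context(chunks, budget_tokens=12)
```

The inversion matters: instead of asking "what might be useful to include?", the budget forces the question "what earns its place?"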
Three Pillars
Context engineering is a large design space. It is easier to think of it as three pillars that each answer a different question:
| Pillar | Question | Concerns |
|---|---|---|
| Prompt Design | What is the static skeleton the agent always sees? | Altitude, structure, examples, tool definitions, caching |
| Memory Design | What survives outside the window, to be retrieved when relevant? | Memory taxonomy, write/recall policy, structured notes |
| Context Management | How does the runtime prune, replace, and reload information over a loop? | Budgets, just-in-time, sub-agent isolation, checkpointing |
One operation inside the third pillar is substantial enough to warrant its own chapter: Compaction — what fires when the runtime can no longer keep the window within budget. Triggers, spectrum, preservation, tuning, multi-agent coordination, and a cross-framework comparison live there. It sits between Memory and Context Management in the reading order because it describes what happens when the content layer overflows, before the runtime layer that has to orchestrate around it.
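The trigger half of compaction can be sketched as a threshold check. This is a toy under stated assumptions: `summarize` stands in for a model call that compresses the oldest turns, and the 80% trigger and keep-last-two policy are illustrative, not taken from any particular framework.

```python
def maybe_compact(messages, window_limit, threshold=0.8,
                  count=lambda m: len(m["content"].split()),
                  summarize=lambda msgs: f"summary of {len(msgs)} earlier turns"):
    """Fire compaction when occupancy crosses a threshold.

    `summarize` is a stand-in for a model call that compresses the oldest
    turns; the 80% trigger and keep-last-two preservation policy are
    illustrative defaults, not a framework's actual values.
    """
    used = sum(count(m) for m in messages)
    if used <= threshold * window_limit:
        return messages                        # still within budget: no-op
    head, tail = messages[:-2], messages[-2:]  # preserve the most recent turns
    return [{"role": "system", "content": summarize(head)}] + tail
```

Every design question the Compaction chapter covers lives in one of these knobs: when to fire (`threshold`), what to preserve (the `tail` policy), and how to compress (`summarize`).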
The three pillars are not independent:
- Memory writes are loaded by context management as the loop progresses; good memory is useless if the context manager never retrieves it.
- Tool definitions are part of the prompt, but their return shape is governed by context management (how much of each result stays in the window, how it is truncated).
- Sub-agents inherit a fresh prompt, use their own memory channel, and compose with compaction when they return — all three pillars are involved in a single delegation.
Treat the pillars as a vocabulary for reasoning about trade-offs, not as orthogonal modules.
The Guiding Principle
Every technique in the following pages — XML tagging, example selection, memory types, just-in-time loading, compaction cascades, sub-agent architectures — is a local instance of one principle:
Identify the minimal set of high-signal tokens that maximize the likelihood of the desired outcome at every step of the loop.
As models improve, the context window grows and attention quality increases. Some scaffolding built for weaker models will become unnecessary. But the underlying constraint does not go away: context remains a finite resource with real opportunity cost. Better models create room for more ambitious tasks, which in turn fill the window with new kinds of information. The tension does not disappear — it shifts shape.
This principle is the single lens through which to read the three pillars. When a technique is confusing or the trade-offs feel arbitrary, come back to it.
A Note on Measurement
Each of the three pillars has distinct things worth measuring — prompt regressions, memory hit rates, context occupancy. Treat them as engineering artifacts, not craft. Each pillar page ends with a short “Measuring It” section that names the signals specific to that pillar.
The meta-principle across all three: without a regression harness, every change is an opinion. The smallest useful harness is a fixed set of realistic inputs plus expected behavior — 20–50 cases is enough to start and catches most of the failures that show up in production.
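Such a harness can be tiny. A sketch with a hypothetical `run_agent` entry point; predicate-style checks tolerate output variation better than exact-match strings.

```python
# A minimal regression harness: fixed realistic inputs plus expected behavior.
# `run_agent` is a hypothetical entry point for the agent under test; the
# checks are predicates rather than exact strings, since agent output varies.

CASES = [
    {"input": "refund order #123", "check": lambda out: "refund" in out},
    {"input": "what's your name?", "check": lambda out: len(out) > 0},
]

def run_agent(prompt):
    # Placeholder for the real agent under test.
    if "refund" in prompt:
        return f"processing refund for: {prompt}"
    return "I'm an assistant."

def run_harness(cases):
    """Return the inputs whose checks failed; an empty list means all passed."""
    return [c["input"] for c in cases if not c["check"](run_agent(c["input"]))]

run_harness(CASES)  # → []
```

Run it before and after every prompt or memory change; any newly failing input is a regression, named by its case.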
Reading Guide
Read in order if you are new to the topic; each pillar assumes the previous ones as vocabulary.
- Prompt Design — The static skeleton. How to find the right altitude, structure instructions so the model can parse them, use examples as the primary steering tool, and order prompt blocks so caching actually helps.
- Memory Design — What persists outside the window. A taxonomy of memory types (episodic, semantic, procedural, working), when to write vs. when to retrieve, structured notes as durable planning state, and how memory invalidates.
- Compaction — The operation that fires when the window can no longer hold what it needs: the compression spectrum, trigger strategies, preservation policy, custom instructions, design extensions (multi-agent / caching / recovery), and a cross-framework reference.
- Context Management — The runtime discipline that integrates everything above. Attention budgets, just-in-time loading, progressive disclosure, sub-agents as a context-engineering tool, checkpointing, and failure modes.
And two capstones that turn theory into practice:
- Case Study: Claude’s Design Prompt — a close reading of a real ~340-line production system prompt, naming every design move worth learning from. Theory in action.
- From Case to Paradigm — the method distilled out of that reading: a 10-step design procedure, plus how the same invariants extend from a single monolithic prompt to composed, runtime-assembled architectures.
If you are looking for Zapvol’s specific implementation (compaction tiers, memory service, prompt registry), see Architecture — this section is the design theory, not the code walkthrough.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- Prompting best practices — Anthropic, Claude API docs
- Building Effective Agents — Anthropic, 2024
- Building LLM Applications for Production — Chip Huyen, 2023, early and still influential framing of the prompt-plus-context problem