Context Engineering
The shift from prompt engineering to context engineering — why an agent's attention budget is a finite resource and how the three pillars (prompt design, memory design, context management) compose
From Prompt Engineering to Context Engineering
Prompt engineering is the craft of writing a single high-quality instruction: picking the right words, the right structure, the right examples to steer a single model call. It remains useful — but it is no longer sufficient.
Context engineering is the broader discipline. It asks a different question:
“What configuration of context is most likely to generate the model’s desired behavior?” — Anthropic, Effective Context Engineering for AI Agents
An agent is not a single call. It is a loop: the model reads context, calls tools, receives results, reasons, and continues — often for dozens or hundreds of turns. Every turn, the context window is re-read in full. What lives in that window, how it got there, when it leaves, and what replaces it — these choices compound across the entire run. Prompt engineering optimizes one call; context engineering optimizes the entire information lifecycle.
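The loop can be sketched in a few lines. This is a minimal illustration, not any particular SDK: `call_model` and `run_tool` are hypothetical stand-ins for a real model API and tool executor.

```python
# Minimal agent-loop sketch. `call_model` and `run_tool` are hypothetical
# placeholders for a real model API and tool executor.

def call_model(context):
    # Placeholder: a real implementation would send the full context to an
    # LLM and get back either a tool request or a final answer.
    if any(msg["role"] == "tool" for msg in context):
        return {"type": "answer", "text": "done"}
    return {"type": "tool_call", "name": "search", "args": {"q": "docs"}}

def run_tool(name, args):
    # Placeholder tool executor.
    return f"results for {args['q']}"

def agent_loop(user_goal, max_turns=10):
    context = [{"role": "user", "content": user_goal}]
    for _ in range(max_turns):
        step = call_model(context)  # the model re-reads the *entire* window
        if step["type"] == "answer":
            return step["text"]
        result = run_tool(step["name"], step["args"])
        # Every tool result lands in the window and competes for attention
        # on all subsequent turns -- this is where context strategy compounds.
        context.append({"role": "tool", "content": result})
    return None
```

The point of the sketch is the shape, not the internals: whatever `context` accumulates on one turn is re-read on every later turn, so each append is a decision with compounding cost.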
The shift matters because the failure modes are different. A bad prompt produces a bad answer. A bad context strategy produces an agent that starts coherent and decays — losing track of earlier decisions, contradicting itself, forgetting the user’s goal ten turns ago. The second failure is invisible until the agent crosses some threshold. By then, the conversation is unrecoverable.
The Attention Budget
Language models have finite working memory. Research on long-context benchmarks — “needle in a haystack” retrieval, multi-document QA, and agentic coding evaluations — reveals a phenomenon Anthropic calls context rot:
As the number of tokens in the context window increases, the model’s ability to accurately recall specific information decreases.
This is not a hard cliff at some specific token count. It is a gradient: the model’s effective precision degrades gradually long before the context is technically “full”. The implication is not “use smaller contexts” but:
Find the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome.
Every token in the window competes for attention with every other token. Low-value tokens — stale tool output, repeated boilerplate, speculation that went nowhere — do not merely occupy space. They actively dilute the model’s focus on high-value tokens. Context engineering is the discipline of maximizing signal density.
Two consequences follow:
- More context is not always better. Pre-loading every “potentially useful” document can measurably hurt performance. A 100k-token context stuffed with weak signals often loses to a 10k-token context of carefully chosen ones.
- Context has a cost. Tokens are billed. Caching amortizes that cost, but only for the stable prefix (see prompt design for why this shapes prompt layout).
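The "smallest set of high-signal tokens" framing can be sketched as a greedy fill under a token budget. Everything here is illustrative: the relevance scores are assumed to come from an upstream retriever, and token counting is a crude whitespace approximation.

```python
def assemble_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Pick the highest-signal chunks that fit within the budget.

    `chunks` is a list of (score, text) pairs; scoring (embedding similarity,
    recency, etc.) is assumed to happen upstream. Token counting here is a
    crude whitespace approximation -- a real system would use a tokenizer.
    """
    chosen, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen

chunks = [
    (0.9, "user goal: migrate the billing service"),
    (0.2, "boilerplate header repeated in every file"),
    (0.7, "error log from the failing deploy"),
]
# Under a tight budget, only the two high-signal chunks survive;
# the boilerplate is excluded rather than diluting attention.
picked = assemble_context(chunks, budget_tokens=12)
```

The inversion matters: instead of asking "what might be useful to include?", the budget forces the question "what earns its place?"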
Three Pillars
Context engineering is a large design space. It is easier to think of it as three pillars that each answer a different question:
| Pillar | Question | Concerns |
|---|---|---|
| Prompt Design | What is the static skeleton the agent always sees? | Altitude, structure, examples, tool definitions, caching |
| Memory Design | What survives outside the window, to be retrieved when relevant? | Memory taxonomy, write/recall policy, structured notes |
| Context Management | How does the runtime prune, replace, and reload information over a loop? | Budgets, just-in-time, sub-agent isolation, checkpointing |
One operation inside the third pillar is substantial enough to warrant its own chapter: Compaction — what fires when the runtime can no longer keep the window within budget. Triggers, spectrum, preservation, tuning, multi-agent coordination, and a cross-framework comparison live there. It sits between Memory and Context Management in the reading order because it describes what happens when the content layer overflows, before the runtime layer that has to orchestrate around it.
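The trigger half of compaction can be sketched as a threshold check. This is a toy under stated assumptions: `summarize` stands in for a model call that compresses the oldest turns, and the 80% trigger and keep-last-two policy are illustrative, not taken from any particular framework.

```python
def maybe_compact(messages, window_limit, threshold=0.8,
                  count=lambda m: len(m["content"].split()),
                  summarize=lambda msgs: f"summary of {len(msgs)} earlier turns"):
    """Fire compaction when occupancy crosses a threshold.

    `summarize` is a stand-in for a model call that compresses the oldest
    turns; the 80% trigger and keep-last-two preservation policy are
    illustrative defaults, not a framework's actual values.
    """
    used = sum(count(m) for m in messages)
    if used <= threshold * window_limit:
        return messages                        # still within budget: no-op
    head, tail = messages[:-2], messages[-2:]  # preserve the most recent turns
    return [{"role": "system", "content": summarize(head)}] + tail
```

Every design question the Compaction chapter covers lives in one of these knobs: when to fire (`threshold`), what to preserve (the `tail` policy), and how to compress (`summarize`).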
The three pillars are not independent:
- Memory writes are loaded by context management as the loop progresses; good memory is useless if the context manager never retrieves it.
- Tool definitions are part of the prompt, but their return shape is governed by context management (how much of each result stays in the window, how it is truncated).
- Sub-agents inherit a fresh prompt, use their own memory channel, and compose with compaction when they return — all three pillars are involved in a single delegation.
Treat the pillars as a vocabulary for reasoning about trade-offs, not as orthogonal modules.
The Guiding Principle
Every technique in the following pages — XML tagging, example selection, memory types, just-in-time loading, compaction cascades, sub-agent architectures — is a local instance of one principle:
Identify the minimal set of high-signal tokens that maximize the likelihood of the desired outcome at every step of the loop.
As models improve, the context window grows and attention quality increases. Some scaffolding built for weaker models will become unnecessary. But the underlying constraint does not go away: context remains a finite resource with real opportunity cost. Better models create room for more ambitious tasks, which in turn fill the window with new kinds of information. The tension does not disappear — it shifts shape.
This principle is the single lens through which to read the three pillars. When a technique is confusing or the trade-offs feel arbitrary, come back to it.
A Note on Measurement
Each of the three pillars has distinct things worth measuring — prompt regressions, memory hit rates, context occupancy. Treat them as engineering artifacts, not craft. Each pillar page ends with a short “Measuring It” section that names the signals specific to that pillar.
The meta-principle across all three: without a regression harness, every change is an opinion. The smallest useful harness is a fixed set of realistic inputs plus expected behavior — 20–50 cases is enough to start and catches most of the failures that show up in production.
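Such a harness can be tiny. A sketch with a hypothetical `run_agent` entry point; predicate-style checks tolerate output variation better than exact-match strings.

```python
# A minimal regression harness: fixed realistic inputs plus expected behavior.
# `run_agent` is a hypothetical entry point for the agent under test; the
# checks are predicates rather than exact strings, since agent output varies.

CASES = [
    {"input": "refund order #123", "check": lambda out: "refund" in out},
    {"input": "what's your name?", "check": lambda out: len(out) > 0},
]

def run_agent(prompt):
    # Placeholder for the real agent under test.
    if "refund" in prompt:
        return f"processing refund for: {prompt}"
    return "I'm an assistant."

def run_harness(cases):
    """Return the inputs whose checks failed; an empty list means all passed."""
    return [c["input"] for c in cases if not c["check"](run_agent(c["input"]))]

run_harness(CASES)  # → []
```

Run it before and after every prompt or memory change; any newly failing input is a regression, named by its case.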
Reading Guide
Read in order if you are new to the topic; each pillar assumes the previous ones as vocabulary.
- Prompt Design — The static skeleton. How to find the right altitude, structure instructions so the model can parse them, use examples as the primary steering tool, and order prompt blocks so caching actually helps.
- Memory Design — What persists outside the window. A taxonomy of memory types (episodic, semantic, procedural, working), when to write vs. when to retrieve, structured notes as durable planning state, and how memory invalidates.
- Compaction — The operation that fires when the window can no longer hold what it needs: the compression spectrum, trigger strategies, preservation policy, custom instructions, design extensions (multi-agent / caching / recovery), and a cross-framework reference.
- Context Management — The runtime discipline that integrates everything above. Attention budgets, just-in-time loading, progressive disclosure, sub-agents as a context-engineering tool, checkpointing, and failure modes.
And two capstones that turn theory into practice:
- Case Study: Claude’s Design Prompt — a close reading of a real ~340-line production system prompt, naming every design move worth learning from. Theory in action.
- From Case to Paradigm — the method distilled out of that reading: a 10-step design procedure, plus how the same invariants extend from a single monolithic prompt to composed, runtime-assembled architectures.
If you are looking for Zapvol’s specific implementation (compaction tiers, memory service, prompt registry), see Architecture — this section is the design theory, not the code walkthrough.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- Prompting best practices — Anthropic, Claude API docs
- Building Effective Agents — Anthropic, 2024
- Building LLM Applications for Production — Chip Huyen, 2023, early and still influential framing of the prompt-plus-context problem