Prompt Design

The static skeleton — finding the right altitude, structuring instructions the model can actually parse, using examples as the primary steering tool, and ordering prompt blocks so caching helps

The Static Skeleton

The system prompt is the part of context the agent always sees. Unlike message history, which grows, and tool results, which come and go, the skeleton is written once and re-read on every turn. That asymmetry changes how you design it:

  • Every token you add is paid for on every turn — cheap with prompt caching, not free without it.
  • Tokens that land in the skeleton compete for attention with every user message and tool result to come. A bloated skeleton silently degrades every future turn.
  • What you put here is the one part of context you fully control. Everything else is shaped by what the user asks and what tools return.

The skeleton’s job is to encode the agent’s role, capabilities, and behavioral defaults — compactly, and at the right level of abstraction.


Right Altitude

The most common system-prompt failure is choosing the wrong altitude of specification. Anthropic names two failure modes:

Avoid complex, brittle logic in your prompts to elicit exact agentic behavior. […] Avoid vague, high-level guidance that fails to give the LLM concrete signals. — Effective Context Engineering for AI Agents

The target is in between: specific enough to guide behavior reliably, flexible enough that the model can handle cases the prompt didn’t anticipate.

A quick test: can an external change (new edge case, new tool, new user phrasing) be handled without rewriting the prompt?

  • Too specific: “If the user says migrate, call db_migrate. If the user says import, call data_import.” Brittle — the eleventh verb breaks the prompt.
  • Too vague: “Help the user with database tasks.” Useless — the model will hallucinate what “help” means on anything non-trivial.
  • Right altitude: “You perform database administration. When the user requests a schema change, inspect the current schema, propose a migration, get confirmation, then apply. Never drop data without explicit confirmation.”

The right-altitude version states what to do, in what order, and where the bright lines are — without enumerating every verb or every table.

Heuristic: write a prompt a new engineer could follow. If they would ask “but what about X?” for every realistic X, you are too vague. If they could follow it without thinking, you are too specific.


Prompt Anatomy

A workable system prompt has these sections, in roughly this order:

| Section | Purpose | Example content |
| --- | --- | --- |
| Role / identity | Who the agent is, what it is for | “You are a code review assistant for the Zapvol codebase.” |
| Capabilities | What the agent can do — usually framed as available tools | “You can read files, run tests, and propose patches.” |
| Behavioral rules | Non-negotiable defaults — priority when rules conflict | “Never modify files outside the working directory.” |
| Workflow | Typical flow for a task — not a rigid script | “Read related files before proposing. Run tests before reporting done.” |
| Output standards | Format, tone, length expectations | “Report in terse prose. Cite file:line for every claim.” |
| Escape hatch | What to do when the model is confused | “If the user’s intent is unclear, ask one clarifying question.” |

Two principles govern the anatomy:

Prefer policy over enumeration. Instead of listing every case, give the model a small policy it can apply:

Not: “If X do A. If Y do B. If Z do C.”

Better: “Distinguish read-only queries from destructive ones. For destructive ops, always confirm first.”

State constraints once, at the highest applicable level. If a rule applies to every tool, put it in Behavioral Rules — not inside each tool’s description. Repetition in the skeleton is attention tax.
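The anatomy above can be sketched as a small assembly step. A minimal sketch in Python, assuming a hypothetical code-review agent; the section texts are illustrative placeholders, not recommended prompts:

```python
# Assemble a system prompt from the anatomy sections, in order.
# All section contents here are invented for illustration.

SECTIONS = {
    "Role": "You are a code review assistant for a TypeScript backend.",
    "Capabilities": "You can read files, run tests, and propose patches.",
    "Behavioral rules": (
        "Never modify files outside the working directory. "
        "For destructive operations, always confirm first."
    ),
    "Workflow": "Read related files before proposing changes. Run tests before reporting done.",
    "Output standards": "Report in terse prose. Cite file:line for every claim.",
    "Escape hatch": "If the user's intent is unclear, ask one clarifying question.",
}

def build_skeleton(sections: dict[str, str]) -> str:
    """Render each section under a Markdown header, preserving declaration order."""
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections.items())

print(build_skeleton(SECTIONS))
```

Keeping the skeleton as data also makes it diffable: a prompt revision is a one-line change to one section, not an edit buried in a wall of prose.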


Role Is a Disproportionate Lever

A single sentence declaring who the model is shifts tone, vocabulary, depth of detail, and willingness to act — usually more than an equivalent-length rule lower in the prompt. This is worth isolating because it is easy to write a long prompt and forget to state the role.

Compare:

  • No role: “Help the user with their code.”
  • With role: “You are a senior code reviewer for a TypeScript backend. Your reviews focus on correctness, not style.”

The second version steers the model’s output at every downstream decision — what it chooses to flag, how terse it is, whether it proposes changes or asks questions. You do not need to repeat “as a senior reviewer” throughout the prompt; the opening role pervades.

Two rules of thumb:

  • One role, concrete. “Senior code reviewer for a TypeScript backend” beats “helpful assistant”. Abstract roles give abstract behaviors.
  • Don’t stack roles. “You are a code reviewer, product manager, and copywriter” produces a confused average. If one agent needs multiple modes, switch modes explicitly per turn or use separate agents.

Structure: XML vs Markdown

Claude is trained to parse both XML tags and Markdown headers. Which to use depends on what you’re separating:

  • Markdown headers (##, ###) — for the prompt’s own sections: Role, Workflow, Output Standards, etc. Headers give the model a natural table of contents; inside each section, prose works fine.

  • XML tags — for content the model must treat as data, not instruction. The most common uses:

    • <example> / <examples> — few-shot demonstrations
    • <document> / <document_content> / <source> — documents you’re asking the model to analyze
    • <context> / <input> — variable inputs that change per call
    • Any custom tag you want the model to produce in its output: “write your reasoning in <thinking> tags”

The rule of thumb: instructions are Markdown, payloads are XML. XML gives the model an unambiguous boundary between “what I’m asking” and “what I’m asking about”.

For multi-document analysis, Anthropic recommends wrapping each document:

<documents>
  <document index="1">
    <source>report_q3.pdf</source>
    <document_content>{{CONTENT}}</document_content>
  </document>
  <document index="2">
    <source>competitor_brief.md</source>
    <document_content>{{CONTENT}}</document_content>
  </document>
</documents>

Compare the two documents and identify three strategic gaps.

Examples Over Rules

For an LLM, examples are the “pictures worth a thousand words”.

A few good examples often beat pages of rules. Examples are particularly effective for:

  • Format / structure — demonstrate the shape of the answer once, the model imitates.
  • Tone — show two or three replies in the target voice; the model will match.
  • Edge cases — instead of enumerating “what if X, Y, Z”, show examples that collectively cover the decision boundary.

Guidelines for picking examples:

| Quality | What it means |
| --- | --- |
| Relevant | Mirrors realistic inputs, not toy cases |
| Diverse | Covers multiple axes — don’t let the model overfit one pattern |
| Structured | Wrap in <example> tags so they are not mistaken for current input |
| Calibrated | Include borderline cases, not only easy wins — the model learns the line |

Three to five examples is usually the sweet spot. Fewer, and the model generalizes weakly; more, and you are paying for tokens to teach what the first three already taught.

Counter-example when useful. A single example of what NOT to do, labeled clearly, can be worth ten rules — especially for common failure modes. But use sparingly: negation is harder for the model than positive demonstration.
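Taken together, a few tagged demonstrations plus one labeled counter-example can be assembled like this. A minimal sketch, assuming a hypothetical review-summary task; the tag names and example texts are invented for illustration:

```python
# Wrap few-shot examples in tags so they are not mistaken for current input.
# A "bad" entry becomes a clearly labeled counter-example.

examples = [
    ("good", "auth.ts:42 — token expiry is never checked; add a maxAge guard."),
    ("good", "db.ts:17 — unbounded query; paginate or cap the result set."),
    ("bad",  "This file has some issues you might want to look at."),  # vague, no citation
]

def render_examples(pairs) -> str:
    blocks = []
    for label, text in pairs:
        tag = "example" if label == "good" else "counter_example"
        blocks.append(f"<{tag}>\n{text}\n</{tag}>")
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"

print(render_examples(examples))
```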


Tell What To Do, Not What Not To Do

Negative instructions are easy to write and weaker than positive ones:

| Weaker | Stronger |
| --- | --- |
| “Don’t use markdown.” | “Respond in plain prose paragraphs.” |
| “Don’t be verbose.” | “Keep responses under 100 words unless the task requires more detail.” |
| “Never refuse reasonable requests.” | “If a request is ambiguous, pick the most likely interpretation and act.” |
| “Don’t use bullet points for everything.” | “Use prose for reasoning; use bullets only when presenting a list.” |

The reason: models are trained to produce text. A “don’t do X” instruction still makes X an active concept in context; a “do Y instead” gives the model something to actually generate. Positive framings also survive compaction and fine-tuning better — the model retains the behavior, not just the prohibition.

When negation is genuinely needed, pair it with a positive alternative:

“Never fabricate file paths. If you are unsure, use a search tool to verify first.”


Grounding in Quotes

For tasks where the model must reason over long inputs (a large document, a transcript, a multi-file context dump), ask it to cite the relevant passages first, then answer from those citations. The shape:

<document>{{LONG_INPUT}}</document>

Find quotes from the document that relate to the user's question. Put them in <quotes> tags. Then answer the
question, grounding every claim in a quoted passage.

Why this works:

  • Attention anchors. The act of producing quotes re-focuses the model on the specific passages, cutting through surrounding noise.
  • Verifiability. Every claim becomes traceable to a source passage. Hallucinated content that does not match a quote is visibly wrong.
  • Cheap self-check. Without adding a separate verification step, the quote-first pattern catches most “the model confidently invented something” failures.

The same pattern works for agent tool results: when a tool returns a large output (a long search result, a file dump), have the agent excerpt the relevant lines before acting on them. This is lighter-weight than a full summarization and preserves precise references.
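Applied to tool results, the pattern is a thin wrapper. A minimal sketch, assuming tool outputs arrive as plain strings; the tag names and instruction wording are illustrative:

```python
# Wrap a large tool result and instruct the agent to excerpt before acting.
# Tag names (<tool_result>, <excerpt>) are illustrative conventions, not an API.

def excerpt_wrapper(tool_name: str, output: str) -> str:
    return (
        f'<tool_result tool="{tool_name}">\n{output}\n</tool_result>\n\n'
        "Before acting on this result, excerpt the lines relevant to the "
        "current task in <excerpt> tags, citing line numbers."
    )

print(excerpt_wrapper("grep", "src/db.ts:17: const rows = await query(sql)"))
```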


Tool Definitions Are Part of the Prompt

Every tool definition enters the context window. On a modest agent with 15 tools, tool definitions can easily be 3-5k tokens of the skeleton. Three implications:

  • Tool count matters. Each tool must earn its place. If two tools overlap in purpose, the model spends thinking tokens choosing between them. Merge or remove.
  • Tool descriptions are prompts. Each description is read on every turn. Be concise, unambiguous, and state both what the tool does and when not to use it.
  • Parameter names and docs steer the model. search(query: string) is weaker than search(query: string, /** natural-language search phrase, 3-10 words */). The JSDoc-style guidance is the tool’s prompt-within-a-prompt.

Checklist for a well-designed tool:

  • Name: verb-noun, unambiguous (read_file beats file)
  • Description: single paragraph, states purpose, lists when to use, lists when not to use
  • Parameters: each documented; required vs optional clear
  • Output: predictable shape; error cases explicit
  • No overlap with another tool. If overlap is unavoidable, one description must actively route away from the other (“use search_code for semantic queries; use grep for exact strings”).
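The checklist above can cash out as follows, using the JSON-Schema tool format of the Anthropic Messages API; the read_file tool itself and its wording are hypothetical:

```python
# A tool definition following the checklist: verb-noun name, a description
# that routes when to use and when not to, documented parameters with
# required vs optional made explicit. The tool is invented for illustration.

read_file_tool = {
    "name": "read_file",  # verb-noun, unambiguous
    "description": (
        "Read a text file from the working directory and return its contents. "
        "Use when you need the exact current text of a known file. "
        "Do not use for searching; use search_code for that."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the working directory, e.g. src/db.ts",
            },
            "max_lines": {
                "type": "integer",
                "description": "Optional cap on lines returned; omit to read the whole file",
            },
        },
        "required": ["path"],
    },
}
```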

Caching-Aware Ordering

Prompt caching — supported by Anthropic’s API and others — charges 1.25× base for the first pass and 0.1× on hits. Agents re-read the full prompt every turn, so caching routinely pays for itself within a single task. But caching works on prefixes: the cached portion must appear at the start, before anything dynamic.

The right order, from most to least stable:

  1. Tools — usually stable across the whole task
  2. System prompt — stable across tasks for the same agent
  3. Long-lived context — files the user attached at the start, summary of prior sessions
  4. Durable examples — few-shot demonstrations that don’t change per turn
  5. Message history — dynamic
  6. Current user turn — most dynamic

If you put a dynamic timestamp at line 1 of the system prompt, nothing caches. A single volatile token at the top invalidates every prefix after it. Move volatile content to the end; put it inside the user turn if possible.

Implication for prompt writing: resist the temptation to “personalize” the system prompt with per-turn data (user’s current time, recent tool results, last error). That content belongs in message history or tool result context, not in the skeleton.
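The ordering above can be made concrete as a request payload. A sketch following the Anthropic Messages API shape, with cache_control breakpoints marking the end of each stable span; the model name and all content are placeholders:

```python
# Stable spans first (tools, then system prompt), each ending in a cache
# breakpoint; dynamic content (message history, current turn) last.

payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    # 1. Tools: most stable — breakpoint on the last tool caches all of them.
    "tools": [
        {"name": "read_file",
         "description": "Read a text file from the working directory.",
         "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},
    ],
    # 2-4. System prompt, long-lived context, durable examples: cached prefix.
    "system": [
        {"type": "text",
         "text": "You are a code review assistant. ...",
         "cache_control": {"type": "ephemeral"}},
    ],
    # 5-6. Message history and current turn: dynamic, never in the prefix.
    "messages": [
        {"role": "user", "content": "Review src/db.ts for correctness."},
    ],
}
```

Anything volatile, such as a timestamp, goes inside the messages list, where it cannot invalidate the cached prefix.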

The Tension With Long-Document Placement

A separate best practice says: when analyzing a large document, put the document at the top of the prompt, before the question. This can improve response quality by up to 30% on complex multi-document tasks.

Caching ordering and long-document ordering conflict when the “long document” changes between calls:

  • If the document is stable across the task (a repo file, a project brief), it belongs in the cached prefix — early and cached.
  • If the document is one-off (a report the user just pasted, a fresh search result), placing it at the top of the system prompt invalidates the cache every call. Put it in the user turn instead, with its own <document> wrapper.

The resolution is not “pick one”. It is: let the stability of the content dictate where it goes. Stable-and-long goes in the cached prefix; one-off-and-long goes in the user turn.


Anti-Patterns

Authoring mistakes — common ways prompt writing goes wrong before the agent even runs. For runtime behavior that fails despite a good prompt, see Context Management → Failure Modes.

| Anti-pattern | Why it hurts |
| --- | --- |
| Defensive rule accumulation | Every failure prompts a new rule; skeleton bloats; attention dilutes. Fix the root cause instead. |
| Conflicting instructions | Rules that contradict each other leave the model to pick; behavior becomes unpredictable. |
| Over-triggering language | “CRITICAL: You MUST…” was useful on weaker models; on current models it causes over-triggering. |
| Prompting for something tools should do | “Remember you have a search_code tool” is a tool-description problem, not a prompt problem. |
| Dynamic content at the top | Invalidates caching. Every turn pays full cost. |
| Pre-optimization | Tuning a prompt on three examples. Without evaluation, you are guessing. |

When To Stop Iterating

A prompt is done when:

  1. The measured metric (task completion, correctness, format adherence) is at target.
  2. Failure modes are qualitatively different from before — not the same failure the last revision tried to fix.
  3. Additional rules regress on one axis while fixing another. This is the signal that you have reached the prompt’s altitude ceiling. Further improvement requires a different mechanism — better tools, better examples, better memory — not more prompt text.

The failure mode to avoid: adding rule #47 to handle the latest edge case, without checking whether it breaks rules #1-46 on a diverse evaluation set. Iteration without evaluation is not iteration — it is a random walk.


Measuring It

Prompts are an engineering artifact; they deserve the same measurement discipline as code. A minimal evaluation harness for a prompt:

  • A fixed set of inputs — realistic, diverse, including the edge cases you care about. 20-50 is enough to start.
  • Expected behavior per input — either a gold answer, a rubric, or a check function. Doesn’t have to be exact; “must mention X, must not recommend Y” is enough.
  • A way to run the prompt against all inputs — script, notebook, CI job.
  • A regression baseline — before changing the prompt, run the suite and save the results. After changing, diff.
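The four pieces wire together in a few lines. A minimal sketch with a stubbed model call; replace run_prompt with a real API call, and treat the inputs and check functions as illustrative:

```python
# Fixed inputs + per-input checks + a runner + a baseline diff.
# run_prompt is a stub standing in for a real model call.
import json

def run_prompt(prompt: str, user_input: str) -> str:
    """Stub: a real harness would call the model with `prompt` as the system prompt."""
    return f"Reviewed: {user_input}. Cited db.ts:17."

SUITE = [
    # (input, check function — "must mention X, must not recommend Y" rubric)
    ("review db.ts", lambda out: "db.ts" in out and "DROP TABLE" not in out),
    ("review auth.ts", lambda out: "Reviewed" in out),
]

def evaluate(prompt: str) -> dict:
    return {inp: check(run_prompt(prompt, inp)) for inp, check in SUITE}

baseline = evaluate("v1 system prompt")
revised = evaluate("v2 system prompt")
# A regression is any input that passed on the baseline and fails now.
regressions = [inp for inp in baseline if baseline[inp] and not revised[inp]]
print(json.dumps({"pass_rate": sum(revised.values()) / len(SUITE),
                  "regressions": regressions}))
```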

Useful signals beyond pass/fail:

  • Token cost per call — a prompt revision that improves quality at 3× cost may not be worth it.
  • Cache hit rate — if caching-aware ordering is working, hit rates climb toward 80%+ on repeat task types.
  • Failure mode shifts — new failures are fine; same failures means the revision didn’t help.

Without this loop, every prompt change is an opinion. With it, each change is a measurable step.


  • Memory Design — The prompt’s skeleton is one place to put stable information; memory is the other. The two compose: the recall tool’s description lives in the prompt, the memories it retrieves live outside.
  • Context Management — Decides how the prompt interacts with everything dynamic: tool results, message history, sub-agent traces. A prompt optimized in isolation can be undermined by poor runtime management.
