From Case to Paradigm
A 10-step method distilled from the case study, followed by the elevation from single-prompt design to composed architectures — same invariants, different assembly
From Observation to Action
The case study walked through 47 design moves in a real production prompt. That was observation. This page is action — the same material turned into a procedure you can apply to your own prompts.
Two parts:
- A 10-step design method, distilled from the case study and the earlier theory pages. Each step names what to do, why it matters, and how to check you’ve done it.
- The elevation: when a single monolithic prompt stops scaling, the same 10 steps generalize to composed design — prompts assembled at runtime from layers. Most of the invariants survive unchanged; a few change form; composed design introduces its own new problems.
The goal is that a reader closing this page can start writing, then know when to stop writing prose and start writing layers.
Part 1 — The Method
Ten steps, roughly in the order you write them.
Step 1 — Declare Role, Medium, and Power
Do: write three short sentences. Who is the agent, what medium does it produce, and what is its relation to the user (peer? assistant? manager’s report?).
Why: a concrete declaration shapes every downstream decision — tone, vocabulary, willingness to act. An abstract role (“helpful assistant”) forces the model to infer specifics; a concrete role (“senior reviewer for a TypeScript backend, reports to the user”) does not.
Check: a new engineer reading your first three sentences can answer “what is this agent for, and who is it serving?” without asking.
See prompt-design → role is a disproportionate lever; case study Domain 1.1–1.2.
Step 2 — Fill the Skeleton Anatomy
Do: cover six functions — Role / Capabilities / Behavioral Rules / Workflow / Output Standards / Escape Hatch. Missing any of these is a known source of runtime breakage.
Why: prompts fail most often not because a rule was wrong but because a rule was absent. The skeleton is a checklist of what every prompt needs, regardless of task.
Check: point at the single line that handles “what does the agent do when confused?” If you can’t, your escape hatch is missing.
See prompt-design → prompt anatomy; case study Domain 1.5.
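The six-function check can be mechanized. A minimal sketch, assuming your prompt uses Markdown headings (the marker strings below are illustrative conventions, not a standard — map them to your own headings):

```python
# Hypothetical lint: flag which of the six skeleton functions a draft is missing.
# The heading markers are assumptions; adapt them to your prompt's actual headings.
SKELETON = {
    "Role": "## Role",
    "Capabilities": "## Capabilities",
    "Behavioral Rules": "## Rules",
    "Workflow": "## Workflow",
    "Output Standards": "## Output",
    "Escape Hatch": "## When Unsure",
}

def missing_sections(prompt: str) -> list[str]:
    """Return the skeleton functions whose marker heading is absent."""
    return [name for name, marker in SKELETON.items() if marker not in prompt]

draft = "## Role\nSenior reviewer.\n## Workflow\n1. Read the diff.\n"
# This draft covers Role and Workflow but is missing the other four functions,
# including the escape hatch — exactly the gap the Check above asks you to find.
```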
Step 3 — Red Lines With Triggers and Counterparts
Do: for every security-critical or reputation-critical rule, write three parts: the prohibition, a self-trigger the agent can detect mid-generation, and a positive alternative that gives the model something to produce instead.
Why: “don’t X” alone is aspirational — the model has nothing to do with the instruction. “If you find yourself saying X, stop” is operational. And an alternative prevents the deadlock where prohibition alone gives the model no exit.
Check: for each red-line rule, can the agent recognize its own output against the rule during generation, and does it know what to produce instead?
See case study Domain 1.3–1.4; prompt-design → tell what to do.
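The three-part form can be kept honest by representing it as a record that cannot ship one part without the others. A sketch with an illustrative rule (the wording is invented, not from the source prompt):

```python
from dataclasses import dataclass

@dataclass
class RedLine:
    """One security- or reputation-critical rule in the three-part form."""
    prohibition: str   # what must never appear
    trigger: str       # a self-check the agent can run mid-generation
    alternative: str   # what to produce instead

# Hypothetical example rule; the text is illustrative only.
rule = RedLine(
    prohibition="Never echo credentials or API keys into the response.",
    trigger="If you notice a token-shaped string in your draft, stop.",
    alternative="Replace it with <REDACTED> and tell the user where it came from.",
)

def render(rule: RedLine) -> str:
    # Render all three parts together so a prohibition never ships alone.
    return f"{rule.prohibition} {rule.trigger} Instead: {rule.alternative}"
```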
Step 4 — Choose Markdown vs XML Deliberately
Do: Markdown for the prompt’s own sections (role, workflow, rules). XML tags for content the model must treat as data, not instruction — documents it’s analyzing, few-shot examples, user-provided payloads.
Why: the model needs to know the difference between “what I’m asking you” and “what I’m asking you about”. XML boundaries give it an unambiguous signal; all-Markdown or all-XML blur the line.
Check: point at every XML tag in your prompt. Each one should be wrapping data the model is consuming, not instruction it should follow.
See prompt-design → structure: XML vs Markdown.
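A minimal sketch of the split, assuming a `<document>` tag as the data boundary (the tag name is a convention chosen here, not a required API):

```python
# Markdown carries the instructions; XML tags fence off the data being analyzed.
def build_prompt(instructions_md: str, document: str) -> str:
    return (
        f"{instructions_md}\n\n"
        "<document>\n"
        f"{document}\n"
        "</document>\n"
    )

prompt = build_prompt(
    "## Task\nSummarize the document below in three bullets.",
    "Q3 revenue grew 12% while costs held flat.",
)
# Everything outside <document> is "what I'm asking you";
# everything inside it is "what I'm asking you about".
```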
Step 5 — Install an Emphasis Hierarchy
Do: designate three tiers — plain prose (default), bold (scan-to), CRITICAL (non-negotiable). Reserve CRITICAL for rules with production-incident history. Every CRITICAL pairs with reason + alternative.
Why: emphasis is a finite resource. A prompt full of NEVER and MUST teaches the model to tune them out; one where CRITICAL appears 3–4 times keeps each occurrence loud. Reason + alternative lets the model generalize to cases you didn’t enumerate.
Check: count CRITICAL markers in your final prompt. In a 300-line prompt, over ~5 is suspect. Every one should have a reason and a “do this instead”.
See case study Domain 2.1–2.2.
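The Check in this step is easy to automate. A sketch of a budget lint, with the ~5-per-300-lines heuristic from above as the default ceiling:

```python
import re

def critical_count(prompt: str) -> int:
    """Count CRITICAL markers; word-bounded so 'criticality' doesn't match."""
    return len(re.findall(r"\bCRITICAL\b", prompt))

def emphasis_budget_ok(prompt: str, max_critical: int = 5) -> bool:
    # Heuristic ceiling for a ~300-line prompt; tune the budget to your prompt.
    return critical_count(prompt) <= max_critical
```

A fuller version could also verify that each CRITICAL line carries a reason and a “do this instead”, but that check depends on your prompt’s phrasing conventions.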
Step 6 — Name the Defaults You Want Overridden
Do: observe your model’s outputs across many realistic tasks. Note what it reaches for when unconstrained — the cliché phrasings, the default design choices, the safe-but-wrong inferences. List them as explicit anti-patterns with positive alternatives.
Why: generic “avoid bad X” instructions are ignored. Specific named defaults (“avoid gradient backgrounds, avoid Inter/Roboto, avoid decorative emoji”) override. This only works if you’ve actually seen the outputs — you can’t name defaults you haven’t observed.
Check: can you trace each named anti-pattern back to a specific output you didn’t like? If not, you’re guessing.
See case study Domain 3.1–3.5; prompt-design → measuring it.
Step 7 — Order for Caching: Stable First, Dynamic Last
Do: arrange prompt blocks from most stable to most dynamic. Tools → identity → long-lived context → durable examples → message history → current turn.
Why: prompt caching hits only on the stable prefix. A dynamic token early in the prompt invalidates every prefix after it — and the prompt gets re-read on every turn, so cache misses compound fast.
Check: identify the exact character where the cached prefix ends. Everything before it should be content that does not change turn-to-turn.
See prompt-design → caching-aware ordering.
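The ordering can be pinned down in the assembly code so it never drifts. A sketch using the block sequence given above (the block names are labels chosen here):

```python
# Fixed stable-to-dynamic order, mirroring the sequence in the Do step.
BLOCK_ORDER = [
    "tools",              # most stable: changes only on deploys
    "identity",
    "long_lived_context",
    "durable_examples",
    "message_history",    # grows every turn
    "current_turn",       # changes every turn
]

def assemble(blocks: dict[str, str]) -> str:
    # Concatenating in this fixed order keeps the cached prefix intact
    # for as long as possible before the first dynamic token appears.
    return "\n\n".join(blocks[name] for name in BLOCK_ORDER if name in blocks)
```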
Step 8 — Skills and Tools as a Registry, Not a Dump
Do: list available capabilities (skills, starter components, optional tools) by name plus a one-line description. Full instructions load on demand through a tool call.
Why: the index needs to be in the prompt so the agent knows what exists; the full content usually doesn’t — loading every skill’s prompt would be thousands of lines of attention tax, most of it unused per task.
Check: is your “available capabilities” section under ~30 lines regardless of how many capabilities exist? If it’s scaling linearly, you’re carrying weight.
See context-management → just-in-time context; case study Domain 5.1.
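A sketch of the index-plus-on-demand pattern. The skill names are hypothetical, and `load_skill` is a stand-in for whatever tool call actually fetches the full instructions:

```python
# Only the one-line index goes into the prompt; full content loads on demand.
SKILLS = {
    "pdf-extract": "Pull text and tables out of PDF files.",
    "chart": "Render data as SVG charts.",
    "translate": "Translate text between languages.",
}

def registry_index() -> str:
    """The part that lives in the prompt: one line per skill, name + blurb."""
    return "\n".join(f"- {name}: {blurb}" for name, blurb in sorted(SKILLS.items()))

def load_skill(name: str) -> str:
    # Stand-in for the on-demand fetch (e.g. reading a skill file via a tool).
    return f"(full instructions for {name} loaded here)"
```

Note that `registry_index()` stays one line per skill no matter how long each skill’s full instructions are — that is the property the Check above is probing.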
Step 9 — Install the Measurement Loop
Do: build a fixed evaluation set — 20 to 50 realistic inputs with expected behaviors — before you start iterating. Run it against each prompt revision. Track which cases pass, which fail, and which have shifted failure modes.
Why: without measurement, prompt iteration is a random walk. Adding rule #47 to fix the latest edge case breaks rules #1-46 more often than you’d think — and you won’t see the breakage without the suite.
Check: right now, can you state your current prompt’s pass rate on a fixed eval set? If not, you’re guessing.
See prompt-design → measuring it; overview’s “A Note on Measurement”.
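A minimal sketch of the loop, assuming each case pairs a realistic input with a predicate encoding the expected behavior (the cases below are toy examples):

```python
from typing import Callable

# Fixed eval set: (input, check) pairs. Real sets run 20-50 realistic cases.
EVAL_SET: list[tuple[str, Callable[[str], bool]]] = [
    ("summarize this doc", lambda out: len(out.split()) < 100),
    ("what is 2+2", lambda out: "4" in out),
]

def run_evals(agent: Callable[[str], str]) -> dict[str, bool]:
    """Run every case against the agent; return pass/fail per case."""
    return {case: check(agent(case)) for case, check in EVAL_SET}

def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)
```

Diffing `run_evals` output between two prompt revisions is what surfaces the “rule #47 broke rule #12” regressions the Why paragraph warns about.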
Step 10 — Write the End-of-Turn Discipline
Do: explicitly specify what end-of-turn output should look like — format, length, content. Use prescriptive language, not aspirational.
Why: models default to verbose summaries. A rule saying “be concise” is aspirational; a rule saying “only caveats and next steps” is prescriptive. Prescriptive wins.
Check: is your end-of-turn rule concrete enough that two readers would write the same response to a simple request?
See case study Domain 4.7.
Part 2 — Where the Method Stops Scaling
The 10 steps above assume one prompt, one agent, one task type. This assumption holds until it doesn’t. A few symptoms announce that a monolithic prompt has hit its ceiling:
- The tool set varies per task. Different kinds of requests need different tools. A single prompt has to explain the if/else — “when doing X, use these three; when doing Y, use those four” — and every turn pays the attention tax for tools the current task isn’t using.
- There are multiple agent roles. Main agent, subagents, verifiers. Each needs a different identity core but shares some policies. A single prompt either forces all of them into one persona (the main agent becomes the verifier’s prompt too, awkwardly) or splits into N files with no shared base.
- Sections need independent versioning. You want to A/B test just the workflow section without touching the safety rules. You want to update the output standards without risking regressions in the red-line enforcement. A monolithic file makes this a copy-paste exercise with no clean rollback.
- Multi-tenant needs differ. Different users, plans, or regions need different constraints. Static prompts either show everyone everything or fork into per-tenant prompts — both scale badly.
- The prompt has outgrown one sitting. You can no longer read it end-to-end without losing track. CRITICAL markers are losing salience because there are now too many. Rules are starting to contradict.
The root cause across all of these: the prompt is a static file when the situation calls for a function. A function takes the current task, user, tier, agent role, tool set, and produces the prompt that fits. The monolithic approach tries to encode that function as branching prose; eventually the prose can’t keep up.
Part 3 — Elevating to Composed Design
When the monolithic model stops scaling, the shift is architectural: the system prompt becomes a function, not a file. At each turn, the prompt is assembled from a set of layers, each contributing a piece.
A Typical Layer Decomposition
Any composed system usually has a shape like this — the names vary but the functions recur:
| Layer | Contributes | Varies by |
|---|---|---|
| Identity | Role, behavioral rules, red lines, policies | Agent variant (main / subagent) |
| Tool prompts | One instruction block per enabled tool | Tier / tool availability |
| Memory layer | Index of persistent memory relevant to this user or session | User / session |
| Environment | Date, sandbox state, tenant, session metadata | Every turn |
| Variant core | Kind-specific instructions (main vs subagent vs team member) | Agent role |
The system prompt the model sees is the concatenation of these layers, assembled per turn. No layer is optional in principle; some can be empty.
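The “function, not a file” idea can be sketched directly from the table. The layer contents and lookup tables below are illustrative assumptions, not a real system’s API:

```python
# Illustrative layer sources; a real system would load these from files or a store.
IDENTITY = {"main": "## Identity\nYou are the main agent.",
            "subagent": "## Identity\nYou are a research subagent."}
TOOL_PROMPTS = {"search": "### search\nQuery the index.",
                "write": "### write\nEdit files."}
VARIANT_CORE = {"main": "Summarize for the user at end of turn.",
                "subagent": "Return a structured result for the parent."}

def system_prompt(agent_role: str, enabled_tools: list[str],
                  memory_index: str, env: dict[str, str]) -> str:
    """Assemble the per-turn prompt from the five layers in the table."""
    layers = [
        IDENTITY[agent_role],                               # varies by agent variant
        "\n".join(TOOL_PROMPTS[t] for t in enabled_tools),  # varies by availability
        memory_index,                                       # varies by user/session
        f"Date: {env['date']} | Tenant: {env['tenant']}",   # varies every turn
        VARIANT_CORE[agent_role],                           # varies by agent role
    ]
    # Empty layers contribute nothing; no layer is optional in principle.
    return "\n\n".join(layer for layer in layers if layer)
```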
What Stays the Same
Most of the 10-step method transfers unchanged. The steps are now properties of the assembled prompt, not of a single file:
- Step 1 — Role / medium / power. Lives in the identity layer. Concrete role still beats abstract role.
- Step 2 — Skeleton anatomy. Now distributed across layers. Role + rules in identity; capabilities in tool prompts; workflow in identity or variant.
- Step 3 — Red lines with triggers and counterparts. In the identity layer; unchanged in form.
- Step 5 — Emphasis hierarchy. The policy “CRITICAL is scarce” applies per layer and across the whole assembly. CRITICAL used twice in identity and twice in a tool prompt is already four — scarcity has to be global.
- Step 6 — Named defaults. In identity. Observing the model’s defaults is the same work whether the prompt is monolithic or composed.
- Step 9 — Measurement. Unchanged conceptually. You eval the assembled prompt, not individual layers — though tracing which layer caused a regression is now a new job.
What Changes Form
Four of the ten steps look different when the prompt is composed:
- Step 4 — Markdown vs XML. Now has two levels: structure within each layer, and structure between layers. Between-layer separation is often just stable markdown headers; within-layer structure follows the same rules as before.
- Step 7 — Caching-aware ordering. No longer about character positions in a file. Instead, each layer declares its own stability — some layers are cached across tasks (identity, tool prompts), some across a session (memory layer), some never (environment). The assembly code places stable layers first and volatile ones last, but the declaration lives with the layer, not with the assembler.
- Step 8 — Skills registry. In a composed system this is sometimes not a section of the prompt at all; it IS the tool prompts layer. A tool that is enabled contributes its own block; a disabled tool contributes nothing. The registry is implicit in the layer’s presence or absence.
- Step 10 — End-of-turn discipline. Often goes in the variant layer, because main agents and subagents have different output contracts. A main agent writes a user-facing summary; a subagent writes a structured result for the parent to consume. Same principle, different specifics per variant.
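The “declaration lives with the layer” variant of Step 7 can be sketched as layers that carry their own stability tag, which the assembler sorts on (the stability tiers and layer names are illustrative):

```python
from dataclasses import dataclass

# Lower rank = more stable = earlier in the assembled prompt.
STABILITY = {"deploy": 0, "session": 1, "turn": 2}

@dataclass
class Layer:
    name: str
    stability: str  # "deploy" | "session" | "turn"
    content: str

def assemble(layers: list[Layer]) -> str:
    # The assembler only knows the ordering rule; each layer declares
    # its own stability rather than the assembler hard-coding positions.
    ordered = sorted(layers, key=lambda l: STABILITY[l.stability])
    return "\n\n".join(l.content for l in ordered)

layers = [
    Layer("environment", "turn", "Date: ..."),
    Layer("identity", "deploy", "## Identity ..."),
    Layer("memory", "session", "## Memory index ..."),
]
```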
New Problems Composed Design Introduces
Composed design is not free. It creates a class of problems monolithic prompts don’t have:
- Ordering contract. Layers must be concatenated in a deterministic order. Who declares that order? Is it a config, a convention, or hard-coded in the assembler? Every real system has to answer this.
- Conflict resolution. If the identity layer says “always verify with the user” and a tool prompt says “act immediately”, which wins? A composed system needs either (a) a precedence rule, (b) a policy that layers can’t contradict identity, or (c) a pre-assembly lint check.
- Versioning. Each layer now has its own release cycle. Rolling back a bad change to the identity layer shouldn’t undo unrelated improvements to tool prompts. Git-level versioning at the file level works poorly; you need some layer-level abstraction.
- Testing. Your eval set has to cover combinations. A prompt that passes with “identity + tools A+B” may fail with “identity + tools A+C”. The combinatorics grow fast; good systems cover representative tiers and tool combinations rather than enumerating them exhaustively.
- Debuggability. When the agent behaves badly, which layer caused it? Monolithic failures point at a single file; composed failures require the assembled prompt to be logged or reconstructed from the layers that were active at the time of the failure.
These problems are the cost of the elevation. They’re worth paying when the symptoms in Part 2 are real; they’re unnecessary overhead otherwise.
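Of these, the conflict-resolution option (c) is the most mechanical. A sketch of a pre-assembly lint, where the contradiction pairs are illustrative assumptions you would grow from observed incidents:

```python
# Hypothetical pre-assembly lint: flag layer pairs that trip a known
# contradiction. The phrase pairs are illustrative, not exhaustive.
CONTRADICTIONS = [
    ("always verify with the user", "act immediately"),
    ("never write files", "edit files directly"),
]

def lint_layers(layers: dict[str, str]) -> list[str]:
    """Return one warning per known contradiction found across layers."""
    warnings = []
    for a_phrase, b_phrase in CONTRADICTIONS:
        holders_a = [n for n, text in layers.items() if a_phrase in text.lower()]
        holders_b = [n for n, text in layers.items() if b_phrase in text.lower()]
        if holders_a and holders_b:
            warnings.append(f"{holders_a[0]} says '{a_phrase}' but "
                            f"{holders_b[0]} says '{b_phrase}'")
    return warnings
```

Running this at assembly time turns “which wins?” from a runtime surprise into a build-time failure, at the cost of maintaining the contradiction list.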
A Note on Choice
Not every prompt should be composed. Composition is an engineering choice with real cost:
- Single agent, stable tool set, small team — monolithic is the right answer. Composition adds overhead you won’t recoup.
- Multi-tier / multi-subagent / dynamic tools / multi-tenant — monolithic will eventually hit the symptoms in Part 2. Start composed from the beginning if you can, or plan the migration.
- Uncertain — stay monolithic until the symptoms show up. Premature composition is one of the most common prompt-engineering mistakes. The symptoms are loud when they arrive; you won’t miss them.
The clearest signal to elevate is when you find yourself manually reassembling — copy-pasting different versions of the prompt for different modes, or maintaining two near-identical prompts that diverged on one section. That’s the moment the prompt wants to be a function instead of text.
Composition is also reversible. Some systems start composed, discover they only have one variant in practice, and flatten back to monolithic. Either direction is fine as long as the choice is driven by measurement, not by aesthetics.
Summary
- The 10-step method gives a procedure for writing a single good prompt. It is enough for most agents on day one.
- The method’s invariants (role, altitude, emphasis hierarchy, named defaults, caching, measurement) survive the elevation to composed design unchanged in principle, changed in form for four of the ten.
- The elevation is an architectural choice with real cost. Make it when the symptoms demand it; stay monolithic when they don’t.
- Across both architectures, the underlying principle is the same: the smallest set of high-signal tokens that maximizes the likelihood of the desired outcome. The architecture just determines where those tokens come from.
Related Reading
- ← Overview — Return to the section hub.
- Case Study: Claude’s Design Prompt — The source material. This page is the method; that page is the worked example it was distilled from.
- Prompt Design — The theory behind the individual steps.
- Context Management — The runtime layer that consumes the assembled prompt; Step 7 (caching) and Step 8 (JIT registry) live at the boundary between prompt design and context management.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- Prompting best practices — Anthropic, Claude API docs
- CL4R1T4S: ANTHROPIC/Claude-Design-Sys-Prompt.txt — the monolithic reference point for Part 1