Writing Skills
The process of authoring, testing, and iterating on a SKILL.md — how to choose a pattern, write the description, and make it reliably trigger
Overview introduces what skills are. 5 Skill Design Patterns catalogs the shapes a skill can take. This page is about the process: how to actually write one, verify it works, and improve it without overfitting to the examples in front of you.
The Authoring Loop
Writing a skill that survives contact with real users is iterative. You draft a version, run it against realistic prompts, look at the outputs, and improve. Most of the leverage lives in the loop — the first draft rarely matters.
Five activities shape one artifact:
- Capture intent — happens once per skill. What should the skill enable, when should it trigger, what’s the expected output?
- Draft — pick a pattern from design patterns and fill in the SKILL.md.
- Test — run 2–3 realistic prompts with the skill loaded, ideally alongside a baseline without it.
- Review — read the outputs qualitatively; track objective assertions quantitatively.
- Improve — generalize from what you learned without overfitting to the specific examples, then loop back to Test.
Keep each iteration cheap. A loop that takes minutes lets you try bold changes and throw them away; a loop that takes hours pressures you to commit early to a design that may not survive.
1. Capture Intent
Before writing anything, answer four questions:
- What should this skill enable the agent to do? State the capability, not the implementation.
- When should it trigger? What phrases, contexts, or file types? This becomes the description.
- What’s the expected output? A file format, a section structure, a field layout.
- Does the output benefit from automated tests? Objective outputs (file transforms, data extraction, code generation, fixed workflow steps) do. Subjective outputs (writing style, creative work) usually don’t — qualitative review works better there.
If the user is already mid-conversation and says “turn this into a skill,” extract answers from the conversation first — the tools used, the sequence of steps, the corrections they made. Ask only about the gaps. Confirm before proceeding.
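Captured intent can be a few lines jotted down before drafting. A hypothetical example for a CSV-cleanup skill (all names and triggers invented):

```markdown
## Intent: csv-cleanup
- Enables: normalize messy CSV exports (trim whitespace, unify date formats)
- Triggers: "clean up this CSV", "normalize this export", user attaches a .csv
- Output: a cleaned .csv plus a one-paragraph summary of the changes made
- Automated tests? Yes: row counts and column formats are objectively checkable
```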
2. Drafting SKILL.md
Frontmatter
---
name: skill-name
description: One or two sentences covering both what the skill does AND when to use it.
---
Only name and description are required. Everything else is optional and rarely needed.
Shape selection. Pick a pattern from the 5 design patterns as a starting
point. Most skills start as Tool Wrapper (instructions + references/) and graduate to heavier patterns only when
iteration signals demand it — see Pattern Evolution below.
Body length
Keep the body under ~500 lines. If you’re approaching that, split depth into references/ and add a short index in
SKILL.md pointing to where to read next. Large reference files (>300 lines) should include a table of contents so the
agent can jump to the relevant section rather than loading the whole thing.
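A table of contents in a large reference file can be a plain list at the top. A hypothetical opening for a references/api.md (section names invented):

```markdown
# API Reference

Contents:
- Authentication: token acquisition and refresh
- Pagination: cursor parameters and limits
- Error codes: retryable vs. fatal
- Rate limits: per-endpoint quotas
```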
Multi-variant skills
When a skill serves multiple variants (AWS / GCP / Azure; Python / TypeScript / Rust), organize by variant and let the SKILL.md router decide:
cloud-deploy/
├── SKILL.md          # workflow + selection rules
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
The agent reads only the variant it needs. Each additional variant costs one small reference file, not a branch through a monolithic SKILL.md.
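The router itself can be a short explicit block in the SKILL.md body. A hypothetical sketch for the layout above (trigger keywords invented):

```markdown
## Choosing a variant
- Mentions of S3, IAM, or CloudFormation: read references/aws.md
- Mentions of GKE, Cloud Run, or gcloud: read references/gcp.md
- Mentions of ARM templates or the az CLI: read references/azure.md

Read only the matching file; the others are irrelevant to this deployment.
```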
3. Writing Style
Four rules that matter more than they sound.
Prefer imperative. “Extract the error code from the stack trace” reads cleaner than “You should extract the error code.” Treat the agent like a competent colleague who just joined the project, not like a user you’re instructing.
Explain the why. LLMs have good theory of mind. Telling them “we track session tokens in sessionStore because
middleware X reads from it during request context setup” is more robust than “ALWAYS write session tokens to
sessionStore.” When the agent hits an edge case the skill didn’t anticipate, understanding the reason lets it
generalize; a bare rule just breaks.
Avoid all-caps MUSTs. If you’re writing ALWAYS or NEVER in all caps, or leaning on a rigid scaffold to force
behavior, that’s a yellow flag. Reframe with the reasoning and you’ll usually get the same outcome more reliably.
Generalize, don’t overfit. You’re writing for a million future invocations, not the three test prompts in front of you. When a test fails, resist the urge to add one more MUST clause targeting that specific case. Step back — what’s the class of issue this represents?
4. Description Tuning
The description field is the single most important line in a SKILL.md. It’s what the agent sees at L1 metadata load — the 100-token hook that decides whether the skill even gets considered for the current turn.
Agents undertrigger by default
In practice, agents tend not to use skills even when they would help. The model sees the description, judges “I can handle this directly with my base tools,” and skips the skill. Counter this by making the description slightly pushier than feels natural:
Weak:
How to build a dashboard to display internal data.
Better:
How to build a fast dashboard to display internal data. Use this skill whenever the user
mentions dashboards, data visualization, internal metrics, or wants to display company data —
even if they don't explicitly ask for a "dashboard."
The second version lists trigger phrases and nudges the agent toward activation. It’s not about being dishonest; it’s compensating for the agent’s bias toward skipping skills.
What makes a good description
- Specific trigger phrases — the actual words users type, not abstractions
- Scope signal — what’s in and what’s out, so the agent doesn’t fire on adjacent tasks
- Context markers — file types, domains, tools the user would mention
Evaluating trigger accuracy
If you want to tune the description rigorously, generate 20 realistic queries — 10 that should trigger and 10 that shouldn’t. The hard ones to write are the should-not-triggers: they need to be near-misses, sharing keywords with the skill but actually needing something else. “Write a fibonacci function” as a negative test for a PDF skill is too easy — it tests nothing.
Measure the trigger rate (run each query multiple times, since the model isn’t deterministic) and revise the description until it hits both axes: high on positives, low on negatives.
One caveat: simple one-step queries like “read this PDF” may not trigger a skill even with a perfect description — the agent can handle them directly. Your evaluation queries should be substantive enough that consulting a skill is genuinely worthwhile.
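Once trigger results are recorded, the two rates are simple averages. A minimal sketch in Python, assuming you have logged whether the skill triggered on each run (the queries and run results below are hypothetical):

```python
from statistics import mean

def trigger_rates(results: dict[str, list[bool]], positives: set[str]) -> tuple[float, float]:
    """Compute (positive trigger rate, negative false-trigger rate).

    results maps each query to one boolean per run: did the skill trigger?
    Each query is run several times because the model isn't deterministic.
    positives is the set of queries that SHOULD trigger.
    """
    pos = [mean(runs) for q, runs in results.items() if q in positives]
    neg = [mean(runs) for q, runs in results.items() if q not in positives]
    return mean(pos), mean(neg)

# Hypothetical recorded runs (3 runs per query) for a PDF skill:
results = {
    "extract tables from this PDF": [True, True, True],    # should trigger
    "summarize this PDF report":    [True, False, True],   # should trigger
    "write a fibonacci function":   [False, False, False], # should not (too-easy negative)
    "convert this scanned image":   [True, False, False],  # should not (near-miss)
}
positives = {"extract tables from this PDF", "summarize this PDF report"}
pos_rate, neg_rate = trigger_rates(results, positives)  # ≈ 0.83, ≈ 0.17
```

Revise the description, rerun, and watch the positive rate climb without the negative rate following it.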
5. Testing and Iteration
When to invest in assertions
Test cases pay off when the skill has an objectively verifiable output:
| Output type | Assertion style |
|---|---|
| File transforms (docx, xlsx, csv) | Structure, content, counts |
| Data extraction | Field-level equality |
| Code generation | Compilation + unit tests |
| Fixed workflow steps | Presence and order of steps in the output |
For subjective outputs — writing tone, design quality, explanation clarity — skip automated assertions and rely on qualitative review. Forcing assertions onto subjective work produces either trivial checks that pass regardless of quality, or brittle ones that fail on aesthetic variation.
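Concretely, assertions for objective outputs are plain equality and structure checks. A minimal sketch, with the extracted fields and cleaned CSV standing in for a hypothetical skill's output:

```python
import csv, io

# Hypothetical skill output: extracted invoice fields and a cleaned CSV.
extracted = {"invoice_id": "INV-1042", "total": 1299.00, "currency": "EUR"}
cleaned_csv = "sku,qty\nA-1,2\nB-7,1\n"

# Field-level equality for data extraction.
assert extracted["invoice_id"] == "INV-1042"
assert extracted["currency"] == "EUR"

# Structure and counts for a file transform.
rows = list(csv.reader(io.StringIO(cleaned_csv)))
assert rows[0] == ["sku", "qty"]   # header present
assert len(rows) == 3              # header + 2 data rows
```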
Running iterations
Each cycle:
- Run the skill against your test prompts, ideally with a baseline (no-skill) run alongside to see what the skill adds
- Review outputs — what’s off, what’s surprising, what’s missing
- Apply one or two targeted improvements, not a dozen
- Rerun and check whether the fix generalized or just patched the single failing case
Iteration heuristics
Read transcripts, not just final outputs. If the agent took eight turns where two would have sufficed, the skill is making it wander. Look for unproductive loops — they’re where the leanest wins hide.
Bundle repeated work into scripts/. If three independent test runs each wrote a similar helper script, the skill
should ship that script. Write it once, place it in scripts/, point the skill to it. Future runs skip the
re-derivation.
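For example, if three runs each re-derived a text-normalization helper, the skill could ship it once. A sketch of a hypothetical scripts/normalize_text.py (name and behavior invented for illustration):

```python
#!/usr/bin/env python3
"""Normalize a text file: strip the UTF-8 BOM, unify line endings.

Shipped in scripts/ so the agent reuses it instead of re-deriving it
on every run.
"""
import sys

def normalize(text: str) -> str:
    text = text.lstrip("\ufeff")  # strip UTF-8 byte-order mark
    return text.replace("\r\n", "\n").replace("\r", "\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    with open(path, encoding="utf-8") as f:
        content = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(normalize(content))
```

The SKILL.md body then says “run scripts/normalize_text.py on the file” instead of describing the transformation in prose.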
Keep the prompt lean. Cut instructions that aren’t earning their weight. More words don’t mean more reliability — they often mean more ways to misfire.
When feedback is terse, look harder. A grumpy one-line user comment (“this is wrong”) is a signal, not a summary. Get into their head, figure out what specifically failed, and fix the underlying pattern rather than the surface complaint.
6. Pattern Evolution
Iteration doesn’t just improve the content of a skill — it sometimes changes the shape. A Tool Wrapper that keeps re-deriving the same template is asking to become a Generator. A Generator whose outputs drift in step order is asking to become a Pipeline.
Two signals to watch for:
- Repeated structure across outputs → promote the structure into assets/template.*. The skill migrates from Tool Wrapper to Generator.
- Unenforced step ordering or validation → add scripts/ with gates and a stepwise SKILL.md body. The skill migrates to Pipeline.
The inverse is equally valid. If you added scripts/ once and never used them, delete them. Structure that isn’t
earning its maintenance cost should shrink. The point of iteration is not to climb the ladder — it’s to stop at the rung
that’s actually load-bearing.
Anti-Patterns
- Writing the skill and declaring victory. A skill you haven’t tested is a hypothesis. Run at least 2–3 realistic prompts before shipping.
- Overfitting to the test prompts. “IF user asks about Q4, THEN ...”-style rules patch the failing example and break generalization.
- All-caps MUST blocks. Strong-arming the model with capitalized imperatives usually does less than explaining the reason.
- Vague descriptions. A skill with an abstract description won’t trigger reliably, no matter how good the body is.
- Monolithic SKILL.md. If it’s over 500 lines with no structure, the agent loads all of it every time. Split depth into references/.
Related
- Overview — what Agent Skills are and how progressive disclosure works
- 5 Skill Design Patterns — the structural templates referenced above
- agentskills.io specification — the SKILL.md format
- Anthropic’s skill-creator — the source for much of this methodology