Writing Skills

The process of authoring, testing, and iterating on a SKILL.md — how to choose a pattern, write the description, and make it reliably trigger

Overview introduces what skills are. 5 Skill Design Patterns catalogs the shapes a skill can take. This page is about the process: how to actually write one, verify it works, and improve it without overfitting to the examples in front of you.


The Authoring Loop

Writing a skill that survives contact with real users is iterative. You draft a version, run it against realistic prompts, look at the outputs, and improve. Most of the leverage lives in the loop — the first draft rarely matters.

[Diagram: the authoring loop. ① Capture Intent (once per skill) → ② Draft (pick a pattern) → ③ Test (with-skill runs) → ④ Review (qualitative + metrics) → ⑤ Improve (generalize, don't overfit) → back to ③. Five activities shaping one artifact, the SKILL.md, through cycles of run, grade, and revise.]

Five activities shape one artifact:

  1. Capture intent — happens once per skill. What should the skill enable, when should it trigger, what’s the expected output?
  2. Draft — pick a pattern from design patterns and fill in the SKILL.md.
  3. Test — run 2–3 realistic prompts with the skill loaded, ideally alongside a baseline without it.
  4. Review — read the outputs qualitatively; track objective assertions quantitatively.
  5. Improve — generalize from what you learned without overfitting the specific examples, then loop back to step 3.

Keep each iteration cheap. A loop that takes minutes lets you try bold changes and throw them away; a loop that takes hours pressures you to commit early to a design that may not survive.


1. Capture Intent

Before writing anything, answer four questions:

  1. What should this skill enable the agent to do? State the capability, not the implementation.
  2. When should it trigger? What phrases, contexts, or file types? This becomes the description.
  3. What’s the expected output? A file format, a section structure, a field layout.
  4. Does the output benefit from automated tests? Objective outputs (file transforms, data extraction, code generation, fixed workflow steps) do. Subjective outputs (writing style, creative work) usually don’t — qualitative review works better there.

If the user is already mid-conversation and says “turn this into a skill,” extract answers from the conversation first — the tools used, the sequence of steps, the corrections they made. Ask only about the gaps. Confirm before proceeding.
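One lightweight way to record the answers is a short intent note kept alongside the draft. The skill and details below are invented purely for illustration:

```markdown
## Intent: pdf-form-filler (hypothetical example)

- Capability: fill form fields in a PDF from structured data
- Triggers: "fill this form", "complete the PDF", messages with *.pdf attachments
- Expected output: a filled copy of the PDF saved next to the original
- Automated tests? yes: field values are objectively checkable
```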


2. Drafting SKILL.md

Frontmatter

---
name: skill-name
description: One or two sentences covering both what the skill does AND when to use it.
---

Only name and description are required. Everything else is optional and rarely needed.

Shape selection. Pick a pattern from the 5 design patterns as a starting point. Most skills start as Tool Wrapper (instructions + references/) and graduate to heavier patterns only when iteration signals demand it — see Pattern Evolution below.

Body length

Keep the body under ~500 lines. If you’re approaching that, split depth into references/ and add a short index in SKILL.md pointing to where to read next. Large reference files (>300 lines) should include a table of contents so the agent can jump to the relevant section rather than loading the whole thing.

Multi-variant skills

When a skill serves multiple variants (AWS / GCP / Azure; Python / TypeScript / Rust), organize by variant and let the SKILL.md router decide:

cloud-deploy/
├── SKILL.md            # workflow + selection rules
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md

The agent reads only the variant it needs. Each additional variant costs one small reference file, not a branch through a monolithic SKILL.md.
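The router section of that SKILL.md might sketch out like this; the trigger terms mirror the tree above and are illustrative, not a fixed convention:

```markdown
## Selecting a variant

Read exactly one reference before doing any work:

- User mentions EC2, S3, IAM, or an AWS account: read `references/aws.md`
- User mentions GCE, Cloud Run, or a GCP project: read `references/gcp.md`
- User mentions Azure, ARM templates, or a resource group: read `references/azure.md`

If the target cloud is ambiguous, ask before reading any reference.
```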


3. Writing Style

Four rules that matter more than they might seem to.

Prefer imperative. “Extract the error code from the stack trace” reads cleaner than “You should extract the error code.” Treat the agent like a competent colleague who just joined the project, not like a user you’re instructing.

Explain the why. LLMs have good theory of mind. Telling them “we track session tokens in sessionStore because middleware X reads from it during request context setup” is more robust than “ALWAYS write session tokens to sessionStore.” When the agent hits an edge case the skill didn’t anticipate, understanding the reason lets it generalize; a bare rule just breaks.

Avoid all-caps MUSTs. If you’re writing ALWAYS or NEVER in all caps, or leaning on a rigid scaffold to force behavior, that’s a yellow flag. Reframe with the reasoning and you’ll usually get the same outcome more reliably.

Generalize, don’t overfit. You’re writing for a million future invocations, not the three test prompts in front of you. When a test fails, resist the urge to add one more MUST clause targeting that specific case. Step back — what’s the class of issue this represents?


4. Description Tuning

The description field is the single most important line in a SKILL.md. It’s what the agent sees at L1 metadata load — the 100-token hook that decides whether the skill even gets considered for the current turn.

Agents undertrigger by default

In practice, agents tend not to use skills even when they would help. The model sees the description, judges “I can handle this directly with my base tools,” and skips the skill. Counter this by making the description slightly pushier than feels natural:

Weak:

How to build a dashboard to display internal data.

Better:

How to build a fast dashboard to display internal data. Use this skill whenever the user
mentions dashboards, data visualization, internal metrics, or wants to display company data —
even if they don't explicitly ask for a "dashboard."

The second version lists trigger phrases and nudges the agent toward activation. It’s not about being dishonest; it’s about compensating for the agent’s bias toward skipping skills.

What makes a good description

  • Specific trigger phrases — the actual words users type, not abstractions
  • Scope signal — what’s in and what’s out, so the agent doesn’t fire on adjacent tasks
  • Context markers — file types, domains, tools the user would mention

Evaluating trigger accuracy

If you want to tune the description rigorously, generate 20 realistic queries — 10 that should trigger and 10 that shouldn’t. The hard ones to write are the should-not-triggers: they need to be near-misses, sharing keywords with the skill but actually needing something else. “Write a fibonacci function” as a negative test for a PDF skill is too easy — it tests nothing.

Measure the trigger rate (run each query multiple times, since the model isn’t deterministic) and revise the description until it hits both axes: high on positives, low on negatives.
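Scoring those runs is simple bookkeeping. A minimal sketch in Python, assuming you have already run each query several times and logged whether the skill activated; the results below are invented for illustration, not produced by a real harness:

```python
# Illustrative scoring of recorded trigger outcomes. run_results maps
# each evaluation query to the activation outcome of several runs; the
# data is invented, not output from a real agent.
def trigger_rate(outcomes):
    """Fraction of runs in which the skill activated."""
    return sum(outcomes) / len(outcomes)

def score_description(run_results, should_trigger):
    """Mean trigger rate over positive and negative queries separately."""
    pos = [trigger_rate(o) for q, o in run_results.items() if q in should_trigger]
    neg = [trigger_rate(o) for q, o in run_results.items() if q not in should_trigger]
    return sum(pos) / len(pos), sum(neg) / len(neg)

run_results = {
    "extract the tables from this quarterly PDF": [True, True, True, False],
    "summarize the attached PDF report":          [True, True, False, True],
    "write a fibonacci function":                 [False, False, False, False],
    "convert this PNG to grayscale":              [False, True, False, False],
}
should_trigger = {
    "extract the tables from this quarterly PDF",
    "summarize the attached PDF report",
}

pos_rate, neg_rate = score_description(run_results, should_trigger)
print(f"positive trigger rate: {pos_rate:.2f}")  # should be high
print(f"negative trigger rate: {neg_rate:.2f}")  # should be low
```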

One caveat: simple one-step queries like “read this PDF” may not trigger a skill even with a perfect description — the agent can handle them directly. Your evaluation queries should be substantive enough that consulting a skill is genuinely worthwhile.


5. Testing and Iteration

When to invest in assertions

Test cases pay off when the skill has an objectively verifiable output:

| Output type | Assertion style |
| --- | --- |
| File transforms (docx, xlsx, csv) | Structure, content, counts |
| Data extraction | Field-level equality |
| Code generation | Compilation + unit tests |
| Fixed workflow steps | Presence and order of steps in the output |

For subjective outputs — writing tone, design quality, explanation clarity — skip automated assertions and rely on qualitative review. Forcing assertions onto subjective work produces either trivial checks that pass regardless of quality, or brittle ones that fail on aesthetic variation.
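The first two objective assertion styles can be sketched in Python. The extracted record and csv text here are stand-ins for real skill output, and the expected values are what a hand-checked test fixture would pin down:

```python
# Sketches of two assertion styles for objective skill outputs.
# The extracted record and csv text stand in for real skill output.
import csv
import io

# Data extraction: field-level equality
extracted = {"invoice_id": "INV-1042", "total": "1,250.00", "currency": "USD"}
expected  = {"invoice_id": "INV-1042", "total": "1,250.00", "currency": "USD"}
for field, want in expected.items():
    assert extracted.get(field) == want, f"{field}: {extracted.get(field)!r} != {want!r}"

# File transform: structure and counts, not byte-for-byte equality
csv_output = "name,role\nada,engineer\ngrace,admiral\n"
rows = list(csv.reader(io.StringIO(csv_output)))
assert rows[0] == ["name", "role"]        # header present and ordered
assert len(rows) == 3                     # header plus two data rows
assert all(len(r) == 2 for r in rows)     # no ragged rows
print("all assertions passed")
```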

Running iterations

Each cycle:

  1. Run the skill against your test prompts, ideally with a baseline (no-skill) run alongside to see what the skill adds
  2. Review outputs — what’s off, what’s surprising, what’s missing
  3. Apply one or two targeted improvements, not a dozen
  4. Rerun and check whether the fix generalized or just patched the single failing case

Iteration heuristics

Read transcripts, not just final outputs. If the agent took eight turns where two would have sufficed, the skill is making it wander. Look for unproductive loops — they’re where the leanest wins hide.

Bundle repeated work into scripts/. If three independent test runs each wrote a similar helper script, the skill should ship that script. Write it once, place it in scripts/, point the skill to it. Future runs skip the re-derivation.
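As a hypothetical example of the kind of helper worth promoting into scripts/: a small frontmatter checker for SKILL.md files. The filename and checks are illustrative, not a fixed convention:

```python
# Hypothetical scripts/check_frontmatter.py: the kind of helper worth
# shipping once several test runs have each re-derived it.
def check_frontmatter(text):
    """Return a list of problems with a SKILL.md's frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing frontmatter block"]
    try:
        end = lines.index("---", 1)   # closing delimiter
    except ValueError:
        return ["unterminated frontmatter block"]
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    problems = []
    for required in ("name", "description"):
        if not fields.get(required):
            problems.append(f"missing or empty field: {required}")
    return problems

sample = """---
name: cloud-deploy
description: Deploy services to AWS, GCP, or Azure. Use for deployment requests.
---

# Cloud Deploy
"""
print(check_frontmatter(sample) or "frontmatter OK")  # → frontmatter OK
```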

Keep the prompt lean. Cut instructions that aren’t earning their weight. More words don’t mean more reliability — they often mean more ways to misfire.

When feedback is terse, look harder. A grumpy one-line user comment (“this is wrong”) is a signal, not a summary. Get into their head, figure out what specifically failed, and fix the underlying pattern rather than the surface complaint.


6. Pattern Evolution

Iteration doesn’t just improve the content of a skill — it sometimes changes the shape. A Tool Wrapper that keeps re-deriving the same template is asking to become a Generator. A Generator whose outputs drift in step order is asking to become a Pipeline.

[Diagram: pattern evolution through iteration. Let testing reveal when to upgrade the structure: Tool Wrapper (SKILL.md + references/, the starting point) → when 3/3 tests generate similar templates, extract to assets/ → Generator (+ assets/template, adds structure) → when steps need enforced order and validation, add scripts/ + gates → Pipeline (+ scripts/ + sequential gates, full orchestration). Patterns can also shrink: simplify when structure isn't earning its cost.]

Two signals to watch for:

  • Repeated structure across outputs → promote the structure into assets/template.*. The skill migrates from Tool Wrapper to Generator.
  • Unenforced step ordering or validation → add scripts/ with gates and a stepwise SKILL.md body. The skill migrates to Pipeline.

The inverse is equally valid. If you added scripts/ once and never used them, delete them. Structure that isn’t earning its maintenance cost should shrink. The point of iteration is not to climb the ladder — it’s to stop at the rung that’s actually load-bearing.


Anti-Patterns

  • Writing the skill and declaring victory. A skill you haven’t tested is a hypothesis. Run at least 2–3 realistic prompts before shipping.
  • Overfitting to the test prompts. IF user asks about Q4, THEN ... style rules patch the failing example and break generalization.
  • All-caps MUST blocks. Strong-arming the model with capitalized imperatives usually does less than explaining the reason.
  • Vague descriptions. A skill with an abstract description won’t trigger reliably, no matter how good the body is.
  • Monolithic SKILL.md. If it’s over 500 lines with no structure, the agent loads all of it every time. Split depth into references/.
