19 Jan 2026


SKILL.md best practices: How to craft reliable skills

SKILL.md best practices let you craft short, reliable agent skills that save tokens and avoid failures.

Good SKILL.md best practices keep skills short, clear, and reliable. Set a strict token budget. Explain when the agent should use the skill. Add a required checklist that the agent must follow. Include tiny, concrete examples and a recovery plan. Track results and prune what does not work. This simple pattern raises quality fast.

The agent skill ecosystem is exploding. Thousands of SKILL.md files now live on GitHub. Many promise speed and magic. Most deliver noise. The problem is not the format. The problem is weak craft. If you care about quality, you need SKILL.md best practices that protect the context window, guide agent behavior, and earn trust through tests and metrics. This article gives you a practical playbook to do that.

Why short, focused skills win

Large skills hurt model focus. Tokens are context budget. The model pays attention to every word you add. Big pages blur intent, raise cost, and slow runs. Small, sharp instructions do the opposite. They point the model at the next action and reduce mistakes.

Use a clear token budget

Set budgets per skill type. Keep them visible in your repo and enforce them in CI.
  • Workflow/process skill: 400–800 tokens (hard cap 1,000)
  • Tool wrapper/integration: 150–400 tokens
  • Reference snippet/definition: 80–250 tokens
  • Emergency/runbook skill: split into sub-skills; each under 800 tokens
  • If you exceed the budget, split the skill. Make a parent skill that decides when to call the smaller parts.
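One simple way to keep budgets visible is a small table in the skills repo. A minimal sketch, assuming a skills/README.md file (the file name is not prescribed here; the numbers just restate the ranges above):

```markdown
<!-- skills/README.md: one place to keep budgets visible -->
| Skill type                 | Token budget               |
|----------------------------|----------------------------|
| Workflow / process         | 400-800 (hard cap 1,000)   |
| Tool wrapper / integration | 150-400                    |
| Reference snippet          | 80-250                     |
| Emergency / runbook        | sub-skills, each under 800 |
```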

    Give each skill one job

    One skill should do one thing well. Do not mix setup, workflow, and debugging in one file.
  • State the goal in one sentence.
  • State when to use it in two or three bullets.
  • State what not to do if it helps avoid common traps.
  • Delete generic advice. Do not restate how to write code, commit, or ask for help, unless it is unique to this workflow.

    SKILL.md best practices for structure

    Use a predictable frame so the model recognizes parts at a glance and tools can parse it if needed. This layout works well:

    1) Header and when-to-use

    Start with a tight header block that answers three questions.
  • Goal: The outcome in 1 short line.
  • When to use: 2–5 bullets with clear triggers, like “Need to write a failing test first” or “Parsing CLI error X.”
  • Prerequisites: Tools, files, or secrets that must exist.
  • Avoid fluffy descriptions. Tell the agent the conditions that turn this skill on.
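As a rough sketch, a header block for a hypothetical test-first skill might look like this (the skill name, triggers, and paths are illustrative):

```markdown
# Skill: Write a failing test first
Goal: Produce one failing test that reproduces the reported bug.

When to use:
- The user reports a bug with a reproducible input.
- A fix is requested and no existing test covers the behavior.

Prerequisites: test runner installed, write access to tests/.
```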

    2) Required checklist at the top

    Force structure from the first read. Put a “required” block at the top and ask the agent to add steps to its task list. If your IDE agent has a Todo tool, call it by name.
  • Say “Add these steps to your Todo list” at the start.
  • List 4–8 steps. Each step must be one action that can be checked off.
  • Add small reminders inline, like “If stuck after three loops, switch to the debug skill.”
  • This pattern turns vague goals into clear execution and reduces agent drift.
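For example, a required block for the same hypothetical test-first skill could read (the step wording is illustrative):

```markdown
Required: add these steps to your Todo list before starting.
1. Reproduce the bug with the reported input.
2. Write one failing test that captures it.
3. Confirm the test fails for the right reason.
4. Write the minimal code to make it pass.
5. Re-run the full suite.
6. If still failing after 3 attempts, switch to the debug skill.
```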

    3) Guardrails and signals with hint tags

    Models respond well to visual anchors. Use lightweight “hint tags” to mark sections and intentions. Many agents learn from tags like:
  • a tag that wraps the required checklist
  • a tag for short, high-priority warnings
  • paired tags that mark contrastive good and bad examples
  • Keep each tag small. One to three lines. Do not wrap long essays in tags.
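The exact tag names are not fixed, so treat the ones below as placeholders; the point is the shape, short tags of one to three lines each:

```markdown
<required>
Add the checklist below to your Todo list before doing anything else.
</required>

<important>
Never commit directly to main. Work on a branch.
</important>

<good_example>git commit -m "fix: handle empty config file"</good_example>
<bad_example>git commit -m "changes"</bad_example>
```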

    4) Minimal, contrastive examples

    Examples beat prose. Show one tiny “good” example and one tiny “bad” example.
  • Good: A passing commit message. A correct CLI invocation. A minimal test that fails for the right reason.
  • Bad: A long, vague comment. A command with missing flags. A test that fails for the wrong reason.
  • Contrast teaches the model the boundary.
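A sketch of a contrastive pair for a CLI wrapper; deploy-cli and its flags are made up purely for illustration:

```markdown
Good (flags spelled out, runs once and exits):
    deploy-cli push --env staging --dry-run

Bad (environment flag missing, the tool prompts and the run hangs):
    deploy-cli push
```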

    5) Failure and recovery path

    Tell the agent what to do when things go wrong. Keep it short.
  • Retry with a smaller change.
  • Parse the last error line, not the whole log.
  • Switch to the “debug” skill after N failures.
  • Ask for missing prerequisites if a command fails with “not found.”
  • Define N (for example, “after 3 attempts”). Precision helps.
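Put together, a failure path can be four lines; the threshold of 3 follows the example above:

```markdown
If a step fails:
- Retry once with a smaller change.
- Read only the last error line, not the whole log.
- If the error is "command not found", stop and ask for the missing prerequisite.
- After 3 failed attempts, switch to the debug skill.
```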

    6) Output expectations

    End with what “done” looks like and how to report back.
  • State files changed and tests that must pass.
  • State the format of the final summary to the user, like “Summarize in 3 bullets and link the diff.”
  • Clear end states reduce looping.
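An end-state block might look like this; the paths and report format are illustrative:

```markdown
Done when:
- The new test passes and the full suite is green.
- Only files under src/ and tests/ changed.

Report: summarize in 3 bullets and link the diff.
```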

    Language patterns that models follow

    Style choices matter. Simple language gives better runs. Use these patterns:

    Write in first person and imperative

    Use “I” voice where the agent is the actor.
  • Say “I will add these steps to my Todo list.”
  • Say “I will run the tests and fix failures.”
  • This aligns the plan with the agent’s next actions.

    Prefer short sentences and verbs

    Use short lines and active verbs.
  • “Create a failing test.”
  • “Run the test.”
  • “Write the minimal code to pass.”
  • “Refactor and re-run tests.”
  • Avoid meta talk like “In this step, the model should consider.”
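Combining both patterns, the opening lines of a skill could read (the steps echo the bullets above):

```markdown
I will add these steps to my Todo list and work through them in order:
1. Create a failing test.
2. Run the test.
3. Write the minimal code to pass.
4. Refactor and re-run tests.
```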

    Do not lecture

    Replace long explanations with one example or a single reminder line. Each extra sentence costs attention.

    Integration hygiene and context economy

    Great content fails if integration is messy. Keep the plumbing clean so the agent can find and use skills the right way.

    Reference skills explicitly

    Add every skill to your Agents.md or host editor config. Include:
  • Full path to the SKILL.md file.
  • The “when to use” bullets (not a summary).
  • Any tool or environment hooks (Todo tool, run tool, test tool).
  • Models are more likely to fetch and follow skills when the index is clear.
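A sketch of one Agents.md entry, reusing the hypothetical test-first skill; note that the trigger bullets are copied verbatim rather than summarized:

```markdown
## write-failing-test
Path: {{skills_dir}}/write-failing-test/SKILL.md
When to use:
- The user reports a bug with a reproducible input.
- A fix is requested and no existing test covers the behavior.
Hooks: Todo tool, test runner
```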

    Make wrappers thin

    If a skill wraps a CLI or script, let the tool carry the load.
  • Keep the SKILL.md short. List the exact flags and one correct example.
  • Improve the tool’s error messages, not the skill text.
  • Better error messages guide the agent automatically on failure.
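A thin wrapper can be this small; release-notes is a hypothetical CLI, named only to show the shape:

```markdown
# Skill: Generate release notes (wrapper)
Run exactly:
    release-notes --since <last-tag> --format markdown
Example:
    release-notes --since v1.4.2 --format markdown
On error, read the first line of the tool's message; it names the missing flag.
```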

    Parameterize paths and secrets

    Use variables like {{repo_root}} or {{skills_dir}} instead of hard-coded paths. Never paste secrets or tokens into the skill.
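For instance (the script path is illustrative; the variables follow the naming above):

```markdown
Run the tests from the repo root:
    {{repo_root}}/scripts/run-tests.sh
If tests keep failing, read {{skills_dir}}/debug/SKILL.md.
Secrets come from the environment; never paste their values into this file.
```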

    Avoid redundant context

    Do not paste project docs, long API schemas, or entire runbooks into a single skill. Link or point to a smaller, purpose-built skill instead.

    Quality bar: how to test and iterate

    Treat skills like products. Write tests. Track numbers. Delete what does not perform.

    Build a simple test harness

    Create a small test repo of “golden” tasks that target each skill.
  • Have a failing test project for TDD skills.
  • Have a broken CLI config for wrapper skills.
  • Have a missing secret scenario for setup skills.
  • Run your agent on these tasks after each change and capture results.
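A golden-task repo can stay tiny; the layout below is one possible arrangement, not a prescribed one:

```markdown
golden-tasks/
- tdd-bugfix/         one failing test; exercises the test-first skill
- broken-cli-config/  bad flag in config; exercises the wrapper skill
- missing-secret/     absent API key; exercises the setup skill
Run the agent on each task after every skill change and record pass/fail.
```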

    Measure what matters

    Log and review core signals:
  • Success rate per skill (did the agent reach “done” state?).
  • Average tokens added by the skill.
  • Average run time and number of tool calls.
  • Loop count before success or handoff to debug skill.
  • If a skill is big and slow but not more successful, shrink it.

    Do A/B with self-play

    Keep two versions of a skill. Route half of runs to each and compare outcomes on the same tasks. Keep the winner. Repeat. This trims bloat without guesswork.

    Use negative tests

    Test that the agent does not pick the skill when it should not. Add a prompt that looks similar but is outside the “when to use” conditions. If the agent still calls the skill, tighten the triggers.
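A negative test can be a single prompt plus an expectation, for example (the skill name is reused from the earlier sketch):

```markdown
Negative test for write-failing-test:
Prompt: "Explain why this function is slow." (no bug report, no reproducible input)
Expected: the agent does not invoke the skill. If it does, tighten the triggers.
```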

    Curation, versioning, and trust

    Discovery today is noisy. You need a filter.

    Pin versions and write change logs

    Treat skills like dependencies.
  • Use semantic versions or commit hashes.
  • Document what changed and why in a short CHANGELOG in the skill’s folder.
  • Mark breaking changes and set a deprecation date for old versions.
  • This makes rollbacks safe.
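A CHANGELOG entry only needs a version, a one-line reason, and a flag for breaking changes; the entries below are a template, not real history:

```markdown
<!-- skills/write-failing-test/CHANGELOG.md -->
## 1.2.0
- Tightened "When to use": now requires a reproducible input.
## 1.1.0 (BREAKING)
- Reordered checklist steps; 1.0.x deprecated after <date>.
```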

    Score your skills

    Create a simple scorecard for each skill:
  • Adoption: How often does the agent pick it?
  • Effectiveness: Success rate on golden tasks.
  • Efficiency: Tokens added and average runtime.
  • Clarity: Number of lines and examples count.
  • Only publish skills that hit your bar. Archive the rest.
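A scorecard can be one row per skill; the table below is an empty template to fill from your own logs:

```markdown
<!-- skills/SCORECARD.md (template) -->
| Skill              | Adoption | Success rate | Tokens added | Avg runtime | Lines | Examples |
|--------------------|----------|--------------|--------------|-------------|-------|----------|
| write-failing-test |          |              |              |             |       |          |
```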

    Trust signals beat big lists

    A small, curated set will outperform a giant index of random skills. Prefer repos that show test data, version pins, and tight token budgets. Be wary of auto-generated skills with long prose and no examples.

    Common traps and how to fix them

    Most bad runs trace back to a few repeat mistakes.

    Giant essays in “description”

    Problem: The skill spends paragraphs telling backstory and theory. Fix: Replace with 2–5 trigger bullets under “When to use.”

    Hidden prerequisites

    Problem: The agent calls a tool that is not installed. Fix: List prerequisites at the top. If missing, ask the user to install or run a setup skill.

    Unbounded retries

    Problem: The agent loops forever. Fix: Add “If still failing after 3 attempts, switch to the debug skill.”

    Environment-specific commands

    Problem: Commands work only on one OS or shell. Fix: Provide a cross-platform command or list platform-specific alternatives in one line each.
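For example, a cleanup step can list one line per platform (standard shell commands, shown only to illustrate the one-line-per-platform format):

```markdown
Remove the build folder, one line per shell:
- bash/zsh:   rm -rf build
- PowerShell: Remove-Item -Recurse -Force build
- cmd.exe:    rmdir /s /q build
```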

    Tool hallucination

Problem: The agent invents a non-existent command. Fix: List the exact allowed commands and flags. Add a bad example that shows a made-up command.

    Step explosion

    Problem: The agent adds 20 tiny steps and loses context. Fix: Group small actions into 4–8 checkpoints in the required checklist.

    Wrong persona

    Problem: The agent “advises the user” instead of acting. Fix: Use first person and imperative language throughout.

    When a longer skill makes sense

    Long skills are rare but can be right for regulated flows or emergency runbooks. If you must go long:
  • Split into sub-skills and add a small router skill that decides which part to call.
  • Include a short outline at the top with jump markers.
  • Use clear gate checks between phases to stop drift.
  • Keep each example tiny; add more examples only if they cover different edge cases.
  • The goal is to keep focus and give the agent safe exits.
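A sketch of a router skill with an outline, jump markers, and gate checks; the incident-response phases are illustrative:

```markdown
# Skill: Incident response (router)
Outline: [#triage] then [#contain] then [#recover]. Call the matching sub-skill in order.

Gate checks:
- Leave #triage only when severity is assigned and an owner is paged; otherwise stop and ask.
- Leave #contain only when the blast radius is confirmed shrinking.

Sub-skills live in {{skills_dir}}/incident/: triage/, contain/, recover/ (each under 800 tokens).
```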

    From MCP servers to files: migration notes

    Many teams move from MCP servers to simple files because agents read files well and setup is easier. To migrate:
  • List each existing tool and map it to a thin wrapper skill with 1–3 examples.
  • Move long reference text into separate, smaller skills and link them.
  • Add a “debug” skill that centralizes failure handling across migrated skills.
  • Keep the server for heavy operations if needed, but keep the SKILL.md as the light front door.
  • Start small, test on golden tasks, and keep cutting tokens.

    Example checklists you can copy

    Use these quick checklists to raise quality right now.

    Before you publish a skill

  • Goal is one line and testable.
  • “When to use” has 2–5 clear triggers.
  • Required checklist has 4–8 steps and asks the agent to add them to its Todo tool.
  • One good and one bad example are present.
  • Failure path tells the agent when to switch to debugging.
  • Token count is within the budget for the skill type.

    Before you enable a skill for a team

  • Added to Agents.md with full path and trigger bullets.
  • Pinned version or commit hash.
  • Passes golden tasks with a published success rate.
  • Logged tokens and runtime look healthy.
  • Negative test proves the agent does not overuse it.
  • These habits will prevent pain later.

    Putting it all together

    Build small skills that act like checklists. Speak in first person. Use hint tags to mark priority and examples. Define a token budget and enforce it. Reference skills clearly in your agent config. Test with golden tasks, watch the numbers, and A/B better variants. Curate a small set that you trust. This is what strong SKILL.md best practices look like, and this is how you turn noise into reliable agent action.

    (Source: https://12gramsofcarbon.com/p/your-agent-skills-are-all-slop)


    FAQ

    Q: What is a SKILL.md file and how does it function for coding agents?
    A: A SKILL.md is a markdown file with persistent instructions for an agent, effectively an extended prompt the agent can call when it needs guidance. Good SKILL.md best practices emphasize keeping these files simple, readable, and directly actionable for agents.

    Q: What token budgets should I use for different types of skills?
    A: Set visible token budgets per skill type and enforce them in CI; recommended ranges are workflow/process skills 400–800 tokens (hard cap 1,000), tool wrappers 150–400 tokens, reference snippets 80–250 tokens, and emergency/runbooks split into sub-skills under 800 tokens each. If a skill exceeds its budget, split it and use a parent skill to route to the smaller parts.

    Q: How should I structure a SKILL.md for clarity and reliable triggering?
    A: Use a predictable frame: a tight header with goal, when-to-use bullets, and prerequisites, then a top-level checklist, lightweight hint tags, and one tiny good/bad example. These SKILL.md best practices help agents recognize sections quickly and reduce ambiguity.

    Q: What belongs in the required checklist and how many steps should it have?
    A: Put a required block at the top and tell the agent to add the steps to its Todo tool, listing 4–8 clear, checkable actions where each step is a single action. Include brief inline reminders such as "if stuck after three loops, switch to the debug skill" to avoid drift.

    Q: How should failure and recovery be handled inside a skill?
    A: Include a short failure and recovery path that instructs the agent to retry with smaller changes, parse the last error line, and switch to a debug or sub-skill after a defined number of attempts; the article suggests defining N, for example "after 3 attempts." Clear recovery rules are a core element of SKILL.md best practices and prevent unbounded loops or hallucinations.

    Q: What language and style choices improve agent behavior in skills?
    A: Write in first person and imperative voice with short sentences and active verbs, for example "I will add these steps to my Todo list" and "Create a failing test." Avoid long explanations or lectures and prefer a single compact example or reminder to conserve context.

    Q: How do I test, measure, and iterate on a skill to ensure quality?
    A: Treat skills like products by building a simple test harness of golden tasks (failing test projects, broken CLI configs, missing-secret scenarios) and log signals like success rate, tokens added, run time, and loop count. Use A/B self-play to compare variants and negative tests to ensure the agent does not call the skill outside its triggers.

    Q: How should teams manage discovery, versioning, and curation of skills?
    A: Pin skill versions or commit hashes, keep a short CHANGELOG, and score skills on adoption, effectiveness, efficiency, and clarity before publishing; a small curated repo with test data and tight token budgets beats giant uncurated lists. Also reference every skill explicitly in Agents.md with the full path and trigger bullets so agents can find and trust them.
