Claude Skills Architecture: Dynamic Capability Injection for LLM Agents
In production LLM deployments, the tension between capability and context length is acute. Add tool definitions to your system prompt, and you burn tokens on every inference. Build a monolithic agent that handles 50 different tasks, and the model trains its attention on irrelevant pathways, degrading performance on core work. Anthropic’s Claude Skills system, launched in 2025, offers a third way: narrow, on-demand capabilities loaded into the agent’s context only when relevant.
This post deconstructs the architecture beneath Claude Skills — how skill discovery works, what makes them different from MCP servers or retrieval-augmented generation (RAG), and the design patterns that separate elegant skill orchestration from capability bloat. If you’re building production agents that need to scale across multiple domains without sacrificing latency or token efficiency, this reference will anchor your decisions.
What Claude Skills Are
Claude Skills are self-contained capability packages that inject narrowly scoped expertise into an LLM agent’s context on demand. Each skill lives in a folder containing a SKILL.md manifest (which defines a brief description and the full capability body), scripts or reference files, and optional companion tooling. The critical innovation: only the skill’s short description lives in the system prompt. When the agent’s task matches that description — either via keyword, embedding similarity, or explicit routing — the full SKILL.md body loads into the context, expanding the agent’s capabilities without pre-baking them into every inference.
Think of it as lazy-loading expertise. Contrast this with three existing patterns:
System prompts: Write all your agent directives into a single massive prompt file. Every inference pays the cost of that entire context, even if the user only needs a small fraction. This works until you have 20+ distinct capabilities — then you’ve hit token limits or incurred unacceptable latency.
MCP servers: The Model Context Protocol (defined by Anthropic) formalizes tool discovery and invocation. MCP tools are available but stateless — they’re summoned during inference via explicit function calls. MCP is excellent for integrations (Slack, GitHub, databases) where you’re calling external systems, but it doesn’t help compress the behavioral knowledge the model needs to use those tools correctly. You still embed coaching, patterns, and decision logic into your system prompt.
RAG (retrieval-augmented generation): Fetch documents at query time and inject them into context. RAG excels at factual retrieval (product docs, knowledge bases) but is expensive for behavioral patterns. Retrieving a 10,000-token skill definition on every inference duplicates work and burns context budget.
Skills split the difference: they are behavioral programs in Markdown that get injected based on relevance signals (matching the user's task description), rather than explicit function calls or full-document retrieval. A skill for “code review” lives dormant in a folder. When you ask the agent to review a pull request, the skill description triggers, the full body loads, and the agent executes the review workflow — all without adding that payload to every other task.
Anthropic introduced this pattern in mid-2025 as demand grew for agents that could operate across many domains (research, code, analysis, writing, planning) without bloating the system context. The architecture draws from workflow automation tools like Slack and Notion, CI/CD pipelines, and orchestration systems — but adapted for LLM behavior rather than event triggers.
Reference Architecture
Claude Skills architecture consists of five key layers: discovery, progressive disclosure, skill anatomy, orchestration, and deployment. Let’s walk the flow.

Skill discovery begins with the system prompt. Rather than listing 20 full capability descriptions, you embed a lightweight skills registry — just the skill names and their one-line purposes. When a user submits a task, the agent matches it against this registry. The matching logic can vary: keyword overlap (“I need code review” matches the code-review skill), embedding similarity (semantic closeness to the skill description), or explicit routing (the agent infers which skill applies). Some systems use a dedicated router model (a small classifier trained to map tasks to skills), while simpler deployments rely on the agent’s own reasoning.
Once matched, the skill’s full body — stored in SKILL.md — loads into the context. This is the progressive disclosure step. At the moment a skill becomes relevant, you unfold its complete definition, examples, and behavioral instructions. The subsequent inference then executes within that enriched context.

The skill anatomy — what lives inside a skill folder — follows a standard shape:
- SKILL.md: The Markdown manifest. Contains the skill description (used for discovery), the full capability body (loaded on match), success criteria, examples, edge cases, and internal links to companion resources.
- Companion scripts or files: Python, shell, or JSON files that the skill references. A code-review skill might include a `lint-checklist.yaml` or a `security-patterns.json`. These are not executed by the skill itself but are read and reasoned about by the agent.
- References and links: Internal pointers to related skills or external documentation. A “code-review” skill might reference an “architecture-review” skill for decisions that require design-level judgment.
An exemplary SKILL.md structure:
```markdown
# Code Review Skill

## Description
Conduct security, performance, and correctness reviews of code changes.

## Success Criteria
1. All security rules from `security-patterns.json` are checked.
2. N+1 query antipatterns are surfaced with recommendations.
3. Response is structured: summary → findings → suggested changes.

## Skill Body
[150-400 words of detailed review methodology, decision trees, tone guidelines, template structure]

## Examples
**Example 1: Reviewing a database query change**
[Real example and model output]

**Example 2: Catching a missing authentication check**
[Another worked example]

## References
- See [[architecture-review]] skill for design-level decisions.
- Refer to `security-patterns.json` for the authoritative rule set.
```
Skills can reference each other — a “security-review” skill might invoke an “architecture-review” skill when structural decisions surface. This creates a skill orchestration pattern: some skills are leaves (they execute standalone), while others are meta-skills that coordinate child skills.

A meta-skill for “incident response” might orchestrate child skills for triage, communication, remediation, and postmortem writing. The agent doesn’t call these children explicitly via function calls; rather, the incident-response skill’s body contains branching logic that references when to “hand off to the triage skill” or “invoke the postmortem writer.” On the next inference turn, the matching engine re-evaluates: does the task still map to incident-response, or has it narrowed to a child skill? This creates a natural hierarchical execution flow without needing an external orchestrator.
The deployment pipeline for skills mirrors the structure you’d use for feature flagging or configuration management in traditional software:

A skills registry (a YAML or JSON manifest) lists all available skills, their versions, and a maturity level (experimental, alpha, stable, deprecated). When you ship a new skill to production, it’s pinned to a specific version. Evaluation gates (automated tests that verify a skill’s output against known examples) run before a skill is promoted from experimental to stable. This mirrors code review for traditional software — you don’t ship a capability without validating it.
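A minimal version of that registry-plus-gate flow might look like the sketch below. The field names, maturity ladder, and promotion rule are assumptions modeled on the description above, not a standard schema.

```python
# Illustrative skills registry and evaluation gate. Field names and the
# maturity ladder are assumptions, not a published schema.
REGISTRY = {
    "code-review": {"version": "2.1.0", "maturity": "stable"},
    "migration-review": {"version": "0.3.0", "maturity": "experimental"},
}

MATURITY_LADDER = ["experimental", "alpha", "stable", "deprecated"]


def promote(name: str, eval_passed: bool) -> str:
    """Evaluation gate: a skill climbs one rung of the maturity ladder
    only if its automated checks against golden examples passed."""
    entry = REGISTRY[name]
    if not eval_passed:
        raise ValueError(f"{name} failed evaluation; staying at {entry['maturity']}")
    idx = MATURITY_LADDER.index(entry["maturity"])
    if idx < MATURITY_LADDER.index("stable"):
        entry["maturity"] = MATURITY_LADDER[idx + 1]
    return entry["maturity"]
```

Promoting `migration-review` twice with passing evaluations walks it from experimental through alpha to stable; a third promotion is a no-op, since only an explicit deprecation should move a skill past stable.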
Versioning becomes critical when multiple teams own different skills. A centralized registry ensures that agent A can use skill-code-review-v2.1 while agent B stays pinned to the previous version until it migrates; once both teams are ready, you coordinate the upgrade. This is especially important in regulated settings (finance, healthcare) where an audit trail of skill changes is mandatory.
Design Patterns and the Skill Anatomy
Single-Responsibility Skills
The first instinct when building skills is to combine related capabilities into one. Resist it. A skill for “research and write and publish” is harder to test, slower to load, and fragile — if one of those steps has a bug, the whole skill becomes suspect. Better: three skills that each excel at their domain and orchestrate via references.
A single-responsibility skill is focused enough that its success criteria are unambiguous. Does this code-review skill achieve its purpose? Ship it. Does it need to also validate database migrations? No — that’s a separate concern. Create a migration-review skill and link them.
The metric: if you can’t describe the skill’s success criteria in two sentences, it’s too broad.
Skill-Orchestration Skills and Meta-Skills
Some tasks are naturally hierarchical. An “incident response” workflow involves triage (is this critical?), communication (who needs to know?), remediation (what’s the fix?), and postmortem (what failed and why?). These aren’t subtasks the agent calls explicitly; they’re sequential steps within a single skill’s logic.
A meta-skill contains branching logic that evaluates the current state and routes to child skills. The incident-response skill might say: “First, invoke triage logic to determine severity. If severity is critical, activate communication cascades and remediation. Once resolved, hand off to postmortem writing.” The agent reads this, applies triage logic, and if critical, it then recognizes that the next task (activation of communication) maps to a “notify-stakeholders” child skill, causing that skill to load on the next turn.
This pattern is powerful because the meta-skill’s behavior lives in one place (the incident-response SKILL.md), while each implementation detail lives in its child skill. Changes to how you notify stakeholders don’t ripple into incident-response logic.
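The turn-by-turn hand-off can be sketched as a small routing check: each turn, the matcher asks whether the task has narrowed to one of the active meta-skill's children. The skill names and string-matching logic here are hypothetical stand-ins for a real matching engine.

```python
# Sketch of hierarchical hand-off: the meta-skill's body names child
# skills, and the matcher re-evaluates on every turn. Skill names and
# the substring check are hypothetical.
CHILDREN = {
    "incident-response": ["triage", "notify-stakeholders",
                          "remediation", "postmortem"],
}


def route(task: str, active_skill: str) -> str:
    """Return the child skill the task has narrowed to, or keep the
    current meta-skill active if no child matches."""
    for child in CHILDREN.get(active_skill, []):
        if child.replace("-", " ") in task.lower():
            return child
    return active_skill
```

With this sketch, “run triage to determine severity” routes to the `triage` child on the next turn, while a task that still reads as general incident work stays with the meta-skill.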
Skill vs Tool vs Subagent Decision Rubric
When is something a skill, a tool invocation via MCP, or a spawned subagent?
Use a Skill when:
– The capability is behavioral (a workflow, decision logic, or pattern).
– It’s used across multiple tasks or agents and benefits from sharing.
– It’s lightweight (< 5 KB of guidance).
– It doesn’t need external state or live integrations.
Examples: code review, proposal writing, security analysis, meeting-note synthesis.
Use an MCP Tool when:
– You’re integrating with an external system (Slack, GitHub, a database, calendar).
– The operation is stateless and transactional (fetch user, update record).
– The tool’s behavior is narrow and fixed (no nuance or agent reasoning).
Examples: fetch a Jira ticket, list GitHub PRs, insert a database record, send a Slack message.
Use a Subagent when:
– The task is complex enough to warrant its own context and potentially its own tool set.
– You need to isolate the task’s reasoning from the parent agent’s chain of thought.
– Failures should be recoverable without rolling back the parent agent’s state.
Examples: spawning a research agent to investigate a question, or a code-generation agent to produce a library.
Many systems conflate these. A common mistake: embedding MCP tool definitions into the system prompt (bloating it) rather than calling tools at execution time and letting the agent reason about the response. Another: building a mega-tool that should be a skill (decision logic belongs in natural language, not JSON schemas).
Trade-offs and Gotchas
Skill description quality is the linchpin. If your one-line description doesn’t clearly signal when to load the skill, the matching step fails. “A useful skill for many things” will never match. “Conduct security and performance reviews of code changes” is actionable. Invest in writing descriptions that a human could use to decide whether this skill applies to their task.
Context budget remains finite. Loading five skills at once might still exceed your context window. Prioritize: what’s the minimum set needed for this inference? Some systems use a secondary ranking step — after matching all potentially relevant skills, rank them by relevance score and load only the top three. This trades completeness for latency.
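The secondary ranking step is simple to express; the relevance scores below are illustrative, standing in for whatever similarity metric the matcher produces.

```python
# Secondary ranking sketch: after matching, load only the top-k skills
# by relevance score to respect the context budget. Scores are made up.
def select_skills(candidates: list, k: int = 3) -> list:
    """Rank (name, score) candidates by score and keep the top k."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [name for name, _ in ranked[:k]]


matched = [("code-review", 0.91), ("security-review", 0.84),
           ("performance-review", 0.62), ("docs-writer", 0.31)]
loaded = select_skills(matched, k=3)
```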
Skill versioning can become complex in distributed systems. If you update a skill’s behavior, do all agents immediately use the new version, or do they stay pinned to the old version until explicitly migrated? Most teams pin versions in the skills registry and gate migration behind a feature flag or approval process. Uncoordinated skill updates can cause subtle regressions if the new version changes tone, structure, or decision logic.
Skill conflicts arise when two skills claim the same domain. A “code-review-security” skill and a “code-review-performance” skill might both match a user’s request to review code. The resolution is explicit: either the skills coordinate (security skill references performance skill and vice versa), or the system uses a secondary ranking heuristic to pick the best match. Without explicit conflict resolution, you risk loading the wrong skill or both (wasting context).
Evaluation challenges: Testing a skill is harder than testing a function. The skill’s output depends on the model, the prompt phrasing, the temperature setting, and the input context. A code-review skill that worked perfectly in testing might produce off-topic responses in a live system when invoked alongside three other skills. Robust evaluation requires golden examples (worked code reviews) and automated checks (does the output follow the template? does it cite the security rules?). Some teams use a confidence threshold: if the skill’s output doesn’t match a rubric, the agent rejects it and falls back to a default behavior.
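An automated template-and-citation check is one concrete form those evaluation gates can take. The section names and the required citation below are assumptions drawn from the SKILL.md example earlier, not a standard rubric.

```python
# Minimal automated rubric check: does the skill's output follow the
# expected template and cite the rule set? Section names are assumptions
# based on the code-review skill's success criteria.
REQUIRED_SECTIONS = ["## Summary", "## Findings", "## Suggested Changes"]


def passes_rubric(output: str) -> bool:
    """Structure check plus a required citation of the rule file."""
    has_sections = all(s in output for s in REQUIRED_SECTIONS)
    cites_rules = "security-patterns.json" in output
    return has_sections and cites_rules


good = ("## Summary\n...\n"
        "## Findings\nPer security-patterns.json rule S3, ...\n"
        "## Suggested Changes\n...")
```

A failing rubric check is the trigger for the fallback behavior described above: reject the skill's output and revert to a default response.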
Practical Recommendations
Start narrow. Ship a skill for one clearly bounded task. Code review. Meeting-note synthesis. A single research workflow. Validate that the matching logic is reliable and the skill executes well before orchestrating multiple skills.
Version and audit. Tag every skill with a semantic version (1.0.0, 1.1.0, 2.0.0). Log which skill version was invoked for each task. In regulated settings, this audit trail is non-negotiable.
Test the matching logic independently. Don’t wait until full inference to discover that your skill descriptions are ambiguous. Pre-evaluate: does a task map to the right skill? Run a small test suite of user prompts and verify that the matching engine is accurate.
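Such a pre-inference test suite can be a handful of prompt-to-skill expectations run against the matcher in isolation. The matcher below is a hypothetical keyword stand-in; in practice you would call your deployment's actual matching engine.

```python
# Sketch of testing the matching step in isolation: a small suite of
# user prompts with expected skill names. match_skill is a hypothetical
# stand-in for the real matcher (keyword, embedding, or router model).
EXPECTED = {
    "review this pull request for security bugs": "code-review",
    "turn this transcript into meeting notes": "meeting-notes",
}


def match_skill(prompt: str) -> str:
    # Toy keyword matcher, for illustration only.
    if "review" in prompt or "security" in prompt:
        return "code-review"
    if "notes" in prompt or "transcript" in prompt:
        return "meeting-notes"
    return "default"


def matching_accuracy() -> float:
    """Fraction of test prompts routed to the expected skill."""
    hits = sum(match_skill(p) == s for p, s in EXPECTED.items())
    return hits / len(EXPECTED)
```

Tracking this accuracy over time also tells you when a newly added skill's description starts colliding with an existing one.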
Limit orchestration depth. A meta-skill can reference child skills, but avoid five levels of nesting. Each level of orchestration adds inference steps and complexity. Keep the tree shallow (meta-skill → 2-3 child skills).
Use skill references, not full copies. If two skills need the same decision logic (e.g., a checklist of security rules), don’t duplicate the checklist in both skills. Create a standalone reference file and link it from both skills. Changes to the checklist then flow to both skills automatically.
Monitor skill performance. Track metrics: does code-review skill produce outputs that developers actually use? How often does it match the wrong task? Use this data to refine skill descriptions and success criteria.
FAQ
What are Claude Skills?
Claude Skills are self-contained, narrowly-scoped capability modules that dynamically load into an LLM agent’s context when relevant. Unlike system prompts (which are always loaded), skills use lazy-loading: the full skill body only expands when the agent’s task matches the skill’s description. This reduces token overhead while scaling capabilities.
How are Claude Skills different from MCP tools?
MCP tools are for external integrations (Slack, GitHub, databases) and are called explicitly during inference. Skills are for internal behavioral patterns and load implicitly based on task matching. MCP tools excel at doing things (fetching data, updating records); skills excel at reasoning (workflows, decision logic, patterns).
Can you combine Claude Skills with RAG?
Yes. A skill might reference a retrieved document within its execution logic. For example, a “proposal-writer” skill could retrieve company templates or previous proposals via RAG and use them as examples. The distinction: RAG retrieves factual data at query time; skills define how to use that data. They’re complementary, not competing.
How do you version Claude Skills in production?
Store skills in a registry with semantic versions (1.0.0, 1.1.0). Agents pin to a specific version when they load a skill. Use feature flags or gradual rollouts to migrate agents to new versions. This is similar to how you’d version API endpoints or software libraries.
Are Claude Skills open source?
Anthropic released an open-source skills repository (https://github.com/anthropics/skills) with examples and templates. The skills system itself is built into Claude’s inference stack, so any developer can create and deploy skills. There’s no central registry requirement — you can run skills locally, host them in a private registry, or publish them to Anthropic’s shared hub.
Further Reading
Explore the architecture and design patterns of related systems:
- Anthropic Model Context Protocol (MCP) Architecture — how MCP formalizes tool discovery and invocation, complementary to skills for external integrations.
- AI Agent Memory Systems for Long-Term Architectures — persistent memory patterns for agents that learn and adapt across sessions, often working in tandem with skills.
- Claude Computer Use Architecture for LLM Desktop Agents — how Claude orchestrates UI automation, a use case where skills coordinate tool invocations.
- Agentic RAG Architecture Patterns — how retrieval systems and agents interact; skills often call or coordinate RAG pipelines.
- GraphRAG: Knowledge Graph Retrieval-Augmented Generation Architecture — structured knowledge retrieval that complements skill-based reasoning.