AI Agent Skills Are Manuals. Package Managers Own Execution.

The fifth tool in an agent system is easy. The fifteenth is noisy. The fiftieth is a design problem.

At small scale, most teams get away with a flat list of tool definitions. They register every tool, pass every schema into the model, and let the agent figure it out. That works well enough when the list is short. It stops working cleanly when the list gets long enough that most of the contracts in context are irrelevant to the task at hand.

That is not only a memory problem. It is an attention problem.

Every unused tool definition the model has to carry is another contract it has to reason around. Every extra schema is another chance to pick the wrong tool, hallucinate an argument, or confuse one capability for another. The Berkeley Function Calling Leaderboard treats tool selection and execution across multiple turns as a central part of real agent behavior, not a side detail.[1]

The same pattern shows up in broader function-calling benchmarks. As context gets more crowded and tool definitions get more complex, model performance degrades in measurable ways. The issue is not simply that context windows have limits. The issue is that loading unnecessary contracts into every request makes the model spend effort on decisions it did not need to make in the first place.[6]

No software team would import * from every library at startup and call it good architecture. That is effectively what many agent systems do today with tools.

AI Agent Skills Are the Guidance Layer, Not the Execution Layer

A Skill is best understood as a focused manual. It tells an agent when to use a capability, what a good invocation looks like, what workflow patterns matter, and what failure modes to watch for. It is a guidance layer, not an execution layer.

Claude's Skills model is a good example of the right direction. It uses progressive disclosure so the system does not have to carry every possible detail about every capability all the time. A Skill becomes the first high-signal layer. Deeper details only need to show up when the relevant capability is actually in play.[4][5]

A good Skill should contain:

trigger conditions
workflow patterns
short examples
escalation notes
a pointer to execution

A good Skill should not try to carry:

the full JSON Schema for the tool
version information
runtime dependencies
environment variable handling
execution logic

A well-written runbook does not include the source code of the system it describes. It explains what to do and where to run it. Skills should work the same way.

Skills reduce context pressure only if they stay narrow. Once they start dragging execution details along with them, they stop behaving like manuals and start behaving like brittle wrappers.
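As a rough illustration of a narrow Skill combined with progressive disclosure, the sketch below keeps a one-line summary of every Skill in context and expands the full manual only for the Skills that are active. The `Skill` class, its field names, and `build_context` are hypothetical names chosen for this example; they are not Claude's or AgentPM's actual format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """A guidance-only document: no schema, no version, no runtime details."""
    name: str
    trigger: str                                   # when the capability is relevant
    workflow: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    escalation: str = ""
    run: str = ""                                  # pointer to execution, not the logic

    def summary(self) -> str:
        """One line that is cheap enough to keep in context all the time."""
        return f"{self.name}: {self.trigger}"

    def full_text(self) -> str:
        """The full manual, rendered only when the Skill is selected."""
        lines = [f"# {self.name}", f"Use when: {self.trigger}", "Workflow:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.workflow, 1)]
        lines += [f"Example: {ex}" for ex in self.examples]
        if self.escalation:
            lines.append(f"Escalate: {self.escalation}")
        lines.append(f"Run: {self.run}")
        return "\n".join(lines)


def build_context(skills: list[Skill], active: set[str]) -> str:
    """Summaries for every Skill, full text only for the active ones."""
    parts = [s.summary() for s in skills]
    parts += [s.full_text() for s in skills if s.name in active]
    return "\n\n".join(parts)


summarize = Skill(
    name="summarize-page",
    trigger="the user asks for a summary of a URL",
    workflow=["confirm the URL", "run the tool", "report the summary"],
    examples=["Summarize https://example.com in three bullets"],
    escalation="ask the user if the page requires a login",
    run="agentpm run @namespace/tool-name",
)
print(build_context([summarize], active=set()))
```

With an empty `active` set, only the summary line reaches the model. The workflow, examples, and run pointer appear once the Skill is actually in play.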

How Agent Tools Accumulate Execution Debt

The failure mode is easy to miss at first.

The first version of a Skill often looks harmless. It includes a command, a few arguments, maybe a copy of the expected input shape, maybe a note about an environment variable, maybe a version assumption. None of those choices look dangerous in isolation.

Together they become execution debt.

A Skill that owns execution eventually accumulates:

version references that drift from what is installed
hardcoded commands that stop matching the real runtime
copied schemas that fall out of sync with the tool contract
environment assumptions that do not travel across machines or teams

At that point the team is maintaining two contracts per tool:

the tool's actual contract
the Skill's embedded understanding of that contract

That duplication gets more expensive as the ecosystem grows.
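To make the cost of two contracts concrete, here is a minimal sketch that compares a schema copied into a Skill against the schema in the tool's manifest. The function name, the sample property names, and the assumption that both sides are JSON-Schema-style dictionaries with `properties` and `required` keys are illustrative only.

```python
def schema_drift(skill_copy: dict, manifest_schema: dict) -> list[str]:
    """Report where a Skill's embedded schema disagrees with the tool contract."""
    copied = skill_copy.get("properties", {})
    actual = manifest_schema.get("properties", {})
    problems = []
    for name in sorted(actual.keys() - copied.keys()):
        problems.append(f"missing from Skill: {name}")
    for name in sorted(copied.keys() - actual.keys()):
        problems.append(f"no longer in contract: {name}")
    for name in sorted(copied.keys() & actual.keys()):
        if copied[name] != actual[name]:
            problems.append(f"definition changed: {name}")
    if set(skill_copy.get("required", [])) != set(manifest_schema.get("required", [])):
        problems.append("required fields differ")
    return problems


# Hypothetical example: the tool renamed one argument and tightened another.
skill_copy = {
    "properties": {"url": {"type": "string"}, "timeout": {"type": "integer"}},
    "required": ["url"],
}
manifest_schema = {
    "properties": {
        "url": {"type": "string", "format": "uri"},
        "timeout_ms": {"type": "integer"},
    },
    "required": ["url"],
}
for problem in schema_drift(skill_copy, manifest_schema):
    print(problem)
```

A check like this can catch drift, but it only exists because the Skill kept a copy in the first place. Dropping the copy removes the need for the check.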

This is exactly the wrong moment for agent systems to normalize that pattern. MCP adoption alone has exploded. Public MCP server counts grew from roughly 100 at launch to more than 4,000 by May 2025 and more than 10,000 by early 2026, with SDK downloads rising into the tens of millions per month.[2][3] The number of capabilities available to agents is increasing faster than most teams' discipline around packaging them.

If a Skill hardcodes python3 tool.py, it breaks the moment the tool becomes a bundled Node artifact. If a Skill carries its own input validation rules, it becomes wrong the moment the manifest changes. If it assumes a version, it becomes stale the moment the installed package moves ahead.

Software engineering solved this class of problem a long time ago. Package managers took responsibility for versioned artifacts, dependency resolution, and repeatable execution. Agent systems are hitting the same boundary now, only with tools, prompts, runtimes, and workflow surfaces instead of shared libraries.

Separating AI Agent Skills from Tool Execution

The durable pattern looks like this:

Agent loads Skill
        ↓
Skill explains when and how to use a capability
        ↓
Skill delegates execution to a universal runner
        ↓
Runner resolves installed package, version, runtime, and environment
        ↓
Packaged tool runs consistently

Three concerns belong in three different layers:

discovery: what capabilities exist and which one is relevant now
guidance: when and how to use a capability
execution: what artifact runs, which version, with which runtime and environment

Discovery belongs to the agent system. Guidance belongs to the Skill. Execution belongs to the packaging layer.

When a Skill delegates to agentpm run @namespace/tool-name, it stops accumulating execution debt. The runner reads the installed manifest, resolves the locked version, applies environment defaults, verifies the runtime, and executes the packaged artifact the same way every time. The Skill does not need to know whether the tool is Python or Node. It does not need to know whether the runtime changed in the last release. It does not need to keep a second copy of the schema.
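The resolution steps described above can be sketched roughly as follows. This is not AgentPM's implementation: the lockfile and manifest shapes, the `runtime`, `entrypoint`, and `env_defaults` keys, and the `resolve` function are all assumptions made for illustration.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolved:
    command: list[str]
    env: dict[str, str]
    version: str


def resolve(package, lock, manifests, caller_env, which=shutil.which) -> Resolved:
    """Turn a package name into a concrete command; the Skill supplies only the name."""
    version = lock[package]                    # locked version, not the Skill's guess
    manifest = manifests[(package, version)]   # installed manifest for that version
    runtime = manifest["runtime"]              # for example "python3" or "node"
    if which(runtime) is None:                 # verify the runtime before running
        raise RuntimeError(f"{package}@{version} needs {runtime}, which is not on PATH")
    # Manifest defaults apply first; values from the caller override them.
    env = {**manifest.get("env_defaults", {}), **caller_env}
    return Resolved([runtime, manifest["entrypoint"]], env, version)


# Hypothetical installed state: the tool moved from Python to a bundled Node artifact.
lock = {"@namespace/tool-name": "2.0.0"}
manifests = {
    ("@namespace/tool-name", "1.4.0"): {"runtime": "python3", "entrypoint": "tool.py"},
    ("@namespace/tool-name", "2.0.0"): {
        "runtime": "node",
        "entrypoint": "dist/index.js",
        "env_defaults": {"LOG_LEVEL": "info"},
    },
}
```

In this sketch the Skill text never changes across the 1.4.0 to 2.0.0 upgrade. Only the lock entry and the manifest do, and the runner picks up the difference.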

The Skill stays stable because it remains a document. The tool contract stays stable because it remains a versioned artifact. Neither has to impersonate the other.

Skills are manuals.
Package managers own execution.

A system with five tools can often afford a flat list of tool definitions. A system with fifty needs on-demand guidance and one canonical execution boundary. The more capabilities you expose, the more expensive it becomes to keep every contract active in context at once.

agentpm export --skill generates the scaffold for this pattern. The exported Skill is not trying to become the tool. It gives the workflow author a structured place to describe when a capability is relevant, what a good invocation looks like, and what the user or operator should know. It then points execution back to agentpm run.

A CI pipeline file describes what to run and when. It does not implement the build tool. A Makefile calls cargo build; it does not reimplement the Rust compiler. The same separation applies here.

How to Audit Your Agent Tool Management Today

If your agent system has fewer than ten tools, you can probably get away with flat tool lists for a while. The pain is not large enough yet to force the architectural decision.

The right time to separate Skills from execution is before the tool catalog gets large enough that duplicate contracts become normal.

A few practical rules help:

if a Skill contains version references, it is carrying execution debt
if a Skill hardcodes a runtime command, it is carrying execution debt
if a Skill copies a full input schema, it is probably carrying execution debt
if a Skill explains workflow intent and then delegates to a stable execution surface, it is doing the right job

This is not an AgentPM-only idea. Any system with a universal runner tied to versioned installed artifacts can apply it. AgentPM is one implementation of the pattern. The transferable idea is the separation itself.

It also fits neatly with MCP. Skills can explain when an MCP-exposed capability should be used, while agentpm serve --mcp exposes the same packaged artifact through an MCP surface. The transport changes. The packaging truth does not.

The practical audit for an existing agent system:

find every Skill or prompt section that carries version assumptions
find every place where a hardcoded command leaks into guidance
find every copied schema that duplicates the manifest
find every environment variable reference that belongs in the packaging layer instead

Each one is a place where execution has leaked upward into the guidance layer.
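One way to start that audit is a small script that scans Skill text for the four signals listed above. The regular expressions below are rough, hypothetical heuristics rather than a complete detector. Expect false positives (an IP address looks like a version) and tune the patterns to your own Skills.

```python
import re

# Hypothetical heuristics for spotting execution details inside guidance text.
CHECKS = {
    "version reference": re.compile(r"\bv?\d+\.\d+\.\d+\b"),
    "hardcoded command": re.compile(r"\b(?:python3?|node|bash|sh)\s+\S+\.(?:py|js|sh)\b"),
    "copied schema": re.compile(r'"(?:\$schema|properties|required)"\s*:'),
    "environment variable": re.compile(
        r"\$\{?[A-Z][A-Z0-9_]{2,}\}?|\bos\.environ\b|\bprocess\.env\b"
    ),
}


def audit_skill(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, finding, line) for each sign of execution debt."""
    findings = []
    for number, line in enumerate(text.splitlines(), 1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((number, label, line.strip()))
    return findings


sample = """Use this when the user asks for a page summary.
Requires version 2.1.0 of the tool.
Run: python3 tool.py --url $TARGET_URL
Run: agentpm run @namespace/tool-name
"""
for number, label, line in audit_skill(sample):
    print(f"line {number}: {label}: {line}")
```

In the sample, the version note and the hardcoded command are flagged, while the line that delegates to agentpm run passes cleanly.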

A Skill that ends with agentpm run @namespace/tool-name is doing what a good manual should do. It explains the work, then points to the right tool to perform it.

Sources

[1] Berkeley Function Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
[2] MCP adoption overview: https://www.pento.ai/blog/a-year-of-mcp-2025-review
[3] MCP statistics: https://www.mcpevals.io/blog/mcp-statistics
[4] Claude Skills architecture: https://platform.claude.ai/docs/en/agents-and-tools/agent-skills/overview
[5] Claude Code system design: https://arxiv.org/html/2604.14228v1
[6] Function calling benchmarks: https://www.klavis.ai/blog/function-calling-and-agentic-ai-in-2025-what-the-latest-benchmarks-tell-us-about-model-performance
[7] A16Z MCP deep dive: https://a16z.com/a-deep-dive-into-mcp-and-the-future-of-ai-tooling/