The Skill That Builds Skills: Skill Creator V2

How Skill Creator V2 became a meta-skill for turning expert work into reusable agent abilities.

Key takeaways

A useful skill is not just a prompt. It is a repeatable workflow with evidence, tests, boundaries, and a clear failure model.
Skill Creator V2 starts by classifying the kind of human work being automated before it writes the skill.
The core architectural move is a multi-axis taxonomy: activity, domain, tool surface, risk, evidence, and workflow shape.
Deciding risk and evidence before generation makes the result safer and easier to verify than a simple "generate me a SKILL.md" template.
The public repository is here: github.com/sergekostenchuk/skill-creator-v2.
Companion research: Deep Skill Class Taxonomy for Skill Creator V2.

There is a moment when a useful trick stops being enough.

At first, an agent skill feels simple. You write a SKILL.md, describe when it should be used, add a few instructions, maybe include scripts or references, and the agent becomes better at a specific kind of work. That is already powerful. A skill can teach an agent how to operate a router, audit SEO, inspect a Figma canvas, review legal documents, or generate a video composition.

But after building enough of them, the weakness becomes obvious: the hard part is not writing one more instruction file. The hard part is deciding what kind of skill it should be, what evidence it must produce, how dangerous a mistake would be, whether it should be a single skill or a group of cooperating skills, and how to prove that the result actually works.

That is where Skill Creator V2 came from.

I started calling it, half-jokingly, the "father of skills." Not because it is magical. The point is the opposite: it tries to remove magic from skill creation. It treats a skill as an engineering artifact, not as a clever prompt.

The Problem With "Just Generate a Skill"

The naive workflow is tempting:

> "Here is a domain. Write me a skill."

An LLM can do that. It can produce a clean-looking SKILL.md, invent a workflow, add impressive words like "best practices," and make the result feel complete.

The problem is that a clean-looking skill can still be weak.

It may trigger too broadly. It may use the wrong tools. It may skip verification. It may tell the agent to modify production infrastructure without a rollback plan. It may treat legal reasoning like ordinary text generation. It may call a screenshot "evidence" without saying what that screenshot proves. It may create scripts that nobody validated. It may be useful once, but unsafe as a reusable capability.

That is why Skill Creator V2 does not begin with writing. It begins with classification.

Before generating a skill, it asks: what human work is this skill trying to capture?

Is it research? Analysis? Configuration? Publishing? Monitoring? Legal defense? Visual design? Browser automation? Knowledge management? Something that creates other skills?

The answer matters because each kind of work calls for a different kind of proof.

A Skill Is a Workflow, Not a Label

One of the strongest research inputs came from the KIMI taxonomy pass, summarized in the companion note Deep Skill Class Taxonomy for Skill Creator V2. The important argument is simple:

flat skill taxonomies break.

If you try to classify skills with one label, you quickly mix things that should stay separate. For example:

Figma is not a skill class. It is a tool surface.
Browser is usually not a skill class. It is a tool surface, unless browser control itself is the work.
Security is not always a domain. Often it is a risk profile, a review activity, or a hardening concern across domains.
Legal is not just "writing." Legal defense has deadlines, citations, jurisdiction, evidence, and human approval requirements.
Infrastructure is too broad if it hides VPS provisioning, VPN routing, DNS/SSL, router configuration, and deployment under one bucket.

The KIMI research argued that a serious skill creator needs a multi-axis model. I agree. It became the architectural center of Skill Creator V2.

Instead of asking "what class is this skill?", the system builds a classification packet:

activity type: what the agent does;
domain: the professional subject matter;
tool surface: where the work happens;
risk profile: what can go wrong and how costly it is;
evidence profile: what proves the work was done correctly;
workflow shape: whether this is one pass, a pipeline, a reviewer loop, or an orchestrator-worker system.

This changes the whole design.

A WireGuard skill is not just "infrastructure." It is network/private connectivity work, usually touching the terminal, server SSH, and the filesystem, and carrying production and security risk. It needs tunnel status, routing evidence, leak checks, config diffs, and rollback notes.

A Figma canvas skill is not just "design." It touches a specific GUI tool surface. It needs node IDs, before/after screenshots, selected-frame evidence, and scope limits.

A legal defense skill is not just "writing documents." It needs facts separated from assumptions, dated citations, jurisdiction notes, procedural deadlines, and a human decision gate.

The same pattern applies across the library.
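
As a rough illustration, the six axes can be written down as a small data structure. The sketch below is not taken from the repository: the class name, field names, and enum values are hypothetical, and the activity type and workflow shape chosen for the WireGuard example are assumptions. The domain, tool surfaces, risks, and evidence items come from the WireGuard description above.

```python
from dataclasses import dataclass
from enum import Enum


class WorkflowShape(Enum):
    """The four workflow shapes named in the axis list (identifiers are hypothetical)."""
    SINGLE_PASS = "single_pass"
    PIPELINE = "pipeline"
    REVIEWER_LOOP = "reviewer_loop"
    ORCHESTRATOR_WORKER = "orchestrator_worker"


@dataclass(frozen=True)
class ClassificationPacket:
    """One value per axis, so tools, risks, and evidence never share a label."""
    activity_type: str                 # what the agent does
    domain: str                        # the professional subject matter
    tool_surfaces: tuple[str, ...]     # where the work happens
    risk_profile: tuple[str, ...]      # what can go wrong
    evidence_profile: tuple[str, ...]  # what proves the work was done correctly
    workflow_shape: WorkflowShape


wireguard_packet = ClassificationPacket(
    activity_type="configuration",           # assumption, not stated in the text
    domain="network/private connectivity",
    tool_surfaces=("terminal", "server SSH", "filesystem"),
    risk_profile=("production", "security"),
    evidence_profile=(
        "tunnel status",
        "routing evidence",
        "leak checks",
        "config diffs",
        "rollback notes",
    ),
    workflow_shape=WorkflowShape.PIPELINE,   # assumption, not stated in the text
)
```

Keeping each axis in its own field is the point of the exercise: "terminal" can never be mistaken for the class of the skill, and the evidence list is visible before any instructions are written.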

Why This Is More Mature Than a Prompt Template

Prompt templates are useful. They save time. They give the model a voice and a structure.

Skill Creator V2 goes further. It tries to produce reusable capabilities that can survive repeated use.

That requires several layers.

First, it creates a skill boundary. A good skill says what it owns and what it does not own. This prevents the agent from turning every task into the same oversized procedure.

Second, it creates an evidence contract. Evidence is not just a screenshot, a log, or a claim. It has to explain what was collected, why it matters, where the artifact is stored, how it is validated, and what should be redacted before anything is published.
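
One way to make such a contract checkable is to treat each piece of evidence as a record with five required fields, one per question in the paragraph above. This is a minimal sketch: `EvidenceItem` and `missing_fields` are hypothetical names rather than the repository's API, and the file path is a made-up example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidenceItem:
    what_was_collected: str
    why_it_matters: str
    artifact_location: str
    validation_method: str
    redaction_notes: str


def missing_fields(item: EvidenceItem) -> list[str]:
    """Return the names of fields left empty or whitespace-only."""
    return [f.name for f in fields(item) if not getattr(item, f.name).strip()]


# A screenshot that exists but does not say what it proves or how to check it.
bare_screenshot = EvidenceItem(
    what_was_collected="screenshot of the selected frame",
    why_it_matters="",
    artifact_location="evidence/frame-after.png",
    validation_method="",
    redaction_notes="none needed",
)

print(missing_fields(bare_screenshot))
# ['why_it_matters', 'validation_method']
```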

Third, it creates risk-derived gates. Human review is not a decorative checkbox. It should appear because the work is legally sensitive, externally visible, destructive, security-relevant, or connected to production infrastructure.
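
A gate derived from risk can be sketched as a function of the risk flags rather than a fixed checkbox. The flag identifiers below are hypothetical spellings of the five conditions in the paragraph above; how the real system represents them is not specified here.

```python
# Risk flags that justify a human review gate. Identifiers are hypothetical.
REVIEW_TRIGGERS = frozenset({
    "legally_sensitive",
    "externally_visible",
    "destructive",
    "security_relevant",
    "production_infrastructure",
})


def review_reasons(risk_flags: set[str]) -> list[str]:
    """Return the flags that justify human review, sorted for stable output."""
    return sorted(REVIEW_TRIGGERS & set(risk_flags))


def requires_human_review(risk_flags: set[str]) -> bool:
    """Review is required only when at least one trigger flag is present."""
    return bool(review_reasons(risk_flags))


print(review_reasons({"destructive", "production_infrastructure", "internal_only"}))
# ['destructive', 'production_infrastructure']
print(requires_human_review({"internal_only"}))
# False
```

Returning the reasons, not just a boolean, lets the approval request tell the user why it is being raised.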

Fourth, it creates evals and regression checks. If a skill claims to be production-ready, there should be some way to test it. Not every skill can be perfectly measured, but every serious skill can have negative cases, edge cases, and sanity checks.
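
For example, a trigger rule can be checked against positive, negative, and edge cases. The sketch below assumes a deliberately naive keyword trigger for a WireGuard skill; the trigger, the case format, and the example requests are all hypothetical and only illustrate the idea of a regression check.

```python
from typing import Callable, NamedTuple


class EvalCase(NamedTuple):
    request: str
    should_trigger: bool
    kind: str  # "positive", "negative", or "edge"


def run_trigger_evals(
    trigger: Callable[[str], bool], cases: list[EvalCase]
) -> list[EvalCase]:
    """Return the cases where the trigger's decision differs from the expectation."""
    return [case for case in cases if trigger(case.request) != case.should_trigger]


def naive_wireguard_trigger(request: str) -> bool:
    """Hypothetical trigger: fires on any mention of 'vpn'."""
    return "vpn" in request.lower()


cases = [
    EvalCase("Set up a WireGuard VPN tunnel on my VPS", True, "positive"),
    EvalCase("Write a blog post comparing VPN providers", False, "negative"),
    EvalCase("Check whether my WireGuard tunnel leaks DNS", True, "edge"),
]

failures = run_trigger_evals(naive_wireguard_trigger, cases)
print([case.kind for case in failures])
# ['negative', 'edge']
```

The naive trigger passes the obvious case, fires on a writing task it should ignore, and misses a request that never says "VPN". Only the negative and edge cases reveal that.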

Fifth, it supports skill groups. Some workflows should not be one giant skill. If different stages require different expertise, different tools, or independent review, the right answer is an orchestrator with worker skills and handoff contracts.
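
The split decision can be sketched as a simple heuristic over the stages of a workflow. This is an assumption about how such a check might look, not the project's actual rule: `Stage` and `should_split_into_group` are hypothetical names, and a real decision would likely weigh these signals rather than treat any single one as decisive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    expertise: str
    tools: frozenset[str]
    needs_independent_review: bool = False


def should_split_into_group(stages: list[Stage]) -> bool:
    """Suggest an orchestrator with workers when stages differ in expertise or tools,
    or when any stage needs independent review."""
    if len(stages) < 2:
        return False
    distinct_expertise = len({stage.expertise for stage in stages}) > 1
    distinct_tools = len({stage.tools for stage in stages}) > 1
    independent_review = any(stage.needs_independent_review for stage in stages)
    return distinct_expertise or distinct_tools or independent_review


writing_helper = [
    Stage("draft", "internal writing", frozenset({"filesystem"})),
    Stage("polish", "internal writing", frozenset({"filesystem"})),
]
deployment = [
    Stage("provision", "VPS provisioning", frozenset({"server SSH"})),
    Stage("verify", "network routing", frozenset({"terminal"}), needs_independent_review=True),
]

print(should_split_into_group(writing_helper))  # False
print(should_split_into_group(deployment))      # True
```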

The "Father of Skills" Is Really a Reviewer

The name sounds like a generator. In practice, Skill Creator V2 is just as much a reviewer.

It should be able to look at a proposed skill and ask uncomfortable questions:

Does this skill trigger too broadly?
Is the risk understated?
Is evidence required or merely suggested?
Are scripts validated?
Is the user being asked for approval at the right point?
Is the skill trying to be a whole department instead of one reusable capability?
Should this be a skill group rather than one skill?
Are we claiming success without proof?

This is important because agents are very good at producing confident artifacts. The job of a mature skill system is to slow that down when the cost of being wrong is high.

The goal is not bureaucracy. The goal is controlled autonomy.

Low-risk skills should remain fast. A small internal writing helper does not need a full governance pipeline. But a production deployment skill, router operation skill, legal defense skill, or meta-skill that creates other skills needs stronger gates.

That is the core design principle: more risk requires more evidence.
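
The principle can be sketched as cumulative gates: each risk tier keeps every gate from the tiers below it and adds its own. The tier names and gate names here are hypothetical placeholders chosen for illustration.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 1
    ELEVATED = 2
    HIGH = 3


# Gates introduced at each tier. Names are hypothetical.
GATES_ADDED_AT = {
    RiskTier.LOW: ("sanity_check",),
    RiskTier.ELEVATED: ("evidence_contract", "negative_case_evals"),
    RiskTier.HIGH: ("rollback_plan", "human_approval"),
}


def required_gates(tier: RiskTier) -> tuple[str, ...]:
    """Collect the gates of this tier and every lower tier, lowest tier first."""
    gates: list[str] = []
    for level in RiskTier:  # iterates in definition order: LOW, ELEVATED, HIGH
        if level <= tier:
            gates.extend(GATES_ADDED_AT[level])
    return tuple(gates)


print(required_gates(RiskTier.LOW))
# ('sanity_check',)
print(required_gates(RiskTier.HIGH))
# ('sanity_check', 'evidence_contract', 'negative_case_evals', 'rollback_plan', 'human_approval')
```

Because the gates accumulate, a higher tier can never require less than a lower one, and a low-risk helper stays fast.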

How the Work Evolved

The project did not start as a clean theory.

It started from practical frustration. I was creating more skills: SEO and LLM discoverability skills, UI/UX skills, infrastructure skills, browser operation skills, legal reasoning skills, media generation skills, and skills for maintaining an Obsidian-style knowledge base.

Each new skill raised the same questions:

What should go in SKILL.md?
What belongs in references?
What should be deterministic script logic?
How do we test it?
When should it ask the user for permission?
When is a task too broad for one skill?
How do we avoid fake confidence?

At the same time, I was using task plans to keep the work observable. That mattered. Without a task plan, the model can drift, repeat itself, or declare work finished too early. With a task plan, decisions, gates, open questions, and done states become visible.

The Skill Creator V2 work became a meta-version of that process. It asks the same thing for every future skill:

what must be true before this can be trusted?

What Makes It Different

Most skill generators focus on output:

> "Here is your skill."

Skill Creator V2 focuses on the system around the output:

classification before generation;
risk before permission;
evidence before done;
evals before maturity claims;
reviewer gates before release;
packaging and sanitation before publication;
regression checks before future changes.

This makes it closer to a small production pipeline than a prompt writer.
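
The "X before Y" rules in the list above can be read as precedence constraints over a log of completed steps. The following is a minimal sketch under that assumption; the step names simply reuse the wording of the list, and `ordering_violations` is a hypothetical helper.

```python
# Each pair means: the first step must have happened before the second.
PRECEDENCE = [
    ("classification", "generation"),
    ("risk", "permission"),
    ("evidence", "done"),
    ("evals", "maturity claims"),
    ("reviewer gates", "release"),
    ("packaging and sanitation", "publication"),
    ("regression checks", "future changes"),
]


def ordering_violations(steps: list[str]) -> list[tuple[str, str]]:
    """Return every (before, after) rule that the step log breaks.

    A rule is broken when `after` appears in the log but `before` is
    missing or first appears later than `after`.
    """
    first_seen: dict[str, int] = {}
    for index, step in enumerate(steps):
        first_seen.setdefault(step, index)

    violations = []
    for before, after in PRECEDENCE:
        if after not in first_seen:
            continue  # the rule does not apply yet
        if before not in first_seen or first_seen[before] > first_seen[after]:
            violations.append((before, after))
    return violations


print(ordering_violations(["generation", "classification", "evidence", "done"]))
# [('classification', 'generation')]
print(ordering_violations(["classification", "generation", "done"]))
# [('evidence', 'done')]
```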

It also changes how skill quality is discussed. Instead of saying "this skill is good," the system should be able to say:

what it was designed to do;
what it intentionally refuses to do;
what evidence proves it worked;
what tests were run;
what risks remain;
when it should escalate to the user.

That is a more honest standard.

Why This Matters Now

Agent skills are becoming a real interface layer. They sit between general-purpose models and real work. They carry procedures, tool assumptions, domain rules, safety boundaries, and project memory.

That makes them powerful. It also makes them risky.

A weak skill can quietly teach an agent the wrong habit. A broad skill can trigger in the wrong context. A missing evidence rule can make the agent believe a task is done because text was generated. A dangerous script can become a supply-chain problem. A meta-skill that creates other skills can multiply its own mistakes.

This is why the "father of skills" should not be a bigger prompt. It should be a disciplined creator, classifier, reviewer, packager, and critic.

The public repository is being developed here:

github.com/sergekostenchuk/skill-creator-v2

It is still evolving, but the architectural direction is clear: skills should be designed as reusable work systems, not just instruction files.

What the Companion Research Adds

The companion research explains why the system uses multiple axes instead of one class label, why tool names should not become classes, why security is often a risk dimension, why legal work needs its own high-stakes gates, and why groups should be created only when there is a real evidence or role boundary.

For readers who just want the simple version:

Skill Creator V2 is an attempt to make agent skills more honest.

Not perfect. Not magical. Not automatic expertise.

Just a more mature way to ask:

what work are we capturing, what can go wrong, what proves success, and when should the agent stop and ask for human judgment?

That is the kind of skill creator I wanted.

Not a prompt factory.

A skill that builds skills responsibly.