diff --git a/website/blog/2026-05-09-tuning-up-copilot-skills.md b/website/blog/2026-05-09-tuning-up-copilot-skills.md deleted file mode 100644 index 75f8355..0000000 --- a/website/blog/2026-05-09-tuning-up-copilot-skills.md +++ /dev/null @@ -1,246 +0,0 @@ ---- -slug: /2026-05-09-tuning-up-copilot-skills -canonical_url: https://dfberry.github.io/blog/2026-05-09-tuning-up-copilot-skills -custom_edit_url: null -sidebar_label: "2026.05.09 Tuning Up Copilot Skills" -title: "Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills" -description: "I had 413K tokens of unoptimized skills and a waza toolkit to diagnose them. Here's what I found, what surprised me, and what actually worked." -draft: true -tags: - - GitHub Copilot - - Skills - - Waza - - Token Optimization - - AI assisted - - Tutorial -updated: 2026-05-09 18:00 PST -keywords: - - copilot skills optimization - - waza tokens - - skill refactoring - - token reduction - - copilot cli - - copilot token budget - - skill.md optimization - - copilot skills tutorial ---- - -# Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills - -I'd been ignoring the `.copilot/skills/` directoryfor a while. I knew it was growing. Every time I built a new feature or onboarded a new domain, I'd add a skill. Sometimes three. My thinking was: more skills = more capability. And for a while, that was true. - -Then I actually counted. - -**413,591 tokens across 136 files.** Six SDK sample review skills alone were consuming 140K tokens — 34% of the total budget — and I hadn't even noticed. Dead stub skills sitting around redirecting to nothing. Duplicated prose across six language variants. It was the kind of growth that creeps in when you're building fast and not auditing. - -Skills are different from context — they're loaded on demand, not held open. Optimizing them doesn't free your active context window. But it makes agent spawns faster and skill loading cheaper. Different lever, different win. I wanted both. 
- -The optimization patterns I found — reference extraction, checklist compression, shared references — work whether you're editing skills by hand or using Copilot CLI to batch the refactoring. I used GitHub Copilot CLI with Squad orchestration to process multiple skills in parallel, but the techniques themselves are tool-agnostic. You could apply them manually in any editor. - -## Measuring Token Usage with microsoft/waza - -I'd been meaning to look at [Waza](https://github.com/microsoft/waza) for a while. It's a skill quality toolkit, and `waza_tokens count` is exactly what I needed — it scans your skills directory and gives you a sorted breakdown of token usage. No guessing, no eyeballing file sizes. - -Here's what the output looked like on my directory: - -``` -$ waza_tokens count .copilot/skills/ -┌─────────────────────────────────┬────────┐ -│ Skill │ Tokens │ -├─────────────────────────────────┼────────┤ -│ data-plus-ai-sdk-java-sample... │ 25,841 │ -│ data-plus-ai-sdk-python-samp... │ 23,921 │ -│ ... │ │ -│ dina-small-utility │ 312 │ -├─────────────────────────────────┼────────┤ -│ Total: 117 skills │413,591 │ -└─────────────────────────────────┴────────┘ -``` - -That top number hit hard. 25K tokens for a single skill. Waza has a few other useful tools too — `waza_tokens suggest` for optimization ideas, `waza_quality` to verify I hadn't broken anything post-optimization, and `waza_dev --copilot` for frontmatter work on new skills. But for this cleanup, the tokens count was the starting gun. - -## Planning the Work - -I analyzed the skills directory and decomposed the work into phases, ordered by expected savings. The logic was simple: don't spend time on small wins until you've cleared the big ones. - -| Phase | Target | Est. Savings | -|-------|--------|-------------| -| 1. Kill stubs | 3 empty redirect skills | ~73 tokens | -| 2. Refactor giants | 6 SDK review skills (140K!) | ~120K tokens | -| 3. 
Optimize large | 14 skills (5K–10K each) | ~30–50K tokens | -| 4. Optimize medium | 60 skills (1K–5K each) | ~10–20K tokens | -| 5. Trim small | 20 skills (under 1K each) | minimal | -| 6. Audit references | Large reference files | ~10–15K tokens | - -![Phase plan: 6 phases with baseline 413K tokens and estimated savings](./media/2026-05-09-tuning-up-copilot-skills/image6.png) - -The key insight: start with the biggest consumers. Phases 1–3 were going to capture roughly 90% of the savings. Phases 4 and 5 were nice-to-haves — we'd do them if there was time and energy. - -Spoiler: we didn't finish Phase 4. More on that in the lessons section. - -## Phase 1: Killing the Stubs - -Three skills turned out to be redirect stubs — they pointed to other skills and contained under 50 tokens of actual content. No routing logic, no checklist, no value. - -Deleted instantly. - -**Savings: −73 tokens.** Barely worth counting, but this is the boring-is-good part of the work — a clean directory is easier to reason about, and stubs are just future confusion waiting to happen. - -## Phase 2: The Giants - -This is where things got interesting. - -Six SDK sample review skills — one per language (Java, Python, Go, .NET, TypeScript, Rust) — were enormous. Each one had been built with the same template: 15–16 detailed rule sections, full code examples inline, everything an agent could possibly need to review a code sample in that language. The problem was that *everything* meant every token, every time. - -The technique here is what I'd call reference extraction. Instead of keeping all those detailed rules inline in `SKILL.md`, you move them into `references/` files and keep the `SKILL.md` slim — just routing info, a quick checklist, and the blocker-level issues. When an agent loads the skill, it gets the overview. If it needs the deep rules, it reads the reference files on demand. Two-tier architecture, essentially. 
- -I ran all six in parallel, one per language: - -| Skill | Before | After | Reduction | -|-------|--------|-------|-----------| -| Java | 25,841 | 1,541 | **94%** | -| Python | 23,921 | 1,083 | **95%** | -| Go | 24,355 | 1,815 | **93%** | -| .NET | 23,355 | 1,378 | **94%** | -| TypeScript | 21,543 | 1,525 | **93%** | -| Rust | 21,303 | 1,643 | **92%** | -| **Total** | **140,318** | **8,985** | **~131K saved** | - -Zero content removed from the skill suite. Every rule, every code example — preserved in reference files. This is the trade-off worth naming: agents now navigate a two-tier structure (SKILL.md → references/) instead of having everything in one place. Discoverability costs something. I decided it was worth it here because these skills are used frequently enough that agents will learn the pattern. - -![Phase 2 complete: SDK skills before/after showing 94%+ reduction per language](./media/2026-05-09-tuning-up-copilot-skills/image11.png) - -## Phase 3: Large Skills - -14 more skills in the 5K–10K range, processed in 4 parallel batches. `azure-mcp-content-generation`, `dina-reskill`, `context-diagnostics` — all optimization targets, all following the same pattern as Phase 2. Extract the verbose sections, keep the core routing slim. - -**Savings: −68,084 tokens (76% reduction)** - -## Running Totals After 3 Phases - -By this point we'd done the heavy lifting: - -``` -Phase 1 (stubs): −73 tokens -Phase 2 (giants): −131,333 tokens -Phase 3 (large): −68,084 tokens -──────────────────────────────── -Total saved: ~199,490 tokens -``` - -![Phase 3 complete with running totals: ~199K tokens saved, 214K remaining](./media/2026-05-09-tuning-up-copilot-skills/image14.png) - -About halfway through the session I started feeling good about the numbers. That's usually when something goes sideways. 
- -## The PR and the Review - -PR #147: **106 files changed, 12,176 insertions, 18,571 deletions.** - -![Pull request showing 65% Copilot skills token reduction across 106 files](./media/2026-05-09-tuning-up-copilot-skills/image15.png) - -I ran four automated review passes — structural integrity, waza_quality scores, trigger precision, and an adversarial over-trimming check. Three passed or passed with notes. The adversarial pass caught two real blockers: a reference file with a broken relative path, and a skill trimmed past the point of usefulness — the `SKILL.md` was essentially just a title and a pointer, with no routing context left to tell an agent when or how to use it. - -Both issues were fixed and re-reviewed. Second pass: ✅ SHIP. - -The lesson from those blockers: don't reduce a `SKILL.md` below ~800 tokens. Below that, you risk losing enough routing context that agents can't determine when or how to use the skill. If your `SKILL.md` is just a title and a link to references, you've gone too far. - -### Final Numbers - -``` -Before: 413,591 tokens (117 skills) -After: 143,354 tokens (114 skills) -Saved: 270,237 tokens (65.3% reduction) -``` - -![Final summary: 413K → 143K tokens, 65.3% reduction](./media/2026-05-09-tuning-up-copilot-skills/image17.png) - -The 143K figure is pre-deduplication. The shared reference extraction in the next section further reduced maintenance overhead but didn't significantly change the token count — it consolidated duplicates rather than removing content. - -## Bonus Round: Something I Didn't Plan For - -After the optimization was done, I noticed something I'd missed in the planning phase. - -The 6 SDK skills had each independently created similar reference files during the refactoring. When I looked at them side by side: 86 files across 6 skills, with about 45% duplicated prose — generic best practices that apply to any language. 
TypeScript and Java both had essentially identical sections on error handling conventions, documentation standards, test coverage requirements. Written separately, maintained separately. - -That's six copies of the same thing I'd now have to update every time the guidance changed. - -The fix: create a shared reference directory (`shared-sdk-sample-review-references/`) with 14 files of generic prose. Each per-language skill keeps only its language-specific code examples, with a link to the shared counterpart at the top of each file. - -![SDK reference consolidation: before/after, single source of truth](./media/2026-05-09-tuning-up-copilot-skills/image21.png) - -Updating a best practice now means editing 1 file instead of 6. That's the kind of maintenance win that doesn't show up in token counts but pays back over time. - -## Dogfooding: The Reskill Skill - -The optimization pipeline worked well enough that I captured it as a skill — `dina-reskill` — documenting the 8-pattern optimization workflow (reference extraction, checklist compression, example pruning, and so on). - -Then, because I'm apparently incapable of leaving well enough alone, I ran `dina-reskill` on itself: - -``` -SKILL.md: 2,085 → 1,163 (44% reduction) -Total: 5,401 → 4,288 (21% reduction) -``` - -Three review passes: two clean approvals, one note flagged and fixed. - -The skill practices what it preaches. 🐕 - -## What Actually Worked: The Patterns - -My perspective on what to reach for first, ranked by impact: - -### 1. Reference Extraction - -This was the biggest single win by far. Move detailed rules, code examples, and verbose explanations into `references/` files. The `SKILL.md` becomes a routing layer — overview, quick checklist, blocker list. Agents load references on demand. For any skill over 5K tokens, this should be your first move. - -### 2. Checklist Compression - -Turn paragraph-style guidance into concise checklists. 
"When reviewing error handling, ensure that all errors are properly caught, logged with appropriate context, and returned with meaningful messages to the caller" becomes "✅ Errors: caught, logged with context, meaningful messages." Same information, fraction of the tokens. - -### 3. Example Pruning - -One good example per pattern. If your skill has 3 examples of the same concept, pick the clearest one and reference-extract the rest. - -### 4. Shared References - -If multiple skills share common guidance, extract it once and link. The `shared-sdk-sample-review-references/` pattern is the one I wish I'd designed from the start — it's a classic case of noticing the duplication only after you've already duplicated it everywhere. - -### 5. Stub Elimination - -If a skill just redirects to another skill, delete it. The router doesn't need a placeholder, and stubs will confuse future agents trying to decide what to use. - -## Honest Lessons: How I Should Have Run This - -I ran this over 8 user messages. Here's what that actually looked like, and what I'd do differently: - -| What Happened | What Would Have Been Better | -|---------------|---------------------------| -| "get ready" + "can you plan" (2 turns) | State the goal upfront with the tool name | -| "keep going" × 2 | "Run all phases, don't stop between them" | -| SDK dedup discovered late (turn 6–8) | Mention "deduplicate shared content" upfront | -| Asking about PR + review + results separately | Bundle deliverables: "PR, team review, results file" | - -The pattern I should have followed: front-load three things — (1) the tool or technique, (2) the full scope with known edge cases, (3) all the deliverables I want at the end. One prompt, not eight. - -The planning phase is cheap; the execution phase is expensive. I skipped the planning phase because I was impatient. I paid for it in "keep going" messages. 
- -## The Setup - -For reference, here's what I was running: - -- **[GitHub Copilot CLI](https://github.com/github/copilot-cli)** v1.0.40 -- **[Squad](https://github.com/bradygaster/squad)** v0.9.4-insider.1 for multi-agent orchestration -- **[microsoft/waza](https://github.com/microsoft/waza)** for skill quality analysis -- **Model:** Claude Opus 4.6 with 200K context window - -## Where to Go From Here - -If you're curious whether your own skills directory needs this treatment, `waza_tokens count` is the quick answer. If your total is over 100K tokens, you probably have meaningful room to optimize. If you have skills over 5K tokens, reference extraction is almost always worth it. - -I'm not going to hand you a checklist and call it a day — everyone's skill architecture is different, and the interesting work is figuring out which patterns actually fit your setup. But if you do try this and discover something that works or something that breaks badly, I'd genuinely be curious to hear what you found. - -Full session ran on May 9, 2026. 8 user messages, about 2 hours, 270K tokens saved. - ---- - -*Fun stuff!* The repo is at [github.com/diberry/project-dina](https://github.com/diberry/project-dina) if you want to dig into the skill structure directly. diff --git a/website/blog/2026-05-11-tuning-up-copilot-skills.md b/website/blog/2026-05-11-tuning-up-copilot-skills.md new file mode 100644 index 0000000..dd3694e --- /dev/null +++ b/website/blog/2026-05-11-tuning-up-copilot-skills.md @@ -0,0 +1,325 @@ +--- +slug: /2026-05-11-tuning-up-copilot-skills +canonical_url: https://dfberry.github.io/blog/2026-05-11-tuning-up-copilot-skills +custom_edit_url: null +sidebar_label: "2026.05.11 Tuning Up Copilot Skills" +title: "Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills" +description: "I had 413K tokens of unoptimized skills and a waza toolkit to diagnose them. Here's what I found, what surprised me, and what actually worked." 
+draft: false +tags: + - GitHub Copilot + - Skills + - Waza + - Token Optimization + - AI assisted + - Tutorial +updated: 2026-05-11 18:00 PST +keywords: + - copilot skills optimization + - waza tokens + - skill refactoring + - token reduction + - copilot cli + - copilot token budget + - skill.md optimization + - copilot skills tutorial +--- + +# Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills + +![Watercolor illustration of a craftsperson's workbench being tidied and organized](./media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop.png) + +Optimizing skills felt less like deleting content and more like reorganizing a workshop — same tools, better drawers. + +I'd been adding to the `.copilot/skills/` directory for a while without taking inventory. Every feature or domain onboarding meant a new skill — sometimes three. The assumption was obvious: more skills = more consistency. For the first few dozen, that was true. + +Here's what's weird: I had no actual count. + +When I finally looked: **413,591 tokens across 136 skill and reference files (117 distinct skills).** Just measuring it revealed the bloat: +- 6 SDK sample review skills: 140K tokens (34% of total budget) +- Dead redirect stubs: consuming tokens for no routing purpose +- Duplicated prose: same guidance repeated across language variants + +Not a disaster, but the kind of creeping growth that happens when you build fast and don't audit. + +**Why this matters:** Skills load on demand, so optimizing them doesn't free your active context window. But faster agent spawns and cheaper skill loading? That's a different lever, and I wanted to pull it. + +The patterns I found — reference extraction, checklist compression, shared references — work with any tool. I used GitHub Copilot CLI with Squad orchestration to run them in parallel, but you could apply them manually in any editor. The techniques are the point, not the tooling. 
+ +## Measuring Token Usage with microsoft/waza + +The first move: measure. [Waza](https://github.com/microsoft/waza) is a skill quality toolkit, and `waza_tokens count` does exactly that — scans your skills directory and gives you sorted token usage. No guessing. Here's the breakdown: + +``` +$ waza_tokens count .copilot/skills/ +┌─────────────────────────────────┬────────┐ +│ Skill │ Tokens │ +├─────────────────────────────────┼────────┤ +│ data-plus-ai-sdk-java-sample... │ 25,841 │ +│ data-plus-ai-sdk-python-samp... │ 23,921 │ +│ ... │ │ +│ dina-small-utility │ 312 │ +├─────────────────────────────────┼────────┤ +│ Total: 117 skills │413,591 │ +└─────────────────────────────────┴────────┘ +``` + +25K tokens for a single skill. That's the starting point. Waza has other tools too — `waza_tokens suggest` for optimization ideas, `waza_quality` to verify post-changes, and `waza_dev --copilot` for frontmatter — but for this work, `count` was the diagnostic tool. + +## Planning the Work + +The strategy: decompose into phases, ordered by savings potential. **Clear the big ones first; small wins come after.** + +| Phase | Target | Est. Savings | +|-------|--------|-------------| +| 1. Kill stubs | 3 empty redirect skills | ~73 tokens | +| 2. Refactor giants | 6 SDK review skills (140K!) | ~120K tokens | +| 3. Optimize large | 14 skills (5K–10K each) | ~30–50K tokens | +| 4. Optimize medium | 60 skills (1K–5K each) | ~10–20K tokens | +| 5. Trim small | 20 skills (under 1K each) | minimal | +| 6. Audit references | Large reference files | ~10–15K tokens | + + + +**Why this order matters:** I started with stubs not because they saved much, but because they reduced noise before the real work. Phases 2–3 capture the bulk of savings. Phases 4–5 are diminishing returns per skill, but we completed them efficiently by applying the same patterns we'd already proven earlier. 
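If you want to sanity-check the phase buckets without waza, a rough census is easy to script. A minimal sketch, assuming one `SKILL.md` per skill directory and approximating tokens as characters divided by four (expect drift versus `waza_tokens count`, which uses a real tokenizer):

```python
# Rough phase bucketing without waza. Tokens are approximated as
# characters / 4, so expect drift versus `waza_tokens count`, which
# uses a real tokenizer. Layout assumption: <skills_dir>/<name>/SKILL.md.
from pathlib import Path

def estimate_tokens(text: str) -> int:
    # Common rule of thumb: ~4 characters per token for English prose.
    return len(text) // 4

def bucket(tokens: int) -> str:
    # Mirrors the phase plan: stubs, giants, large, medium, small.
    if tokens < 100:
        return "1-stub"
    if tokens >= 10_000:
        return "2-giant"
    if tokens >= 5_000:
        return "3-large"
    if tokens >= 1_000:
        return "4-medium"
    return "5-small"

def census(skills_dir: str) -> dict[str, list[tuple[str, int]]]:
    phases: dict[str, list[tuple[str, int]]] = {}
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        est = estimate_tokens(skill_md.read_text(encoding="utf-8"))
        phases.setdefault(bucket(est), []).append((skill_md.parent.name, est))
    return phases

if __name__ == "__main__":
    for phase, skills in sorted(census(".copilot/skills").items()):
        subtotal = sum(est for _, est in skills)
        print(f"{phase}: {len(skills)} skills, ~{subtotal:,} est. tokens")
```

The absolute numbers will be off; the ordering and bucketing are what the plan needs.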
+ +## Phase 1: Killing the Stubs + +**Problem:** Three skills were redirect stubs — they pointed to other skills and had fewer than 50 tokens of actual content. No routing logic, no value. + +**Action:** Deleted them. + +**Result:** −73 tokens. Barely registers numerically, but this is the "boring is good" work. A clean directory is easier to maintain, and stubs confuse future maintainers. + +## Phase 2: The Giants + +**Problem:** Six SDK sample review skills (Java, Python, Go, .NET, TypeScript, Rust) had identical structure: 15–16 detailed rule sections + full code examples inline. Total: 140K tokens (34% of budget). Agents loaded everything every time, even when they only needed one language's rules. + +**Technique: Reference Extraction.** Move verbose rules and examples to `references/` files. Keep `SKILL.md` slim — just routing info, a quick checklist, and blockers. Agents load the overview immediately, fetch detailed rules on demand. + +**Before:** +``` +java-sdk-review/SKILL.md (25,841 tokens) +├── Routing info (2K) +├── Error handling rules (8K + full examples) +├── Concurrency rules (7K + full examples) +├── Async patterns (6K + full examples) +└── ... 12 more sections ... +``` + +**After:** +``` +java-sdk-review/SKILL.md (1,541 tokens) +├── Routing: detect Java SDK samples +├── Quick checklist +│ ├── Error handling: caught, logged, meaningful messages +│ ├── Concurrency: thread-safe, no race conditions +│ ├── Async patterns: proper callback/future chaining +│ └── ... 5 more items +└── Reference files in references/java/ (loaded on demand) +``` + +Two-tier architecture. Same content, loaded smarter. 
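The mechanical half of that extraction is scriptable. A sketch under simplifying assumptions: every detail section starts with an H2 heading, and the `KEEP` set (illustrative names, not a waza convention) lists the routing-level sections that stay inline:

```python
# Reference-extraction sketch. Assumes detail sections start with
# "## " headings; KEEP names routing-level sections left inline.
# Section names here are illustrative, not a waza convention.
import re
from pathlib import Path

KEEP = {"Overview", "Quick Checklist", "Blockers"}

def extract_references(skill_md: Path) -> None:
    text = skill_md.read_text(encoding="utf-8")
    refs_dir = skill_md.parent / "references"
    refs_dir.mkdir(exist_ok=True)
    # Split into preamble + alternating (heading, body) pairs.
    parts = re.split(r"^## (.+)$", text, flags=re.M)
    slim = [parts[0]]
    for heading, body in zip(parts[1::2], parts[2::2]):
        if heading.strip() in KEEP:
            slim.append(f"## {heading}\n{body}")
        else:
            # Move the detail section out; leave a pointer behind.
            slug = heading.strip().lower().replace(" ", "-")
            (refs_dir / f"{slug}.md").write_text(
                f"## {heading}\n{body}", encoding="utf-8"
            )
            slim.append(
                f"## {heading}\n\nSee [references/{slug}.md](references/{slug}.md)\n"
            )
    skill_md.write_text("".join(slim), encoding="utf-8")
```

In practice I had an agent do this per skill rather than running a script blindly, since deciding what counts as routing-level is judgment, not regex.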
+ +**Execution:** Ran all six in parallel: + +| Skill | Before | After | Reduction | +|-------|--------|-------|-----------| +| Java | 25,841 | 1,541 | **94%** | +| Python | 23,921 | 1,083 | **95%** | +| Go | 24,355 | 1,815 | **93%** | +| .NET | 23,355 | 1,378 | **94%** | +| TypeScript | 21,543 | 1,525 | **93%** | +| Rust | 21,303 | 1,643 | **92%** | +| **Total** | **140,318** | **8,985** | **~131K saved** | + +**Trade-off:** Agents now navigate a two-tier structure (SKILL.md → references/) instead of one flat file. Discoverability costs something. But these skills are used frequently enough that agents will learn the pattern. Zero content was removed — every rule and example is still there, just reference-extracted. + +![Phase 2 complete: SDK skills before/after showing 94%+ reduction per language](./media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.png) + +## Phase 3: Large Skills + +**Problem:** 14 more skills in the 5K–10K range had the same structure: verbose sections that could be extracted. Examples: `azure-mcp-content-generation`, `dina-reskill`, `context-diagnostics`. + +**Action:** Applied the same reference extraction pattern in 4 parallel batches. + +**Result:** −68,084 tokens (76% reduction) + +**Running total:** +``` +Phase 1 (stubs): −73 tokens +Phase 2 (giants): −131,333 tokens +Phase 3 (large): −68,084 tokens +──────────────────────────────── +Subtotal saved: ~199,490 tokens (48% of starting budget) +Remaining: ~214,101 tokens +``` + +At this point, the curve was clear. Phases 4–5 (medium and small skills) would yield diminishing returns per unit effort — but having proven the techniques in Phases 1–3, we already knew how to apply them efficiently at scale. + +## The PR and the Review + +**Problem:** The PR touched 106 files, 18,571 deletions, 12,176 insertions. 
Before shipping, we needed to verify: +- Structural integrity (paths, syntax, references valid) +- Quality didn't regress (waza_quality scores) +- Routing logic still precise +- Didn't over-trim skills below usefulness + +**Action:** Ran four automated review passes: +1. Structural integrity check — passed +2. Waza quality verification — passed with notes +3. Trigger precision validation — passed with notes +4. Adversarial over-trimming check — **caught 2 real issues** + +**Issues found and fixed:** +1. Reference file with broken relative path (in Phase 2) +2. Skill trimmed below ~800 tokens (lost routing context entirely) + +**Second pass:** ✅ SHIP + +![Pull request showing 65% Copilot skills token reduction across 106 files](./media/2026-05-11-tuning-up-copilot-skills/pr-summary.png) + +**Key finding:** Don't reduce a `SKILL.md` below ~800 tokens for standalone skills. Below that threshold, you lose enough routing context that agents can't determine when or how to use the skill. **Exception:** Skills with strong internal routing logic (like the unified SDK skill at 469 tokens) can go lower because their dispatch logic compensates. + +The ~800-token floor is a practical boundary, discovered through testing. + +## Phases 4–6: The Curve Flattens + +After Phase 3, ~214K tokens remained. Phases 4–6 brought that down to 143K — another ~70K saved by applying the same patterns (checklist compression, reference extraction, deduplication) at smaller scale: + +| Phase | Skills | Technique | Savings | +|-------|--------|-----------|---------| +| 4. Medium | 60 skills (1K–5K each) | Checklist compression, dedup | ~45K | +| 5. Small | 20 skills (under 1K each) | Light trimming | ~10K | +| 6. Reference audit | Large reference files | Consolidation | ~15K | + +The per-skill ROI drops in later phases, but having proven the techniques in Phases 1–3, the work was mechanical — same patterns, smaller targets. The current 143K total is sustainable for the usage pattern. 
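That floor is easy to enforce mechanically. A sketch of the over-trimming guard, again using characters divided by four as a stand-in for real token counts:

```python
# Over-trimming guard: flag any SKILL.md whose estimated token count
# falls below the ~800-token routing floor. chars / 4 stands in for a
# real tokenizer, and the floor itself is a heuristic, not a hard rule.
from pathlib import Path

ROUTING_FLOOR = 800

def too_trimmed(skills_dir: str, floor: int = ROUTING_FLOOR) -> list[tuple[str, int]]:
    flagged = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        est = len(skill_md.read_text(encoding="utf-8")) // 4
        if est < floor:
            flagged.append((skill_md.parent.name, est))
    return flagged

if __name__ == "__main__":
    for name, est in too_trimmed(".copilot/skills"):
        print(f"review {name}: ~{est} est. tokens, below the routing floor")
```

A check like this would have caught the second blocker before the adversarial review pass did.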
+ +### Final Numbers + +``` +Before: 413,591 tokens (117 skills, 136 files) +After: 143,354 tokens (114 skills, 130 files) +Saved: 270,237 tokens (65.3% reduction) +``` + +This reflects the main optimization PR. The workbench is cleaner — every tool is still there, but they're in labeled drawers instead of piled on the surface. The Bonus Round consolidation happened in a separate session and is described next. + +## Bonus Round: From Shared References to Unified Skills + +After the optimization PR shipped, I ran `waza_quality` on the six SDK skills and noticed: **isolation violation**. Each skill had its own `SKILL.md`, its own routing, its own boilerplate duplicated across six files. That's a pattern violation — not atomic, not clean. + +So I rethought it. Three options existed: + +1. **Accept the low score** — good-enough for a domain-specific exception +2. **Inline shared content** — copy the 14 shared reference files into each skill, double the maintenance burden +3. **Single skill with language dispatch** — collapse all 7 (6 SDK languages + 1 quickstart) into one unified skill with language auto-detection + +I went with option 3. Not because it was obvious, but because it matched the actual use case: agents almost never review samples in *all* languages simultaneously. They review one language based on the codebase. A single skill with smart routing was actually more correct than the multi-skill pretense. + +Result: `azure-sdk-sample-review/` — one skill, 469 tokens in `SKILL.md`, language auto-detection via prompt analysis. 
Structure: + +``` +azure-sdk-sample-review/ +├── SKILL.md (469 tokens) — routing + dispatch logic +├── evals/ (7 tasks, 100% passing) +├── references/ +│ ├── shared/ (14 files: generic best practices) +│ ├── dotnet/, go/, java/, python/, rust/, typescript/ +│ └── quickstart/ +``` + +![SDK reference consolidation: before/after, single source of truth](./media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png) + +Eliminated six duplicated routing SKILL.md files. One dispatch mechanism instead of six. Waza compliance achieved — no more isolation violations. All 7 behavioral evals running at 100%. + +This is the evolution: **reference extraction → shared references → unified skill with internal routing.** Each step felt right at the time. Looking back, the final architecture is simpler and more correct. Wish I'd seen it from the start. (Work captured in a follow-up PR.) + +## Dogfooding: The Reskill Skill + +I captured the optimization pipeline as a skill — `dina-reskill` — documenting the 8-pattern workflow (reference extraction, checklist compression, example pruning, and so on). + +Then I ran it on itself, because apparently I can't leave well enough alone: + +``` +SKILL.md: 2,085 → 1,163 tokens (44% reduction) +Total: 5,401 → 4,288 tokens (21% reduction) +``` + +Three review passes: two approvals, one note flagged and fixed. The skill practices what it documents. The SDK skills themselves evolved further after this (described in Bonus Round) — from six separate skills down to a single unified skill. So while `dina-reskill` captures the second-pass improvements here, the SDK consolidation shows how those patterns continue to evolve as you live with them. + +## What Actually Worked: The Patterns + +If you're building a skills optimization workflow, here are the patterns ranked by impact: + +### 1. Reference Extraction + +**Principle:** Move detailed rules, code examples, and verbose explanations into `references/` files. 
The `SKILL.md` becomes a slim routing layer — overview, quick checklist, blocker list. Agents load references on demand. + +**When to use:** For any skill over 5K tokens, this should be your first move. Start here, not somewhere else. + +**Example:** The Java SDK skill went from 25,841 tokens (inline rules + examples) to 1,541 tokens (routing + checklist) by extracting ~24K into `references/java/`. + +### 2. Checklist Compression + +**Principle:** Turn paragraph-style guidance into concise checklists. Same information, fraction of the tokens. + +**Example:** +- Before: "When reviewing error handling, ensure that all errors are properly caught, logged with appropriate context, and returned with meaningful messages to the caller" +- After: "✅ Errors: caught, logged with context, meaningful messages" + +### 3. Example Pruning + +**Principle:** One good example per pattern. If your skill has 3 examples of the same concept, keep the clearest one and move the others to references. + +### 4. Shared References → Unified Skill Routing + +**Principle:** If multiple skills share common guidance, the first instinct is to extract it once and link. That works, but it's often a stepping stone to something better: collapsing near-identical skills into one skill with internal dispatch logic. + +**When this works:** When you have N near-identical skills differing only by one dimension (language, framework, etc.), a unified skill with auto-detection is cleaner than N separate skills with shared references. + +**Trade-off:** One `SKILL.md`, one set of evals, one routing boundary. Zero isolation violations. But your `SKILL.md` becomes more complex. + +### 5. Stub Elimination + +**Principle:** If a skill just redirects to another skill, delete it. The router doesn't need a placeholder, and stubs confuse future agents trying to decide what to use. + +## Honest Lessons: How I Should Have Run This + +The work happened over 8 user messages and 2 hours. 
+Here's what went sideways and what would have prevented it:
+
+| What Happened | What Would Have Been Better |
+|---------------|---------------------------|
+| SDK dedup discovered late (turn 6–8) | Mention "deduplicate shared content" upfront as a known phase |
+| Asking about PR + review + results separately | Bundle deliverables: "PR, team review, results file" in one request |
+| Phases 4–5 required separate prompts | Front-load scope: "all phases including medium skills" keeps momentum |
+
+**The pattern that would have worked:** Front-load three things:
+1. The technique or tool (`waza_tokens`, reference extraction, etc.)
+2. Full scope with known edge cases (all 6 phases, ~800-token floor, etc.)
+3. All deliverables you want at the end
+
+All of it goes in one message. The AI doesn't lose patience — you do. Every "keep going" prompt is a planning failure you're paying for at execution prices.
+
+## The Setup
+
+For reference, here's what I was running:
+
+- **[GitHub Copilot CLI](https://github.com/github/copilot-cli)** v1.0.40
+- **[Squad](https://github.com/bradygaster/squad)** v0.9.4-insider.1 for multi-agent orchestration
+- **[microsoft/waza](https://github.com/microsoft/waza)** for skill quality analysis
+- **Model:** Claude Opus 4.6 with 200K context window
+
+## Where to Go From Here
+
+To determine if your own skills directory needs this treatment: run `waza_tokens count` and see the total. **If it's over 100K tokens, you have meaningful room to optimize.** If you have skills over 5K tokens, reference extraction is almost always worth it.
+
+Everyone's skill architecture is different — the interesting work is figuring out which patterns actually fit your setup. If you try these and discover something that works or something that breaks, I'd be curious to hear what you found.
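Those two thresholds make a serviceable triage script. A sketch with the same characters-divided-by-four approximation; the 100K and 5K cutoffs are the rules of thumb above, not waza-defined limits:

```python
# Quick triage gate for a skills directory. Token counts are
# approximated as characters / 4; the 100K and 5K thresholds are
# rules of thumb from this post, not limits defined by waza.
from pathlib import Path

def triage(skills_dir: str) -> dict:
    sizes = {
        p.parent.name: len(p.read_text(encoding="utf-8")) // 4
        for p in Path(skills_dir).glob("*/SKILL.md")
    }
    total = sum(sizes.values())
    return {
        "total_est_tokens": total,
        "worth_optimizing": total > 100_000,
        "extraction_candidates": sorted(n for n, t in sizes.items() if t > 5_000),
    }
```

If `worth_optimizing` comes back true, start with the extraction candidates, largest first.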
+
+
+![The same workbench, now organized — tools on pegboard, labeled drawers, clean surface](./media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop-after.png)
+
+Same workshop. Same tools. Better organized. That's what 270K tokens of optimization looks like.
+
+Main optimization session: May 11, 2026. 8 user messages, ~2 hours, 270K tokens saved. The Bonus Round consolidation happened in a follow-up session.
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image11.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image11.png
deleted file mode 100644
index 39685f0..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image11.png and /dev/null differ
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image14.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image14.png
deleted file mode 100644
index 6e41c52..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image14.png and /dev/null differ
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image15.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image15.png
deleted file mode 100644
index fae0887..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image15.png and /dev/null differ
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image17.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image17.png
deleted file mode 100644
index f74f611..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image17.png and /dev/null differ
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image21.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image21.png
deleted file mode 100644
index 3a75f54..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image21.png and /dev/null differ
diff --git a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image6.png b/website/blog/media/2026-05-09-tuning-up-copilot-skills/image6.png
deleted file mode 100644
index 81c7867..0000000
Binary files a/website/blog/media/2026-05-09-tuning-up-copilot-skills/image6.png and /dev/null differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.mmd
new file mode 100644
index 0000000..2ce3dd0
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.mmd
@@ -0,0 +1,18 @@
+flowchart TD
+    start["📊 Starting Total<br/>413,591 tokens"]
+    phase1["Phase 1: Kill Stubs<br/>Saved: 73 tokens"]
+    after1["Remaining: 413,518"]
+    phase2["Phase 2: Refactor SDK Giants<br/>Saved: 131,333 tokens"]
+    after2["Remaining: 282,185"]
+    phase3["Phase 3: Optimize Large Skills<br/>Saved: 68,084 tokens"]
+    after3["✅ Remaining: 214,101<br/>Total Saved: ~199,490"]
+
+    start --> phase1 --> after1 --> phase2 --> after2 --> phase3 --> after3
+
+    style start fill:#334155,stroke:#94a3b8,color:#f1f5f9
+    style phase1 fill:#1e3a5f,stroke:#60a5fa,color:#e0f2fe
+    style after1 fill:#1e293b,stroke:#64748b,color:#e2e8f0
+    style phase2 fill:#7f1d1d,stroke:#f87171,color:#fef2f2
+    style after2 fill:#1e293b,stroke:#64748b,color:#e2e8f0
+    style phase3 fill:#3b2f0a,stroke:#facc15,color:#fefce8
+    style after3 fill:#052e16,stroke:#22c55e,color:#bbf7d0
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.png
new file mode 100644
index 0000000..a79de23
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/cumulative-savings.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.mmd
new file mode 100644
index 0000000..832e19f
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.mmd
@@ -0,0 +1,13 @@
+flowchart LR
+    before["🔴 BEFORE<br/>413,591 tokens<br/>117 skills"]
+    arrow["➡️ 65.3%<br/>REDUCTION"]
+    after["🟢 AFTER<br/>143,354 tokens<br/>114 skills"]
+    saved["💰 270,237 tokens saved"]
+
+    before --> arrow --> after
+    arrow --> saved
+
+    style before fill:#7f1d1d,stroke:#ef4444,color:#fecaca,font-size:18px
+    style arrow fill:#1a1a3b,stroke:#a78bfa,color:#ede9fe,font-size:20px
+    style after fill:#052e16,stroke:#22c55e,color:#bbf7d0,font-size:18px
+    style saved fill:#3b2f0a,stroke:#facc15,color:#fefce8,font-size:16px
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.png
new file mode 100644
index 0000000..ed0713e
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/final-token-reduction.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop-after.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop-after.png
new file mode 100644
index 0000000..bc3df2e
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop-after.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop.png
new file mode 100644
index 0000000..ea94beb
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/hero-skill-workshop.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.mmd
new file mode 100644
index 0000000..a4958e5
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.mmd
@@ -0,0 +1,18 @@
+flowchart TD
+    baseline["🎯 Baseline: 413,591 tokens<br/>117 skills across 136 files"]
+    p1["Phase 1: Kill Stubs<br/>3 empty redirect skills<br/>~73 tokens saved"]
+    p2["Phase 2: Refactor SDK Giants<br/>6 SDK review skills (140K!)<br/>~120K tokens saved"]
+    p3["Phase 3: Optimize Large Skills<br/>14 skills (5K–10K each)<br/>~50K tokens saved"]
+    p4["Phase 4: Optimize Medium Skills<br/>60 skills (1K–5K each)<br/>~20K tokens saved"]
+    p5["Phase 5: Trim Small Skills<br/>20 skills (under 1K)<br/>Minimal savings"]
+    p6["Phase 6: Audit References<br/>Large reference files<br/>~15K tokens saved"]
+
+    baseline --> p1 --> p2 --> p3 --> p4 --> p5 --> p6
+
+    style baseline fill:#334155,stroke:#94a3b8,color:#f1f5f9
+    style p1 fill:#1e3a5f,stroke:#60a5fa,color:#e0f2fe
+    style p2 fill:#7f1d1d,stroke:#f87171,color:#fef2f2
+    style p3 fill:#3b2f0a,stroke:#facc15,color:#fefce8
+    style p4 fill:#1a3a2a,stroke:#4ade80,color:#f0fdf4
+    style p5 fill:#1e293b,stroke:#64748b,color:#e2e8f0
+    style p6 fill:#1e293b,stroke:#64748b,color:#e2e8f0
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.png
new file mode 100644
index 0000000..dbac953
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/optimization-phases-plan.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.mmd
new file mode 100644
index 0000000..a5ef15f
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.mmd
@@ -0,0 +1,23 @@
+flowchart TD
+    pr["🔀 PR #147<br/>Copilot Skills Token Optimization"]
+    stats["📁 106 files changed<br/>+12,176 insertions / −18,571 deletions"]
+
+    subgraph reviews["4 Review Passes"]
+        r1["✅ Structural Integrity"]
+        r2["✅ waza_quality Scores"]
+        r3["✅ Trigger Precision"]
+        r4["⚠️→✅ Adversarial Check<br/>2 blockers found and fixed"]
+    end
+
+    result["🚀 SHIP<br/>65% token reduction achieved"]
+
+    pr --> stats --> reviews --> result
+
+    style pr fill:#1e3a5f,stroke:#60a5fa,color:#e0f2fe
+    style stats fill:#334155,stroke:#94a3b8,color:#f1f5f9
+    style reviews fill:#1a1a2e,stroke:#64748b,color:#e2e8f0
+    style r1 fill:#052e16,stroke:#22c55e,color:#bbf7d0
+    style r2 fill:#052e16,stroke:#22c55e,color:#bbf7d0
+    style r3 fill:#052e16,stroke:#22c55e,color:#bbf7d0
+    style r4 fill:#3b2f0a,stroke:#facc15,color:#fefce8
+    style result fill:#0a3b1a,stroke:#4ade80,color:#f0fdf4
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.png
new file mode 100644
index 0000000..9af9f68
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/pr-summary.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.mmd
new file mode 100644
index 0000000..3c15375
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.mmd
@@ -0,0 +1,43 @@
+flowchart LR
+    subgraph before["BEFORE: 140,318 tokens"]
+        j1["☕ Java<br/>25,841"]
+        py1["🐍 Python<br/>23,921"]
+        go1["🔷 Go<br/>24,355"]
+        dn1["🟣 .NET<br/>23,355"]
+        ts1["📘 TypeScript<br/>21,543"]
+        rs1["🦀 Rust<br/>21,303"]
+    end
+
+    subgraph after["AFTER: 8,985 tokens"]
+        j2["☕ Java<br/>1,541"]
+        py2["🐍 Python<br/>1,083"]
+        go2["🔷 Go<br/>1,815"]
+        dn2["🟣 .NET<br/>1,378"]
+        ts2["📘 TypeScript<br/>1,525"]
+        rs2["🦀 Rust<br/>1,643"]
+    end
+
+    j1 -->|"94%"| j2
+    py1 -->|"95%"| py2
+    go1 -->|"93%"| go2
+    dn1 -->|"94%"| dn2
+    ts1 -->|"93%"| ts2
+    rs1 -->|"92%"| rs2
+
+    saved["💰 131,333 tokens saved"]
+
+    style before fill:#3b1010,stroke:#f87171,color:#fef2f2
+    style after fill:#0a3b1a,stroke:#4ade80,color:#f0fdf4
+    style saved fill:#1a1a3b,stroke:#a78bfa,color:#ede9fe
+    style j1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style py1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style go1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style dn1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style ts1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style rs1 fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style j2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style py2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style go2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style dn2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style ts2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style rs2 fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.png
new file mode 100644
index 0000000..4b1e740
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/sdk-skills-before-after.png differ
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.mmd b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.mmd
new file mode 100644
index 0000000..3646b3d
--- /dev/null
+++ b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.mmd
@@ -0,0 +1,45 @@
+flowchart LR
+    subgraph before_state["BEFORE: 86 files, 45% duplicated"]
+        b_java["☕ Java<br/>references/"]
+        b_python["🐍 Python<br/>references/"]
+        b_go["🔷 Go<br/>references/"]
+        b_dotnet["🟣 .NET<br/>references/"]
+        b_ts["📘 TypeScript<br/>references/"]
+        b_rust["🦀 Rust<br/>references/"]
+    end
+
+    subgraph after_state["AFTER: Single source of truth"]
+        shared["📦 shared-sdk-references/<br/>14 generic files"]
+        a_java["☕ Java-specific only"]
+        a_python["🐍 Python-specific only"]
+        a_go["🔷 Go-specific only"]
+        a_dotnet["🟣 .NET-specific only"]
+        a_ts["📘 TS-specific only"]
+        a_rust["🦀 Rust-specific only"]
+
+        a_java --> shared
+        a_python --> shared
+        a_go --> shared
+        a_dotnet --> shared
+        a_ts --> shared
+        a_rust --> shared
+    end
+
+    update["✏️ Update 1 file, not 6"]
+
+    style before_state fill:#3b1010,stroke:#f87171,color:#fef2f2
+    style after_state fill:#0a3b1a,stroke:#4ade80,color:#f0fdf4
+    style shared fill:#1a1a3b,stroke:#a78bfa,color:#ede9fe
+    style update fill:#3b2f0a,stroke:#facc15,color:#fefce8
+    style b_java fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style b_python fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style b_go fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style b_dotnet fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style b_ts fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style b_rust fill:#4a1515,stroke:#ef4444,color:#fecaca
+    style a_java fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style a_python fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style a_go fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style a_dotnet fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style a_ts fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
+    style a_rust fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0
diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png
new file mode 100644
index 0000000..997adf6
Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png differ
14 generic files"] + a_java["☕ Java-specific only"] + a_python["🐍 Python-specific only"] + a_go["🔷 Go-specific only"] + a_dotnet["🟣 .NET-specific only"] + a_ts["📘 TS-specific only"] + a_rust["🦀 Rust-specific only"] + + a_java --> shared + a_python --> shared + a_go --> shared + a_dotnet --> shared + a_ts --> shared + a_rust --> shared + end + + update["✏️ Update 1 file, not 6"] + + style before_state fill:#3b1010,stroke:#f87171,color:#fef2f2 + style after_state fill:#0a3b1a,stroke:#4ade80,color:#f0fdf4 + style shared fill:#1a1a3b,stroke:#a78bfa,color:#ede9fe + style update fill:#3b2f0a,stroke:#facc15,color:#fefce8 + style b_java fill:#4a1515,stroke:#ef4444,color:#fecaca + style b_python fill:#4a1515,stroke:#ef4444,color:#fecaca + style b_go fill:#4a1515,stroke:#ef4444,color:#fecaca + style b_dotnet fill:#4a1515,stroke:#ef4444,color:#fecaca + style b_ts fill:#4a1515,stroke:#ef4444,color:#fecaca + style b_rust fill:#4a1515,stroke:#ef4444,color:#fecaca + style a_java fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 + style a_python fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 + style a_go fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 + style a_dotnet fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 + style a_ts fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 + style a_rust fill:#0a3b1a,stroke:#22c55e,color:#bbf7d0 diff --git a/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png new file mode 100644 index 0000000..997adf6 Binary files /dev/null and b/website/blog/media/2026-05-11-tuning-up-copilot-skills/shared-references-architecture.png differ