The fundamental problem: scores measure presence, not quality
The learn command's premise is that high-scoring attributes represent genuinely good examples worth extracting as reusable skills. This premise is false.
A score of 100 on claude_md_file means the file exists and contains more than 50 bytes. The content could be completely useless. type_annotations scores on what percentage of functions have type hints, not whether the types are correct or meaningful. lock_files scores on whether a lock file exists, not whether dependency management is actually healthy. This pattern holds across all five attributes the feature can generate skills for.
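To make that concrete, the claude_md_file check reduces to something like the sketch below. The function name and structure are assumptions, not the assessor's actual code; the point is how little the check verifies.

```python
# Illustrative sketch only -- not the actual assessor implementation.
from pathlib import Path

def score_claude_md(repo_root: Path) -> int:
    """Return 100 if CLAUDE.md exists and holds more than 50 bytes, else 0."""
    claude_md = repo_root / "CLAUDE.md"
    if claude_md.is_file() and claude_md.stat().st_size > 50:
        return 100  # says nothing about whether the content is any good
    return 0
```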
So learn does not find excellent implementations and extract wisdom from them. It finds repos where certain files exist and treats them as gold standards. The ETH Zurich study (Feb 2026) found that auto-generated context files actually hurt performance (-3% success rate). A skill derived from an arbitrary CLAUDE.md that happened to exceed 50 bytes is closer to that category than to a genuinely useful example.
For learning to work in principle, the assessors would need to measure quality - which would require semantic understanding of content, not file existence and byte counts. That is a substantially harder problem and arguably the one that warrants AI involvement, not the downstream skill-generation step.
Additional implementation problems
Even setting the fundamental issue aside, the implementation has several problems:
Hardcoded to 5 skills out of 25 assessors. The pattern extractor has a hardcoded list of exactly 5 attribute IDs that can ever produce a skill. Any attribute not in that list is silently skipped. New assessors will never generate skills regardless of their scores. The feature is already stale relative to the codebase it lives in.
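A sketch of that gate, for illustration only: the real list holds five IDs (only the three named in this issue are shown here), and the extractor's actual structure and field names may differ.

```python
# Hypothetical sketch of the hardcoded eligibility gate.
SKILL_ELIGIBLE_ATTRIBUTES = {
    "claude_md_file",
    "type_annotations",
    "lock_files",
    # ...plus the remaining two hardcoded attribute IDs
}

def extract_skill_candidates(findings):
    # Findings for any attribute outside the hardcoded set are silently
    # dropped, so newly added assessors can never produce skills.
    return [f for f in findings if f.attribute_id in SKILL_ELIGIBLE_ATTRIBUTES]
```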
The metrics are fabricated. The output displays "confidence", "impact", and "reusability" scores that sound empirically derived but aren't. Confidence is just the finding's score. Impact is hardcoded by tier. Reusability is computed as 100 - (tier - 1) * 20, an arbitrary formula.
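Roughly, the displayed numbers come from something like the following. The tier-to-impact values here are hypothetical; the other two rules are as described above.

```python
# Sketch of how the reported metrics are derived (illustrative names/values).
def skill_metrics(finding_score: int, tier: int) -> dict[str, int]:
    return {
        "confidence": finding_score,                    # the finding's score, relabeled
        "impact": {1: 90, 2: 70, 3: 50}.get(tier, 50),  # hardcoded per tier (values assumed)
        "reusability": 100 - (tier - 1) * 20,           # the arbitrary formula
    }
```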
The LLM output is poorly assembled. DiscoveredSkill has no dedicated fields for instructions, best practices, or anti-patterns, so _merge_enrichment appends all of them into the code_examples list as === SECTION ===-delimited text blocks.
Code sampling is arbitrary. For type annotations, the sampler globs **/*.py and grabs up to 5 Python files in whatever order they turn up. Whether those files demonstrate the pattern well is a matter of luck.
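In effect, the sampling is no more selective than this sketch (the function name and the use of islice are assumptions; the substance is the unfiltered glob and the cap of 5):

```python
# Minimal sketch of the sampling behaviour described above.
from itertools import islice
from pathlib import Path

def sample_python_files(repo_root: Path, limit: int = 5) -> list[Path]:
    # Takes whichever files the glob yields first; nothing checks that they
    # actually showcase good type annotations.
    return list(islice(repo_root.glob("**/*.py"), limit))
```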
The CI workflow compounds the problem
.github/workflows/continuous-learning.yml runs on every release and weekly on Sundays. It runs agentready assess, then agentready learn, and depending on configuration either opens GitHub issues for each discovered skill or creates a PR copying SKILL.md files into .claude/skills/. Since the workflow does not set ANTHROPIC_API_KEY, LLM enrichment is silently disabled every time it runs. The workflow is producing unenriched, low-value skill proposals on a recurring schedule from files that only proved they exist.
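The silent fallback looks roughly like the sketch below; the actual control flow inside agentready may differ, but this is the behavior the workflow exercises on every run.

```python
# Hypothetical sketch of the silent enrichment fallback.
import os
from typing import Callable

def maybe_enrich(skills: list, enrich: Callable) -> list:
    if not os.environ.get("ANTHROPIC_API_KEY"):
        # No key in the CI environment: no error, no warning -- the skills
        # simply pass through unenriched on every scheduled run.
        return skills
    return [enrich(skill) for skill in skills]
```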
Recommendation
Remove the learn and extract-skills commands, the learners/ module, and .github/workflows/continuous-learning.yml. If a skill-generation feature is revisited, it should start from assessors that genuinely measure implementation quality rather than presence.
Opened by Claude Code under the supervision of Bill Murdock.