
Remove the learn command #395

@jwm4

Description


The fundamental problem: scores measure presence, not quality

The learn command's premise is that high-scoring attributes represent genuinely good examples worth extracting as reusable skills. This premise is false.

A score of 100 on claude_md_file means the file exists and contains more than 50 bytes. The content could be completely useless. type_annotations scores on what percentage of functions have type hints, not whether the types are correct or meaningful. lock_files scores on whether a lock file exists, not whether dependency management is actually healthy. This pattern holds across all five attributes the feature can generate skills for.

So learn does not find excellent implementations and extract wisdom from them. It finds repos where certain files exist and treats them as gold standards. The ETH Zurich study (Feb 2026) found that auto-generated context files actually hurt performance (-3% success rate). A skill derived from an arbitrary CLAUDE.md that happened to exceed 50 bytes is closer to that category than to a genuinely useful example.

For learning to work in principle, the assessors would need to measure quality, which would require semantic understanding of content rather than file existence and byte counts. That is a substantially harder problem, and arguably the one that warrants AI involvement, not the downstream skill-generation step.

Additional implementation problems

Even setting the fundamental issue aside, the implementation has several problems:

Hardcoded to 5 skills out of 25 assessors. The pattern extractor has a hardcoded list of exactly 5 attribute IDs that can ever produce a skill. Any attribute not in that list is silently skipped. New assessors will never generate skills regardless of their scores. The feature is already stale relative to the codebase it lives in.
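The allowlist problem looks roughly like the following sketch. The set contents beyond the three attribute IDs named in this issue, and the extractor's structure, are assumptions:

```python
# Illustrative sketch of the hardcoded-allowlist problem; the real
# extractor's structure and the remaining attribute IDs may differ.
SKILL_ELIGIBLE_ATTRIBUTES = {
    "claude_md_file",
    "type_annotations",
    "lock_files",
    # ... exactly five IDs, frozen at authoring time
}

def extract_skills(findings: list[dict]) -> list[dict]:
    skills = []
    for finding in findings:
        if finding["attribute_id"] not in SKILL_ELIGIBLE_ATTRIBUTES:
            continue  # any attribute outside the list is silently skipped
        skills.append({"attribute": finding["attribute_id"],
                       "score": finding["score"]})
    return skills
```

A finding from any of the other 20 assessors, no matter how high it scores, falls through the `continue` and never produces a skill.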

The metrics are fabricated. The output displays "confidence", "impact", and "reusability" scores that sound empirically derived but aren't. Confidence is just the finding's score. Impact is hardcoded by tier. Reusability is 100 - (tier - 1) * 20 - an arbitrary formula.
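A sketch of how those three metrics are derived, per the description above. The exact impact-by-tier mapping is an assumption; the reusability formula is the one quoted in this issue:

```python
# Sketch of the fabricated metrics. The impact mapping is an assumed
# example of "hardcoded by tier"; the reusability formula is as quoted.
def skill_metrics(score: int, tier: int) -> dict:
    return {
        "confidence": score,                    # just the finding's score
        "impact": {1: "high", 2: "medium"}.get(tier, "low"),  # hardcoded by tier
        "reusability": 100 - (tier - 1) * 20,   # arbitrary linear formula
    }
```

None of the three values incorporates any observation beyond the presence-based score and the tier number, so presenting them as empirical measurements is misleading.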

The LLM output is poorly assembled. DiscoveredSkill has no dedicated fields for instructions, best practices, or anti-patterns, so _merge_enrichment appends all of them into the code_examples list as === SECTION ===-delimited text blocks.
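The merge behavior amounts to something like this sketch; field and function names are illustrative rather than the actual implementation:

```python
# Sketch of the merge problem: with no dedicated fields for
# instructions, best practices, or anti-patterns, everything is
# flattened into code_examples as delimited text. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DiscoveredSkill:
    name: str
    code_examples: list[str] = field(default_factory=list)

def merge_enrichment(skill: DiscoveredSkill, enrichment: dict) -> None:
    for section in ("instructions", "best_practices", "anti_patterns"):
        if section in enrichment:
            skill.code_examples.append(
                f"=== {section.upper()} ===\n{enrichment[section]}"
            )
```

The result is that structured LLM output becomes undifferentiated text blobs in a list whose name promises code, which any downstream consumer then has to re-parse by delimiter.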

Code sampling is arbitrary. For type annotations, the sampler globs **/*.py and grabs up to 5 arbitrary Python files. Whether those files demonstrate the pattern well is a matter of luck.

The CI workflow compounds the problem

.github/workflows/continuous-learning.yml runs on every release and weekly on Sundays. It runs agentready assess, then agentready learn, and depending on configuration either opens GitHub issues for each discovered skill or creates a PR copying SKILL.md files into .claude/skills/. Since the workflow does not set ANTHROPIC_API_KEY, LLM enrichment is silently disabled every time it runs. The workflow is producing unenriched, low-value skill proposals on a recurring schedule from files that only proved they exist.
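The silent-disable mechanism presumably reduces to an environment check like the following sketch; the function name is illustrative, but the gating variable is the one named above:

```python
# Sketch of why enrichment silently turns off in CI: availability is
# gated on an env var, and its absence downgrades behavior without
# failing the job or emitting a warning. Function name is illustrative.
import os

def enrichment_enabled() -> bool:
    return bool(os.environ.get("ANTHROPIC_API_KEY"))
```

Since continuous-learning.yml never sets that variable, this check is always false in CI, so every scheduled run quietly produces the unenriched output path.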

Recommendation

Remove the learn and extract-skills commands, the learners/ module, and .github/workflows/continuous-learning.yml. If a skill-generation feature is revisited, it should start from assessors that genuinely measure implementation quality rather than presence.


Opened by Claude Code under the supervision of Bill Murdock.
