Pick a canonical llms.txt when multiple files are discovered #54
Open
SahilAujla wants to merge 1 commit into agent-ecosystem:main
When a site has both an apex `llms.txt` (often a marketing index) and a docs-section `llms.txt`, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal; see issue agent-ecosystem#53.

This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over `/llms.txt` when the user passed `/docs`, and the other way around when they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`) now operate on that single file via a new `getLlmsTxtFilesForAnalysis` helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets users point afdocs at an explicit canonical, bypassing the heuristic. When set, only that URL is probed and the cross-host fallback is skipped.
- The pass message now surfaces which file was chosen, and `details` exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for scripted consumers.

Made-with: Cursor
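The most-specific-prefix heuristic described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DiscoveredFile`, `prefixScore`, and `pickCanonical` are hypothetical names, and the scoring scale (directory length for a same-origin prefix match, `-1` for everything else) is an assumption chosen only to reproduce the behavior described above.

```typescript
// Hypothetical shape of a discovered file; the real afdocs type may differ.
interface DiscoveredFile {
  url: string;
}

// Score a candidate by how specific a prefix of the base URL its directory is.
// Assumption: cross-origin files and same-origin non-prefix files both get -1,
// so they always rank below any same-origin prefix match.
function prefixScore(candidateUrl: string, baseUrl: string): number {
  const candidate = new URL(candidateUrl);
  const base = new URL(baseUrl);
  if (candidate.origin !== base.origin) return -1;

  // Directory that contains llms.txt, e.g. "/docs/" for "/docs/llms.txt".
  const dir = candidate.pathname.slice(0, candidate.pathname.lastIndexOf("/") + 1);
  // Normalize the base path so "/docs" and "/docs/" behave the same.
  const basePath = base.pathname.endsWith("/") ? base.pathname : base.pathname + "/";

  return basePath.startsWith(dir) ? dir.length : -1;
}

// Pick the highest-scoring file. The strict ">" keeps the earlier candidate
// on ties, which preserves candidate-discovery order.
function pickCanonical(files: DiscoveredFile[], baseUrl: string): DiscoveredFile | undefined {
  let best: DiscoveredFile | undefined;
  let bestScore = -Infinity;
  for (const file of files) {
    const score = prefixScore(file.url, baseUrl);
    if (score > bestScore) {
      best = file;
      bestScore = score;
    }
  }
  return best;
}
```

With candidates `/llms.txt` and `/docs/llms.txt` on `example.com`, this sketch picks `/docs/llms.txt` for a base of `example.com/docs` and `/llms.txt` for a base of `example.com`.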
Closes #53.
Summary
When a site has both an apex `llms.txt` (e.g., for marketing) and a docs-section `llms.txt`, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal.

This PR makes `llms-txt-exists` pick a single canonical file, and downstream checks operate on that one file.

Selection rule
Most-specific prefix of the `baseUrl` wins. Among the discovered candidates:

| `baseUrl` | Discovered files | Canonical |
|---|---|---|
| `example.com/docs` | `/llms.txt`, `/docs/llms.txt` | `/docs/llms.txt` |
| `example.com` | `/llms.txt`, `/docs/llms.txt` | `/llms.txt` |
| `example.com/docs/v1` | `/llms.txt`, `/docs/llms.txt`, `/docs/v1/llms.txt` | `/docs/v1/llms.txt` |
| `example.com/docs/v1` | `/llms.txt`, `/docs/llms.txt` | `/docs/llms.txt` |

Ties resolve to candidate discovery order (already `baseUrl` → origin → docs). Files on a different origin (e.g., discovered via cross-host redirect fallback) score below any same-origin prefix match.

What downstream checks do now
- `llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`: operate on the canonical file only
- Sampling (`getUrlsFromCachedLlmsTxt`, `fetchLlmsTxtUrls`): operates on the canonical file only
- `llms-txt-freshness`: operates on the canonical file only
- `cache-header-hygiene`: keeps probing every discovered file

The full list of discovered files is still available in `details.discoveredFiles` for visibility.

New fields:

- `details.canonicalLlmsTxt`: the chosen file
- `details.canonicalUrl`: convenience string
- `details.canonicalSource`: `'heuristic'` or `'explicit'` (when multiple files exist or `--llms-txt-url` is used)

The `llms-txt-exists` pass message also names the canonical file inline.

New `--llms-txt-url` flag

For cases where the heuristic isn’t correct (non-standard path, monorepo, pre-publish verification), users can specify the canonical explicitly with `--llms-txt-url`.

Also accepted as `options.llmsTxtUrl` in `agent-docs.config.yml`.

When set:

- Only that URL is probed.
- The cross-host fallback is skipped.
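The explicit-override branch can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `ResolveOptions`, `ProbePlan`, and `planProbes` are hypothetical names, and the concrete candidate paths in the heuristic branch are guesses that only follow the `baseUrl` → origin → docs discovery order mentioned earlier. The `llmsTxtUrl` option name and the `'heuristic'` / `'explicit'` values come from the PR description.

```typescript
// Hypothetical option and result shapes; only `llmsTxtUrl` and the two
// source values are taken from the PR description.
interface ResolveOptions {
  llmsTxtUrl?: string;
}

interface ProbePlan {
  candidates: string[];        // URLs to probe, in order
  crossHostFallback: boolean;  // whether a cross-host fallback may run
  source: "heuristic" | "explicit";
}

function planProbes(baseUrl: string, options: ResolveOptions): ProbePlan {
  // Explicit canonical: probe only that URL and skip the cross-host fallback.
  if (options.llmsTxtUrl) {
    return { candidates: [options.llmsTxtUrl], crossHostFallback: false, source: "explicit" };
  }

  // Heuristic branch. Assumed candidate paths, ordered baseUrl -> origin -> docs.
  const base = new URL(baseUrl);
  const basePath = base.pathname.replace(/\/+$/, "");
  const candidates = [
    `${base.origin}${basePath}/llms.txt`,
    `${base.origin}/llms.txt`,
    `${base.origin}/docs/llms.txt`,
  ];

  // Drop duplicates while keeping the first occurrence, so order is preserved.
  const unique = candidates.filter((url, index) => candidates.indexOf(url) === index);
  return { candidates: unique, crossHostFallback: true, source: "heuristic" };
}
```

Keeping the override as an early return means the heuristic path is untouched when the option is absent, which matches the description of the flag as a bypass.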
Real-world evidence: alchemy.com/docs
The issue reporter’s example.
Before this PR, scoring `https://www.alchemy.com/docs` picked the 159K-character apex marketing `llms.txt` as canonical (because it was larger and discovered alongside the docs file).

- Canonical before: `alchemy.com/llms.txt` (159K, ~683 marketing links), pulling in pages under `/blog/`, `/case-studies/`, `/overviews/`
- Canonical after: `alchemy.com/docs/llms.txt` (495 chars, 6 docs sections), limited to pages under `/docs/`
- Checks affected: `llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`

The remaining failures after this PR (e.g., `content-negotiation`, `page-size-html`) are real signal about the docs section that the apex was previously masking.

Test plan
24 new tests across:

- `test/unit/helpers/llms-txt.test.ts`
- `test/unit/checks/llms-txt-exists.test.ts`
- `test/unit/cli/check-command.test.ts`
- `test/integration/check-pipeline.test.ts`

Includes:

- `--llms-txt-url` happy path, missing target, and cross-origin warning

All 849 existing tests pass

Lint clean (`npm run lint`)

Verified live against `https://www.alchemy.com/docs`

Files
A follow-up PR (#55) addresses a related bug discovered while testing this change: our aggregate `.txt` walker only descends one level, so multi-level nested indexes like Alchemy's chains structure are undercounted by `llms-txt-freshness`. That fix is intentionally scoped separately.