Pick a canonical llms.txt when multiple files are discovered #54

Open
SahilAujla wants to merge 1 commit into agent-ecosystem:main from SahilAujla:fix/canonical-llms-txt-selection

SahilAujla commented Apr 23, 2026

Closes #53.

Summary

When a site has both an apex llms.txt (e.g., for marketing) and a docs-section llms.txt, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal.

This PR makes llms-txt-exists pick a single canonical file, and downstream checks operate on that one file.

Selection rule

Most-specific prefix of the baseUrl wins. Among the discovered candidates:

| Base URL passed | Files found | Canonical |
| --- | --- | --- |
| example.com/docs | /llms.txt, /docs/llms.txt | /docs/llms.txt |
| example.com | /llms.txt, /docs/llms.txt | /llms.txt |
| example.com/docs/v1 | /llms.txt, /docs/llms.txt, /docs/v1/llms.txt | /docs/v1/llms.txt |
| example.com/docs/v1 | /llms.txt, /docs/llms.txt | /docs/llms.txt |

Ties resolve to candidate discovery order (already baseUrl → origin → docs). Files on a different origin (e.g., discovered via cross-host redirect fallback) score below any same-origin prefix match.
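The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Candidate`, `dirSegments`, `prefixScore`, and `pickCanonical` are hypothetical names, and scoring same-origin files whose directory is not a prefix of the base path at -1 (the same as cross-origin files) is an assumption made for the sketch.

```typescript
// Hypothetical shape; the real afdocs types may differ.
interface Candidate {
  url: string;
}

// Non-empty path segments; with dropFile, the trailing file name is removed,
// e.g. "/docs/v1/llms.txt" -> ["docs", "v1"].
function dirSegments(url: URL, dropFile: boolean): string[] {
  const parts = url.pathname.split("/").filter((p) => p.length > 0);
  return dropFile ? parts.slice(0, -1) : parts;
}

// Score a candidate against the base URL.
// Assumption: -1 for a different origin or for a directory that is not a
// prefix of the base path; otherwise the number of matching path segments.
function prefixScore(baseUrl: string, candidateUrl: string): number {
  const base = new URL(baseUrl);
  const cand = new URL(candidateUrl);
  if (base.origin !== cand.origin) return -1;
  const baseParts = dirSegments(base, false);
  const candParts = dirSegments(cand, true);
  if (candParts.length > baseParts.length) return -1;
  for (let i = 0; i < candParts.length; i++) {
    if (candParts[i] !== baseParts[i]) return -1;
  }
  return candParts.length;
}

// Pick the highest-scoring candidate. The strict ">" keeps the earlier
// candidate on ties, which preserves discovery order.
function pickCanonical(
  baseUrl: string,
  candidates: Candidate[],
): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const score = prefixScore(baseUrl, c.url);
    if (score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
}
```

Under these assumptions, calling `pickCanonical("https://example.com/docs/v1", ...)` with only the apex and `/docs` files discovered selects `/docs/llms.txt`, matching the last row of the table.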

What downstream checks do now

| Check | Behaviour |
| --- | --- |
| llms-txt-size, llms-txt-valid, llms-txt-links-resolve, llms-txt-links-markdown | Operate on the canonical file only |
| Sampling (getUrlsFromCachedLlmsTxt, fetchLlmsTxtUrls) | Pulls links from the canonical file only |
| llms-txt-freshness | Compares the canonical file against sitemap |
| cache-header-hygiene | Unchanged: probes all discovered files |

The full list of discovered files is still available in details.discoveredFiles for visibility.
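To illustrate the split between content checks and header probes, here is a small sketch. It is not the PR's actual helper: the `LlmsTxtFile` and `ExistsDetails` shapes, the `content` field, and all three function names are assumptions for illustration. Only `discoveredFiles` and `canonicalLlmsTxt` come from the description.

```typescript
// Hypothetical shapes; only discoveredFiles and canonicalLlmsTxt are named
// in the PR description, everything else is assumed for illustration.
interface LlmsTxtFile {
  url: string;
  content: string;
}

interface ExistsDetails {
  discoveredFiles: LlmsTxtFile[];
  canonicalLlmsTxt?: LlmsTxtFile;
}

// Content checks (size, validity, links, freshness, sampling) read only
// the canonical file.
function filesForContentChecks(details: ExistsDetails): LlmsTxtFile[] {
  return details.canonicalLlmsTxt ? [details.canonicalLlmsTxt] : [];
}

// Header probes such as cache-header-hygiene still see every discovered file.
function filesForHeaderProbes(details: ExistsDetails): LlmsTxtFile[] {
  return details.discoveredFiles;
}

// Collect Markdown link targets from the files a content check may read.
function linksForSampling(details: ExistsDetails): string[] {
  const links: string[] = [];
  const pattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  for (const file of filesForContentChecks(details)) {
    pattern.lastIndex = 0; // restart matching for each file
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(file.content)) !== null) {
      const target = m[1];
      if (target !== undefined) links.push(target);
    }
  }
  return links;
}
```

With an apex file and a docs file both discovered and the docs file marked canonical, `linksForSampling` returns only the docs links, so apex marketing pages no longer leak into the sample.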

New fields:

  • details.canonicalLlmsTxt: the chosen file
  • details.canonicalUrl: convenience string
  • details.canonicalSource: 'heuristic' (chosen by the prefix rule when multiple files exist) or 'explicit' (set via --llms-txt-url)
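A scripted consumer might read these fields along the following lines. The `CanonicalDetails` interface and `summarizeCanonical` function are hypothetical, and treating both fields as optional is an assumption. Only the field names and the two `canonicalSource` values come from this description.

```typescript
// Hypothetical subset of the llms-txt-exists details object.
// Assumption: both fields may be absent, e.g. when no file was found.
interface CanonicalDetails {
  canonicalUrl?: string;
  canonicalSource?: "heuristic" | "explicit";
}

// Produce a one-line summary of which file downstream checks used.
function summarizeCanonical(details: CanonicalDetails): string {
  if (!details.canonicalUrl) return "no canonical llms.txt reported";
  if (details.canonicalSource === "explicit") {
    return `${details.canonicalUrl} (set explicitly)`;
  }
  if (details.canonicalSource === "heuristic") {
    return `${details.canonicalUrl} (picked by heuristic)`;
  }
  return details.canonicalUrl;
}
```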

The llms-txt-exists pass message also names the canonical file inline:

PASS  llms-txt-exists  llms.txt found at 2 locations; using https://www.alchemy.com/docs/llms.txt as canonical

New --llms-txt-url flag

For cases where the heuristic isn’t correct (non-standard path, monorepo, pre-publish verification), users can specify the canonical explicitly:

afdocs check https://example.com/docs --llms-txt-url https://example.com/internal/llms.txt

Also accepted as options.llmsTxtUrl in agent-docs.config.yml.

When set:

  • Only that URL is probed
  • Cross-host fallback is skipped
  • If the URL doesn’t resolve, the check fails with an explicit message (no fallback)
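The override behaviour listed above could look roughly like this. It is a sketch under stated assumptions, not the PR's code: `FetchText`, `ProbeResult`, and `probeExplicit` are hypothetical names, the fetcher is injected so the example runs without network access, and the message wording is invented for illustration.

```typescript
// Hypothetical fetcher: resolves to the body, or null if the URL does not resolve.
type FetchText = (url: string) => Promise<string | null>;

// Hypothetical result shape for this sketch.
interface ProbeResult {
  status: "pass" | "fail";
  message: string;
  canonicalUrl?: string;
  canonicalSource?: "explicit";
}

// Probe exactly one URL. There is no candidate discovery and no cross-host
// fallback; a miss is reported as a failure rather than retried elsewhere.
async function probeExplicit(
  llmsTxtUrl: string,
  fetchText: FetchText,
): Promise<ProbeResult> {
  const body = await fetchText(llmsTxtUrl);
  if (body === null) {
    return {
      status: "fail",
      message: `${llmsTxtUrl} did not resolve; no fallback attempted`,
    };
  }
  return {
    status: "pass",
    message: `using ${llmsTxtUrl} as canonical`,
    canonicalUrl: llmsTxtUrl,
    canonicalSource: "explicit",
  };
}
```

Failing loudly instead of falling back keeps the explicit setting predictable: a typo in the URL surfaces as a failed check rather than a silent switch to a different file.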

Real-world evidence: alchemy.com/docs

The issue reporter’s example.

Before this PR, scoring https://www.alchemy.com/docs picked the 159K-character apex marketing llms.txt as canonical (because it was larger and discovered alongside the docs file).

| Metric | Before | After |
| --- | --- | --- |
| Canonical | alchemy.com/llms.txt (159K, ~683 marketing links) | alchemy.com/docs/llms.txt (495 chars, 6 docs sections) |
| llms-txt-size | FAIL (158,998 chars) | PASS (495 chars) |
| llms-txt-valid | WARN (apex missing blockquote) | PASS |
| llms-txt-links-resolve | 19/50 (sampled marketing links) | PASS: 6/6 same-origin |
| llms-txt-links-markdown | 19/50 | PASS: 6/6 (100%) markdown |
| Sampled URLs | /blog/, /case-studies/, /overviews/ | All under /docs/ |
| Content Discoverability | capped at D (llms-txt-size failure) | 100/100 (A+) |
| Overall score | 68 (D) | 88 (B) |

The remaining failures after this PR (e.g., content-negotiation, page-size-html) are real signal about the docs section that the apex was previously masking.

Test plan

  • 24 new tests across:

    • test/unit/helpers/llms-txt.test.ts
    • test/unit/checks/llms-txt-exists.test.ts
    • test/unit/cli/check-command.test.ts
    • test/integration/check-pipeline.test.ts
  • Includes:

    • canonical selection across all path-prefix scenarios
    • --llms-txt-url happy path, missing target, and cross-origin warning
    • end-to-end pipeline test mirroring the alchemy.com scenario
  • All 849 existing tests pass

  • Lint clean (npm run lint)

  • Verified live against https://www.alchemy.com/docs

Files

src/checks/content-discoverability/llms-txt-exists.ts        | wires up canonical pick + override
src/checks/content-discoverability/llms-txt-{size,valid,...} | use canonical via helper
src/helpers/llms-txt.ts                                      | selection + analysis helpers
src/helpers/get-page-urls.ts                                 | sampling uses canonical
src/checks/observability/cache-header-hygiene.ts             | comment-only (intentionally unchanged)
src/cli/commands/check.ts + src/types.ts                     | --llms-txt-url flag
docs/{checks/content-discoverability,reference/cli,reference/config-file}.md | user docs

A follow-up PR (#55) addresses a related bug discovered while testing this change: our aggregate .txt walker only descends one level, so multi-level nested indexes like Alchemy's chains structure are undercounted by llms-txt-freshness. That fix is intentionally scoped separately.

When a site has both an apex `llms.txt` (often a marketing index) and a
docs-section `llms.txt`, downstream checks were aggregating links from
both. Sampling, size, validity, and freshness all leaked apex pages into
docs runs (and vice versa), masking real signal — see issue agent-ecosystem#53.

This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a
  most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over
  `/llms.txt` when the user passed `/docs`, and the other way around when
  they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`,
  `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via
  `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`)
  now operate on that single file via a new `getLlmsTxtFilesForAnalysis`
  helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets
  users point afdocs at an explicit canonical, bypassing the heuristic.
  When set, only that URL is probed and the cross-host fallback is
  skipped.
- The pass message now surfaces which file was chosen, and `details`
  exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for
  scripted consumers.

Made-with: Cursor

Development

Successfully merging this pull request may close these issues.

Apex llms.txt drowns out {baseUrl}/llms.txt when both exist
