Pick a canonical llms.txt when multiple files are discovered by SahilAujla · Pull Request #54 · agent-ecosystem/afdocs

SahilAujla · 2026-04-23T20:16:41Z

Closes #53.

Summary

When a site has both an apex llms.txt (e.g., for marketing) and a docs-section llms.txt, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal.

This PR makes llms-txt-exists pick a single canonical file, and downstream checks operate on that one file.

Selection rule

Most-specific prefix of the baseUrl wins. Among the discovered candidates:

| Base URL passed | Files found | Canonical |
| --- | --- | --- |
| `example.com/docs` | `/llms.txt`, `/docs/llms.txt` | `/docs/llms.txt` |
| `example.com` | `/llms.txt`, `/docs/llms.txt` | `/llms.txt` |
| `example.com/docs/v1` | `/llms.txt`, `/docs/llms.txt`, `/docs/v1/llms.txt` | `/docs/v1/llms.txt` |
| `example.com/docs/v1` | `/llms.txt`, `/docs/llms.txt` | `/docs/llms.txt` |

Ties resolve to candidate discovery order (already baseUrl → origin → docs). Files on a different origin (e.g., discovered via cross-host redirect fallback) score below any same-origin prefix match.
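The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/helpers/llms-txt.ts`: the `Candidate` type, the `prefixScore` and `pickCanonical` names, and the numeric scores are all assumptions made for the example. In particular, ranking a same-origin file that is not a prefix of the base path (-1) above a cross-origin file (-2) is a guess; the description only says cross-origin files score below any same-origin prefix match.

```typescript
// Hypothetical candidate shape; the real afdocs types may differ.
interface Candidate {
  url: string;
}

// Score a candidate by how many leading path segments of baseUrl its
// directory covers. Assumed scale: >= 0 for a same-origin prefix match,
// -1 for same-origin non-prefix, -2 for a different origin.
function prefixScore(baseUrl: string, candidateUrl: string): number {
  const base = new URL(baseUrl);
  const cand = new URL(candidateUrl);
  if (base.origin !== cand.origin) return -2;

  const baseSegs = base.pathname.split("/").filter(Boolean);
  // Drop the trailing file name to get the directory the file lives in.
  const candSegs = cand.pathname.split("/").filter(Boolean).slice(0, -1);

  if (candSegs.length > baseSegs.length) return -1;
  for (let i = 0; i < candSegs.length; i++) {
    if (candSegs[i] !== baseSegs[i]) return -1;
  }
  return candSegs.length;
}

// Candidates are expected in discovery order. The strict ">" keeps the
// earlier candidate when two scores tie.
function pickCanonical(baseUrl: string, candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = Number.NEGATIVE_INFINITY;
  for (const candidate of candidates) {
    const score = prefixScore(baseUrl, candidate.url);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

With the apex and docs files from the table, passing `https://example.com/docs/v1` as the base selects `/docs/llms.txt`, since it is the deepest discovered file whose directory is a prefix of the base path.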

What downstream checks do now

| Check | Behaviour |
| --- | --- |
| `llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown` | Operate on the canonical file only |
| Sampling (`getUrlsFromCachedLlmsTxt`, `fetchLlmsTxtUrls`) | Pulls links from the canonical file only |
| `llms-txt-freshness` | Compares the canonical file against sitemap |
| `cache-header-hygiene` | Unchanged: probes all discovered files |

The full list of discovered files is still available in details.discoveredFiles for visibility.
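As a rough picture of how a downstream check might narrow its input to the canonical file, here is a sketch of a helper in the spirit of `getLlmsTxtFilesForAnalysis` (the name appears in the commit message). The signature, the `DiscoveredFile` and `LlmsTxtExistsDetails` shapes, and the fallback to all discovered files when no canonical is recorded are assumptions, not the actual implementation.

```typescript
// Hypothetical shapes; field names follow the PR description, types are assumed.
interface DiscoveredFile {
  url: string;
  content: string;
}

interface LlmsTxtExistsDetails {
  discoveredFiles: DiscoveredFile[];
  canonicalLlmsTxt?: DiscoveredFile;
}

// Return the files a size/validity/link check should analyse.
// Assumption: fall back to every discovered file if no canonical was recorded.
function getLlmsTxtFilesForAnalysis(details: LlmsTxtExistsDetails): DiscoveredFile[] {
  if (details.canonicalLlmsTxt) {
    return [details.canonicalLlmsTxt];
  }
  return details.discoveredFiles;
}

// Example consumer: total characters a size check would measure.
function totalChars(details: LlmsTxtExistsDetails): number {
  return getLlmsTxtFilesForAnalysis(details).reduce((sum, f) => sum + f.content.length, 0);
}
```

A check such as `cache-header-hygiene`, which still probes every file, would keep reading `discoveredFiles` directly instead of going through the helper.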

New fields:

- `details.canonicalLlmsTxt`: the chosen file
- `details.canonicalUrl`: convenience string
- `details.canonicalSource`: `'heuristic'` or `'explicit'` (when multiple files exist or `--llms-txt-url` is used)

The llms-txt-exists pass message also names the canonical file inline:

PASS  llms-txt-exists  llms.txt found at 2 locations; using https://www.alchemy.com/docs/llms.txt as canonical
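For scripted consumers, the new fields and the message above relate roughly as in the sketch below. The `CanonicalDetails` type and `formatPassMessage` function are hypothetical names for illustration. Only the multi-file wording is taken from the sample output; the single-file wording is a guess.

```typescript
type CanonicalSource = "heuristic" | "explicit";

// Hypothetical subset of the check's `details` object.
interface CanonicalDetails {
  discoveredFiles: { url: string }[];
  canonicalUrl?: string;
  canonicalSource?: CanonicalSource;
}

// Build the human-readable part of the pass message from the details.
function formatPassMessage(details: CanonicalDetails): string {
  const count = details.discoveredFiles.length;
  if (count > 1 && details.canonicalUrl) {
    return `llms.txt found at ${count} locations; using ${details.canonicalUrl} as canonical`;
  }
  // Assumed wording for the single-file case; not shown in the PR.
  const only = details.canonicalUrl ?? details.discoveredFiles[0]?.url ?? "unknown location";
  return `llms.txt found at ${only}`;
}
```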

New `--llms-txt-url` flag

For cases where the heuristic isn’t correct (non-standard path, monorepo, pre-publish verification), users can specify the canonical explicitly:

afdocs check https://example.com/docs --llms-txt-url https://example.com/internal/llms.txt

Also accepted as options.llmsTxtUrl in agent-docs.config.yml.

When set:

- Only that URL is probed
- Cross-host fallback is skipped
- If the URL doesn’t resolve, the check fails with an explicit message (no fallback)
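The three rules above amount to a short-circuit in front of normal discovery. The sketch below shows that control flow under stated assumptions: `resolveLlmsTxt`, `Probe`, `Resolution`, and the error strings are invented for illustration, and the probe is synchronous only to keep the example small (a real probe would make an HTTP request and be asynchronous).

```typescript
// Hypothetical probe: returns true when the URL serves an llms.txt.
type Probe = (url: string) => boolean;

type Resolution =
  | { ok: true; urls: string[]; source: "explicit" | "heuristic" }
  | { ok: false; message: string };

function resolveLlmsTxt(
  explicitUrl: string | undefined,
  discover: () => string[],
  probe: Probe,
): Resolution {
  if (explicitUrl !== undefined) {
    // Explicit override: probe only this URL. Discovery, including any
    // cross-host fallback, never runs on this path.
    if (probe(explicitUrl)) {
      return { ok: true, urls: [explicitUrl], source: "explicit" };
    }
    // No fallback: fail with a message that names the URL.
    return { ok: false, message: `--llms-txt-url ${explicitUrl} did not resolve` };
  }

  // No override: normal discovery, then the heuristic picks among the results.
  const urls = discover();
  return urls.length > 0
    ? { ok: true, urls, source: "heuristic" }
    : { ok: false, message: "no llms.txt found" };
}
```

Keeping the override as an early return makes the "no fallback" guarantee easy to see: nothing after the first `if` block can run once an explicit URL is supplied.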

Real-world evidence: alchemy.com/docs

The issue reporter’s example.

Before this PR, scoring https://www.alchemy.com/docs picked the 159K-character apex marketing llms.txt as canonical (because it was larger and discovered alongside the docs file).

| Metric | Before | After |
| --- | --- | --- |
| Canonical | `alchemy.com/llms.txt` (159K, ~683 marketing links) | `alchemy.com/docs/llms.txt` (495 chars, 6 docs sections) |
| `llms-txt-size` | FAIL (158,998 chars) | PASS (495 chars) |
| `llms-txt-valid` | WARN (apex missing blockquote) | PASS |
| `llms-txt-links-resolve` | 19/50 (sampled marketing links) | PASS (6/6 same-origin) |
| `llms-txt-links-markdown` | 19/50 | PASS (6/6, 100% markdown) |
| Sampled URLs | `/blog/`, `/case-studies/`, `/overviews/` | All under `/docs/` |
| Content Discoverability | capped at D (`llms-txt-size` failure) | 100/100 (A+) |
| Overall score | 68 (D) | 88 (B) |

The remaining failures after this PR (e.g., content-negotiation, page-size-html) are real signal about the docs section that the apex was previously masking.

Test plan

- 24 new tests across:
  - test/unit/helpers/llms-txt.test.ts
  - test/unit/checks/llms-txt-exists.test.ts
  - test/unit/cli/check-command.test.ts
  - test/integration/check-pipeline.test.ts
- Includes:
  - canonical selection across all path-prefix scenarios
  - `--llms-txt-url` happy path, missing target, and cross-origin warning
  - end-to-end pipeline test mirroring the alchemy.com scenario
- All 849 existing tests pass
- Lint clean (`npm run lint`)
- Verified live against https://www.alchemy.com/docs

Files

src/checks/content-discoverability/llms-txt-exists.ts        | wires up canonical pick + override
src/checks/content-discoverability/llms-txt-{size,valid,...} | use canonical via helper
src/helpers/llms-txt.ts                                      | selection + analysis helpers
src/helpers/get-page-urls.ts                                 | sampling uses canonical
src/checks/observability/cache-header-hygiene.ts             | comment-only (intentionally unchanged)
src/cli/commands/check.ts + src/types.ts                     | --llms-txt-url flag
docs/{checks/content-discoverability,reference/cli,reference/config-file}.md | user docs

A follow-up PR (#55) addresses a related bug discovered while testing this change: our aggregate .txt walker only descends one level, so multi-level nested indexes like Alchemy's chains structure are undercounted by llms-txt-freshness. That fix is intentionally scoped separately.

When a site has both an apex `llms.txt` (often a marketing index) and a docs-section `llms.txt`, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal; see issue agent-ecosystem#53.

This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over `/llms.txt` when the user passed `/docs`, and the other way around when they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`) now operate on that single file via a new `getLlmsTxtFilesForAnalysis` helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets users point afdocs at an explicit canonical, bypassing the heuristic. When set, only that URL is probed and the cross-host fallback is skipped.
- The pass message now surfaces which file was chosen, and `details` exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for scripted consumers.

Made-with: Cursor

SahilAujla mentioned this pull request Apr 23, 2026

Walk nested aggregate llms.txt files recursively #55

Open

4 tasks

SahilAujla wants to merge 1 commit into agent-ecosystem:main from SahilAujla:fix/canonical-llms-txt-selection
