Walk nested aggregate llms.txt files recursively #55
Open
SahilAujla wants to merge 2 commits into agent-ecosystem:main from
Conversation
When a site has both an apex `llms.txt` (often a marketing index) and a docs-section `llms.txt`, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal; see issue agent-ecosystem#53. This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over `/llms.txt` when the user passed `/docs`, and the other way around when they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`) now operate on that single file via a new `getLlmsTxtFilesForAnalysis` helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets users point afdocs at an explicit canonical, bypassing the heuristic. When set, only that URL is probed and the cross-host fallback is skipped.
- The pass message now surfaces which file was chosen, and `details` exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for scripted consumers.

Made-with: Cursor
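The most-specific-prefix heuristic described above can be sketched roughly as follows. This is an illustrative sketch, not the real helper: the function name `pickCanonicalLlmsTxt` and the exact path normalization are assumptions; only the selection rule (longest candidate directory that prefixes the baseUrl path wins, ties keep discovery order) comes from the description.

```typescript
// Hypothetical sketch of the most-specific-prefix-of-the-baseUrl heuristic.
// `pickCanonicalLlmsTxt` is an invented name, not the actual implementation.
function pickCanonicalLlmsTxt(baseUrl: string, candidates: string[]): string | undefined {
  const basePath = new URL(baseUrl).pathname.replace(/\/$/, "");
  let best: string | undefined;
  let bestLen = -1;
  for (const candidate of candidates) {
    // Directory the candidate covers, e.g. "/docs" for "/docs/llms.txt",
    // "" for the apex "/llms.txt".
    const dir = new URL(candidate).pathname.replace(/\/?llms\.txt$/, "");
    // The candidate's directory must be a prefix of the baseUrl path.
    const isPrefix = dir === "" || basePath === dir || basePath.startsWith(dir + "/");
    // Strict ">" means ties resolve to discovery order (first wins).
    if (isPrefix && dir.length > bestLen) {
      best = candidate;
      bestLen = dir.length;
    }
  }
  return best;
}
```

With a base URL of `/docs`, `/docs/llms.txt` beats `/llms.txt`; with the bare origin, only the apex file prefixes the path, so it wins.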
The previous walker only descended one level into aggregate `.txt` files
referenced from llms.txt, with an explicit "skip further .txt nesting"
filter on sub-links. That works for two-level patterns (Cloudflare,
Supabase) but undercounts sites that use deeper progressive disclosure.
Concrete miss: alchemy.com/docs has a three-level structure
(`/docs/llms.txt` → `/docs/chains/llms.txt` → `/docs/chains/{chain}/llms.txt`).
The chain method pages — about 5,100 of them — never made it into the
URL pool, so `llms-txt-freshness` reported 6% coverage instead of 100%
and the freshness check failed for what was actually a well-organized
docs site.
Replace the one-level walk with a bounded BFS:
- Recurse to `MAX_AGGREGATE_DEPTH = 5` (deep enough for realistic trees,
shallow enough to terminate on accidental loops)
- Cap total fetches at `MAX_AGGREGATE_FILES = 200` so a malformed index
can't trigger unbounded HTTP traffic
- Track visited aggregates so the same `.txt` referenced from multiple
parents is fetched once and cycles terminate
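The bounded walk above can be sketched as a small breadth-first traversal. This is a minimal sketch, not the real walker: `walkAggregates` is an invented name, `fetchLinks` is injected in place of real HTTP so the shape is testable offline, and "link ends in `.txt`" stands in for whatever aggregate classification the code actually uses. The two caps and the visited set are the parts taken from the description.

```typescript
// Illustrative bounded BFS over nested llms.txt aggregates (assumed names).
const MAX_AGGREGATE_DEPTH = 5;   // deep enough for realistic trees
const MAX_AGGREGATE_FILES = 200; // hard cap on total fetches

async function walkAggregates(
  seeds: string[],
  fetchLinks: (url: string) => Promise<string[]>, // stand-in for HTTP fetch + parse
): Promise<string[]> {
  const pageUrls = new Set<string>();
  const visited = new Set<string>(); // same .txt from multiple parents fetched once
  let queue = seeds.map((url) => ({ url, depth: 0 }));
  let fetches = 0;

  while (queue.length > 0) {
    const next: typeof queue = [];
    for (const { url, depth } of queue) {
      // Visited tracking also terminates cycles; the fetch cap bounds traffic.
      if (visited.has(url) || fetches >= MAX_AGGREGATE_FILES) continue;
      visited.add(url);
      fetches++;
      for (const link of await fetchLinks(url)) {
        if (link.endsWith(".txt")) {
          // Nested aggregate: keep descending until the depth cap.
          if (depth + 1 < MAX_AGGREGATE_DEPTH) next.push({ url: link, depth: depth + 1 });
        } else {
          pageUrls.add(link); // terminal page URL
        }
      }
    }
    queue = next;
  }
  return [...pageUrls];
}
```

On a three-level tree like alchemy's, the old one-level walk would have dropped everything below the first nested `.txt`; the BFS collects the leaves while the visited set keeps a parent link back to the root from looping.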
On alchemy.com/docs this lifts the page-URL count from 311 → 5,436 and
the overall scorecard from 88 (B) → 92 (A) without any change to their
docs.
Made-with: Cursor
Summary
The aggregate `.txt` walker that powers `llms-txt-freshness` (and any sampling that flows through `llms.txt`) only descended one level into nested indexes, with an explicit "skip further `.txt` nesting" filter on sub-links. That works for two-level patterns (Cloudflare per-product files, Supabase aggregates) but undercounts sites that use deeper progressive disclosure.
The bug
`alchemy.com/docs` has a three-level structure (`/docs/llms.txt` → `/docs/chains/llms.txt` → `/docs/chains/{chain}/llms.txt`). Our walker stopped after Level 1, treating the chain `.txt` files at Level 2 as terminal and dropping them. So `getUrlsFromCachedLlmsTxt` returned only the 311 pages from the 5 sections that happened to have a flat layout (Get Started, Node, Data, Wallets, Rollups). All ~5,100 chain method pages were invisible to us.

The visible symptom was `llms-txt-freshness` reporting 6% coverage (311 / 5,452 sitemap pages) and failing the check, even though Alchemy's `llms.txt` structure is actually exhaustive; we just weren't walking it.

The fix
Replace the one-level walk with a bounded BFS:

- Recurse to `MAX_AGGREGATE_DEPTH = 5` (deep enough for realistic trees, shallow enough to terminate on accidental loops)
- Cap total fetches at `MAX_AGGREGATE_FILES = 200` to prevent unbounded HTTP traffic
- Track visited aggregates so the same `.txt` referenced from multiple parents is fetched once and cycles terminate

The classification logic (page URL vs aggregate) is factored into a small `classify()` closure so the same rules apply to both seed URLs and discovered sub-links.

Real-world evidence: alchemy.com/docs
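A `classify()` closure of the kind described might look roughly like this. This is a hedged sketch: `makeClassifier`, the off-site filter, and the "`.txt` suffix means aggregate" rule are assumptions for illustration; the point is only that one closure classifies seed URLs and discovered sub-links by the same rules.

```typescript
// Hypothetical shape of the shared classification closure (assumed names).
type Kind = "aggregate" | "page";

function makeClassifier(origin: string) {
  const site = new URL(origin).origin;
  return function classify(url: string): Kind | null {
    let parsed: URL;
    try {
      parsed = new URL(url, origin); // resolve relative links against the site
    } catch {
      return null; // unparseable link: ignore
    }
    if (parsed.origin !== site) return null; // off-site link: not walked
    // Assumed rule: .txt files are nested aggregates, everything else is a page.
    return parsed.pathname.endsWith(".txt") ? "aggregate" : "page";
  };
}
```

Building the closure once per run means a rule change (say, recognizing a new aggregate naming scheme) applies uniformly at every depth.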
Run on top of #54 (so canonical selection is correct), the affected checks were:

- `llms.txt`
- `llms-txt-freshness`
- `markdown-content-parity`

Combined with #54, the `alchemy.com/docs` scorecard moves from 68 (D) → 92 (A) without any change to their docs.

Test plan
- 3 new tests in `test/unit/helpers/get-page-urls.test.ts`
- All 849 existing tests pass (852 total with new ones)
- Lint clean
- Verified live against https://www.alchemy.com/docs