Conversation
Trusted Server RSL-compliant AI crawler detection and licensing enforcement, MVP-ready. Six-signal classification (UA, IP, JA4, ASN, H2, robots/license.xml correlation), permissive-by-default with strict override, public license.toml + private license.private.toml split, standards-compliant 402/403 responses, debug endpoints, structured logging. Targets publishers already running TS. Phase 2 adds OLP license server for programmatic token issuance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aram356
left a comment
Summary
Design spec PR — single 1055-line markdown file in docs/superpowers/specs/, no code changes. The proposed RSL-compliant AI crawler licensing layer is well-scoped and the public/private config split is sound, but the spec misrepresents the current state of TS infrastructure and contains two architectural assumptions that don't hold on Fastly Compute. Requesting changes on those before merge.
Blocking
🔧 wrench
- §3.5 "Existing capability" table is materially inaccurate — JA4 signal, bot gate,
/robots.txthandling, and/_ts/debug/*auth pattern are all listed as existing but do not exist incrates/. Reframe as new infrastructure required. - §7.2 / §7.3 in-process ring buffer assumes long-lived process state — Fastly Compute WASM instances are short-lived per-request; the debug endpoints can't aggregate without KV/Config Store or external log-stream aggregation. Pick one and document the trade-off.
❓ question
- §4.1 / §3.5 — How does the WASM instance obtain JA4? Fastly Compute does not expose ClientHello bytes today. Without a concrete acquisition path the entire stealth-detection branch is unimplementable.
- §6.1 / §8.5 — `Link: rel="license"` on every response, including ad/RTB/integration responses? §8.5 says integrations are unaffected, but every response gaining a header is a change. Suggest scoping to HTML responses.
Non-blocking
♻️ refactor
- §3.7 module path — should be `crates/trusted-server-core/src/integrations/rsl/`, matching every other integration in the project.
- §3.4 IP allowlist lookup structure unspecified — a naive Vec scan over thousands of CIDRs would dominate hot-path latency; specify a radix/trie structure (a minimal sketch follows this list).
- §6.6 rendered XML drops `contact_url` — `license.toml` defines it but the example only renders `contactEmail`.
- §5.5 usage vocabulary missing `all` — RSL 1.0 defines `all`, `ai-all`, `ai-train`, `ai-input`, `ai-index`, `search`.
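For the §3.4 lookup, one self-contained shape is bucketing CIDRs by prefix length (rather than a full radix trie), which keeps the hot path at O(number of distinct prefix lengths) per request. A minimal IPv4-only sketch; the type and names are illustrative, not taken from the spec:

```rust
use std::collections::{HashMap, HashSet};
use std::net::Ipv4Addr;

/// Allowlisted CIDRs grouped by prefix length, so a lookup checks at most one
/// hash set per distinct prefix length instead of scanning every network.
struct AllowlistV4 {
    by_prefix_len: HashMap<u8, HashSet<u32>>, // prefix_len -> masked network addresses
}

fn mask(prefix_len: u8) -> u32 {
    if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) }
}

impl AllowlistV4 {
    fn new(cidrs: &[(Ipv4Addr, u8)]) -> Self {
        let mut by_prefix_len: HashMap<u8, HashSet<u32>> = HashMap::new();
        for &(net, len) in cidrs {
            by_prefix_len.entry(len).or_default().insert(u32::from(net) & mask(len));
        }
        Self { by_prefix_len }
    }

    fn contains(&self, ip: Ipv4Addr) -> bool {
        let ip = u32::from(ip);
        self.by_prefix_len
            .iter()
            .any(|(&len, nets)| nets.contains(&(ip & mask(len))))
    }
}
```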
🤔 thinking
- §3.8 / §8.1 "no Fastly-specific dependencies in core" overstates current reality — `crates/trusted-server-core/Cargo.toml` already has `fastly` as a non-optional dep; PR #581/#609 are the in-progress abstraction work.
- §4.1 ASN database not in the §3.9 binary-size budget — MaxMind GeoLite2-ASN is ~10 MB; reconcile with the <100 KB budget.
- §4.7 mentions 401 but §6.7 matrix doesn't — drop or describe when 401 fires.
- §4.3 IP-allowlist refresh cadence couples to TS release train — staleness window or KV-based refresh path worth acknowledging.
- §6.6 RSL `max-age` is in days, HTTP `Cache-Control: max-age` in seconds — note the unit difference.
🌱 seedling
- §4.2 `purpose` likely belongs on bot identity, not request classification.
📌 out of scope
- §4.3 control-plane refresh job referenced but not designed — should appear in §2.2 if deferred.
📝 note
- IP-list URLs (`openai.com/{gptbot,searchbot,chatgpt-user}.json`) verified live (200 OK).
⛏ nitpick
- format-docs CI failure is a trivial prettier whitespace diff (asterisk italics → underscore italics, table column padding); fix with `cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md`.
- "Trusted Server" / "TS" used interchangeably mid-paragraph; pick one.
CI Status
- format-docs: FAIL (one-command prettier fix)
- cargo fmt: PASS
- cargo clippy: PASS
- cargo test: PASS
- vitest: PASS
- browser/integration tests: PASS
- CodeQL: PASS
> | | `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth | |
> | | Structured logging (`log-fastly`) | Classification events emitted as structured log lines | |
> | | Settings (`trusted-server.toml`) | RSL config block added to existing settings parser | |
🔧 wrench — "Existing capability" table is materially inaccurate.
Four of six items in this column do not exist in the codebase today (verified by searching `crates/`):
- JA4 signal from edge TLS — no JA4 code, no `client_hello` access, no TLS-fingerprint plumbing anywhere
- Bot gate (H2 + JA4) — no bot gate exists
- `/robots.txt` handling — no robots.txt handler in `crates/trusted-server-core/`
- `/_ts/debug/*` auth pattern — no such route family or token-auth pattern exists
A reader walks away believing the implementation reuses four existing systems. It actually builds them all from scratch — a materially different effort estimate.
Fix: split the table into two columns:
- Existing capability — `IntegrationRegistration` builder, Settings (`trusted-server.toml`), structured logging
- New infrastructure required — JA4 acquisition path, bot gate, `/robots.txt` handler, `/_ts/debug/*` framework (sketched below)
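Concretely, the split might look like this (row contents are shorthand for the items above, not spec wording):

| Existing capability | New infrastructure required |
| --- | --- |
| `IntegrationRegistration` builder | JA4 acquisition path |
| Settings (`trusted-server.toml`) | Bot gate (H2 + JA4) |
| Structured logging (`log-fastly`) | `/robots.txt` handler |
| | `/_ts/debug/*` framework |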
> ### 7.3 `GET /_ts/debug/rsl/recent`
>
> Last N classified requests, newest first. Backed by an in-process ring buffer
> (no KV writes on hot path). Default 1000 entries, configurable.
🔧 wrench — In-process ring buffer assumes long-lived process state that Fastly Compute does not provide.
Fastly Compute WASM instances are short-lived per-request — there is no in-memory state shared across requests. As specified, /_ts/debug/rsl/recent and /_ts/debug/rsl/summary would only see the single classification of the request that hit the debug endpoint itself.
The spec promises both "no KV writes on the hot path" and "live counters / recent classifications" — these are mutually exclusive on Fastly Compute today.
Fix — pick one:
- Pipe `/summary` through an external aggregator over the structured log stream (Fastly log shipping → S3/BigQuery/Datadog), and document that the debug endpoints are not live edge state (emission sketched after this list).
- Commit to KV/Config Store reads/writes on the hot path, with the trade-offs §5.1 explicitly defers (availability, eventual consistency, auth, write QPS limits).
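For the first option, the hot path only has to emit one structured line per classification and all aggregation happens off-edge. A rough sketch using the `fastly` crate's named log endpoints — the endpoint name and field names are placeholders, not spec decisions, and the spec's `log-fastly` wrapper would work just as well:

```rust
use std::io::Write;

use fastly::log::Endpoint;
use serde_json::json;

/// Emit one structured classification event per request. /summary-style
/// counters are then built off-edge (e.g. BigQuery/Datadog over the log stream).
fn log_classification(verdict: &str, bot: Option<&str>, ua: &str, path: &str) {
    // "rsl_classifications" is a placeholder endpoint name configured in Fastly.
    let mut endpoint = Endpoint::from_name("rsl_classifications");
    let event = json!({
        "event": "rsl_classification",
        "verdict": verdict, // e.g. "honest_ai_crawler", "stealth", "ambiguous"
        "bot": bot,
        "user_agent": ua,
        "path": path
    });
    // Best-effort: a logging failure must never change the response.
    let _ = writeln!(endpoint, "{}", event);
}
```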
> |---|---|---|---|---|
> | 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
> | 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
> | 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |
❓ question — How does the WASM instance actually obtain JA4?
JA4 is the only signal that catches spoofed-UA stealth crawlers; §4.5 ("Stealth Classification Example") and the entire stealth branch in the §6.7 response matrix hinge on it.
Fastly Compute does not expose ClientHello bytes or a precomputed JA4 to the WASM instance today. Without a concrete acquisition path the stealth-detection branch is unimplementable.
Please specify which mechanism is assumed:
- A VCL pre-stage that hashes the ClientHello and forwards it as a `Fastly-JA4` request header?
- A closed-beta / experimental Fastly API?
- Upstream computation in a different layer?
- Phase-2 deferred until the edge platform exposes JA4?
Whichever it is, one paragraph describing it would unblock the design.
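For reference, if the answer is the VCL pre-stage, the WASM side reduces to a header read plus a trust check. A sketch assuming a hypothetical `Fastly-JA4` header — not a real Fastly feature today:

```rust
use fastly::Request;

/// Hypothetical: JA4 is computed by a VCL (or other upstream) stage and
/// forwarded as a request header. The pre-stage must strip any client-supplied
/// value so a crawler cannot inject its own fingerprint.
fn ja4_fingerprint(req: &Request) -> Option<String> {
    // Treat the signal as unavailable when the header is absent (e.g. a
    // deployment without the pre-stage in front of it).
    req.get_header_str("fastly-ja4").map(str::to_owned)
}
```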
> TS adds the `Link` header on every response so honest crawlers can discover
> license terms on any request, not just by fetching `robots.txt` first.
❓ question — Link: rel="license" on every response — including non-HTML responses?
§6.1 says "TS adds the Link header on every response." §8.5 says existing integrations "continue working unchanged."
In practice every ad-server response, RTB endpoint, integration proxy response (Permutive, Lockr, Datadome, Didomi), 204 beacon, and OPTIONS preflight will gain this header. Is that intentional?
- For top-level navigation HTML responses: yes, it's the goal.
- For JSON RTB bid responses or analytics 204s: it's noise that competes with `Link` headers used for HTTP/2 push hints, preconnect, etc.

Suggest scoping to `Content-Type: text/html` (or top-level navigation responses) and explicitly stating the scope in §6.1.
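A sketch of the scoping in the response path — the helper name and the header value are illustrative; the real URL would come from `license.toml`:

```rust
use fastly::http::header;
use fastly::Response;

/// Attach the RSL discovery header only to HTML documents, leaving RTB/JSON
/// responses and analytics beacons untouched.
fn add_license_link(resp: &mut Response) {
    let is_html = resp
        .get_header_str(header::CONTENT_TYPE)
        .map(|ct| ct.starts_with("text/html"))
        .unwrap_or(false);
    if is_html {
        resp.set_header(
            header::LINK,
            "<https://publisher.example/license.xml>; rel=\"license\"",
        );
    }
}
```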
> ├── enforcement.rs   # verdict + terms + mode → Action
> ├── endpoints.rs     # /license.xml, /robots.txt augmentation, debug routes
> └── logging.rs       # structured log emission
♻️ refactor — Module path inconsistent with project layout.
Spec proposes `crates/trusted-server-core/src/rsl/`. Every other integration in TS lives under `crates/trusted-server-core/src/integrations/{datadome,permutive,lockr,didomi,testlight,…}` — verified in the filesystem and CLAUDE.md.
Fix: `crates/trusted-server-core/src/integrations/rsl/{mod.rs, classifier.rs, …}` for consistency with `IntegrationRegistration` discovery and the existing module structure.
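For concreteness, the corresponding `mod.rs` might look like the sketch below (module names taken from the spec's file list plus the `classifier` module named above; purely illustrative):

```rust
// crates/trusted-server-core/src/integrations/rsl/mod.rs (sketch)
pub mod classifier;   // six-signal classification
pub mod enforcement;  // verdict + terms + mode → Action
pub mod endpoints;    // /license.xml, /robots.txt augmentation, debug routes
pub mod logging;      // structured log emission
```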
>     Ambiguous {
>         signals: Vec<Signal>,
>     },
> }
🌱 seedling — `purpose` likely belongs on bot identity, not request classification.
`Classification::HonestAiCrawler { purpose: AiPurpose }` — but a bot's declared purpose is a property of its UA (GPTBot = training, ChatGPT-User = in-conversation, OAI-SearchBot = search), not of the individual request. Consider a static `BotId` → `AiPurpose` mapping table; the classification then carries `bot_identity: BotId` and `purpose` is a derived lookup.
Not blocking — flagging now so the type design isn't locked in before implementation.
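A sketch of the suggested shape — variant names and the mapping are illustrative, limited to the three examples above:

```rust
/// Crawler identity established by UA + published IP allowlist (and JA4 where available).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BotId {
    GptBot,
    ChatGptUser,
    OaiSearchBot,
    // … one variant per recognized crawler
}

/// Declared purpose, aligned with the RSL usage vocabulary.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AiPurpose {
    Training,
    InConversation,
    Search,
}

/// Purpose is a static property of the bot identity, not of the individual request.
fn purpose_of(bot: BotId) -> AiPurpose {
    match bot {
        BotId::GptBot => AiPurpose::Training,
        BotId::ChatGptUser => AiPurpose::InConversation,
        BotId::OaiSearchBot => AiPurpose::Search,
    }
}
```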
> (community-maintained list).
> - **ASN database:** updated via Maxmind or equivalent on publisher's own
>   schedule.
📌 out of scope — Control-plane refresh job is referenced but not designed.
"Fetched by a control-plane job from each operator's published JSON endpoint" — this control plane is new infrastructure outside the WASM edge. Reasonable to defer, but should appear explicitly in §2.2 (Out of Scope) as a dependency for the "publishers always have fresh allowlists" promise. Otherwise the freshness story is implicitly "manual TS release cadence".
> | # | Signal | Source | Strength | Coverage |
> |---|---|---|---|---|
> | 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
> | 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
📝 note — Verified live (200 OK) for the IP-list URLs cited:
- https://openai.com/gptbot.json
- https://openai.com/searchbot.json
- https://openai.com/chatgpt-user.json
No action — recording the verification.
> @@ -0,0 +1,1055 @@
> # Trusted Server AI Crawler Licensing (RSL-compliant)
>
> *April 2026*
⛏ nitpick — format-docs CI failure is a trivial prettier whitespace diff.
- Italic syntax: `*April 2026*` → `_April 2026_` (this line)
- Markdown table column padding (§3.5 and §4.1)
Fix in one command: `cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md`

> crawlers that spoof user-agent strings.
> 4. **Publisher-owned config** — single `license.toml` file, version-controlled,
>    no lock-in to a vendor's dashboard.
> 5. **Open source** — publishers can audit the enforcement behavior.
⛏ nitpick — "Trusted Server" / "TS" used interchangeably mid-paragraph throughout §1. Pick one and stick with it for readability. CLAUDE.md doesn't enforce, but the project prose elsewhere is consistent.
Summary
Adds design spec for Trusted Server's RSL-compliant AI crawler detection and licensing enforcement layer.
- `/license.xml`, `robots.txt` augmentation, `Link` header
- Public `license.toml` for RSL terms; private `license.private.toml` for enforcement secrets and commercial overrides
- Debug endpoints (`/_ts/debug/rsl/summary`, `/_ts/debug/rsl/recent`, `/_ts/debug/rsl/license`) and structured logging

Test plan
🤖 Generated with Claude Code