
RSL AI crawler licensing design spec#650

Open
jevansnyc wants to merge 1 commit into main from rsl-ai-crawler-licensing-spec

Conversation

@jevansnyc

Summary

Adds design spec for Trusted Server's RSL-compliant AI crawler detection and licensing enforcement layer.

  • Edge-deployed AI crawler classification using six signals (UA, IP allowlist, JA4, ASN, H2, robots/license fetch correlation)
  • RSL 1.0 standards-compliant license publishing (/license.xml, robots.txt augmentation, Link header)
  • Public license.toml for RSL terms; private license.private.toml for enforcement secrets and commercial overrides
  • Standards-compliant 402/403 enforcement responses with inline RSL fragments
  • Permissive-by-default with per-publisher/per-route Strict override
  • Debug endpoints (/_ts/debug/rsl/summary, /_ts/debug/rsl/recent, /_ts/debug/rsl/license) and structured logging
  • Integrates with existing TS architecture; no changes needed to Edge Cookie, auction orchestrator, consent, or other integrations
  • Phase 2 preview for Open License Protocol (OLP) token-based access

Test plan

  • Review spec for accuracy against current TS infrastructure (integration hooks, JA4 signals, bot gate)
  • Verify RSL usage/payment vocabulary matches RSL 1.0 spec (https://rslstandard.org/rsl)
  • Validate onboarding flow assumptions against an existing TS publisher deployment
  • Confirm binary size estimates (~100 KB additional for IP allowlists + JA4 DB + new code)

🤖 Generated with Claude Code

Trusted Server RSL-compliant AI crawler detection and licensing
enforcement, MVP-ready. Six-signal classification (UA, IP, JA4, ASN,
H2, robots/license.xml correlation), permissive-by-default with strict
override, public license.toml + private license.private.toml split,
standards-compliant 402/403 responses, debug endpoints, structured
logging. Targets publishers already running TS. Phase 2 adds OLP
license server for programmatic token issuance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jevansnyc jevansnyc linked an issue Apr 22, 2026 that may be closed by this pull request
@aram356 aram356 assigned aram356 and jevansnyc and unassigned aram356 Apr 22, 2026
@aram356 aram356 requested a review from prk-Jr April 27, 2026 15:36

@aram356 aram356 left a comment


Summary

Design spec PR — single 1055-line markdown file in docs/superpowers/specs/, no code changes. The proposed RSL-compliant AI crawler licensing layer is well-scoped and the public/private config split is sound, but the spec misrepresents the current state of TS infrastructure and contains two architectural assumptions that don't hold on Fastly Compute. Requesting changes on those before merge.

Blocking

🔧 wrench

  • §3.5 "Existing capability" table is materially inaccurate — JA4 signal, bot gate, /robots.txt handling, and /_ts/debug/* auth pattern are all listed as existing but do not exist in crates/. Reframe as new infrastructure required.
  • §7.2 / §7.3 in-process ring buffer assumes long-lived process state — Fastly Compute WASM instances are short-lived per-request; the debug endpoints can't aggregate without KV/Config Store or external log-stream aggregation. Pick one and document the trade-off.

❓ question

  • §4.1 / §3.5 — How does the WASM instance obtain JA4? Fastly Compute does not expose ClientHello bytes today. Without a concrete acquisition path the entire stealth-detection branch is unimplementable.
  • §6.1 / §8.5 — Link: rel="license" on every response, including ad/RTB/integration responses? §8.5 says integrations are unaffected, but every response gaining a header is a change. Suggest scoping to HTML responses.

Non-blocking

♻️ refactor

  • §3.7 module path — should be crates/trusted-server-core/src/integrations/rsl/, matching every other integration in the project.
  • §3.4 IP allowlist lookup structure unspecified — naive Vec scan over thousands of CIDRs would dominate hot-path latency; specify a radix/trie structure.
  • §6.6 rendered XML drops `contact_url` — `license.toml` defines it, but the example only renders `contactEmail`.
  • §5.5 usage vocabulary missing all — RSL 1.0 defines all, ai-all, ai-train, ai-input, ai-index, search.
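The §3.4 lookup-structure point can be made concrete with a sketch. This is a hypothetical illustration, not the spec's design: it replaces a linear Vec scan with sorted inclusive ranges plus binary search, which gets O(log n) membership checks without pulling in a trie crate (all type and function names here are illustrative):

```rust
use std::net::Ipv4Addr;

/// An IPv4 CIDR block stored as an inclusive [start, end] range.
#[derive(Clone, Copy)]
struct CidrRange {
    start: u32,
    end: u32,
}

/// Build a range from dotted-quad base + prefix length (illustrative helper).
fn cidr(a: u8, b: u8, c: u8, d: u8, prefix: u32) -> CidrRange {
    let base = u32::from(Ipv4Addr::new(a, b, c, d));
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    let start = base & mask;
    CidrRange { start, end: start | !mask }
}

/// Allowlist with O(log n) membership checks instead of a linear scan.
struct Allowlist {
    ranges: Vec<CidrRange>, // sorted by start, assumed non-overlapping
}

impl Allowlist {
    fn new(mut ranges: Vec<CidrRange>) -> Self {
        ranges.sort_by_key(|r| r.start);
        Allowlist { ranges }
    }

    fn contains(&self, ip: Ipv4Addr) -> bool {
        let ip = u32::from(ip);
        // Index just past the last range whose start <= ip.
        let idx = self.ranges.partition_point(|r| r.start <= ip);
        idx > 0 && self.ranges[idx - 1].end >= ip
    }
}

fn main() {
    // Example ranges only — real lists come from the operators' published JSON.
    let list = Allowlist::new(vec![cidr(20, 15, 240, 64, 28), cidr(52, 230, 152, 0, 24)]);
    assert!(list.contains("20.15.240.70".parse().unwrap()));
    assert!(!list.contains("8.8.8.8".parse().unwrap()));
    println!("allowlist checks passed");
}
```

A radix trie is the asymptotically nicer structure, but for a few thousand static CIDRs a sorted-range binary search is typically within noise and far simpler to keep correct.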

🤔 thinking

  • §3.8 / §8.1 "no Fastly-specific dependencies in core" overstates current reality — crates/trusted-server-core/Cargo.toml already has fastly as a non-optional dep; PR #581/#609 are the in-progress abstraction work.
  • §4.1 ASN database not in the §3.9 binary-size budget — MaxMind GeoLite2-ASN is ~10 MB; reconcile with the <100 KB budget.
  • §4.7 mentions 401 but §6.7 matrix doesn't — drop or describe when 401 fires.
  • §4.3 IP-allowlist refresh cadence couples to TS release train — staleness window or KV-based refresh path worth acknowledging.
  • §6.6 RSL max-age is in days, HTTP Cache-Control: max-age in seconds — note the unit difference.
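The unit mismatch in the last bullet is the kind of thing that silently ships wrong; a minimal sketch of the explicit conversion (function name illustrative):

```rust
// RSL expresses max-age in days; HTTP Cache-Control max-age is in seconds.
// Convert explicitly at the point where the header is rendered.
const SECONDS_PER_DAY: u64 = 86_400;

fn cache_control_from_rsl_days(days: u64) -> String {
    format!("max-age={}", days * SECONDS_PER_DAY)
}

fn main() {
    // 7 RSL days → 604800 HTTP seconds
    assert_eq!(cache_control_from_rsl_days(7), "max-age=604800");
    println!("{}", cache_control_from_rsl_days(7));
}
```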

🌱 seedling

  • §4.2 purpose likely belongs on bot identity, not request classification.

📌 out of scope

  • §4.3 control-plane refresh job referenced but not designed — should appear in §2.2 if deferred.

📝 note

  • IP-list URLs (openai.com/{gptbot,searchbot,chatgpt-user}.json) verified live (200 OK).

⛏ nitpick

  • format-docs CI failure is a trivial prettier whitespace diff (asterisk italics → underscore italics, table column padding); fix with cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md.
  • "Trusted Server" / "TS" used interchangeably mid-paragraph; pick one.

CI Status

  • format-docs: FAIL (one-command prettier fix)
  • cargo fmt: PASS
  • cargo clippy: PASS
  • cargo test: PASS
  • vitest: PASS
  • browser/integration tests: PASS
  • CodeQL: PASS

| `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth |
| Structured logging (`log-fastly`) | Classification events emitted as structured log lines |
| Settings (`trusted-server.toml`) | RSL config block added to existing settings parser |



🔧 wrench — "Existing capability" table is materially inaccurate.

Four of six items in this column do not exist in the codebase today (verified by searching crates/):

  • JA4 signal from edge TLS — no JA4 code, no client_hello access, no TLS-fingerprint plumbing anywhere
  • Bot gate (H2 + JA4) — no bot gate exists
  • /robots.txt handling — no robots.txt handler in crates/trusted-server-core/
  • /_ts/debug/* auth pattern — no such route family or token-auth pattern exists

A reader walks away believing the implementation reuses four existing systems. It actually builds them all from scratch — a materially different effort estimate.

Fix: split the table in two:

  • Existing capability — IntegrationRegistration builder, Settings (trusted-server.toml), structured logging
  • New infrastructure required — JA4 acquisition path, bot gate, /robots.txt handler, /_ts/debug/* framework

### 7.3 `GET /_ts/debug/rsl/recent`

Last N classified requests, newest first. Backed by an in-process ring buffer
(no KV writes on hot path). Default 1000 entries, configurable.


🔧 wrench — In-process ring buffer assumes long-lived process state that Fastly Compute does not provide.

Fastly Compute WASM instances are short-lived per-request — there is no in-memory state shared across requests. As specified, /_ts/debug/rsl/recent and /_ts/debug/rsl/summary would only see the single classification of the request that hit the debug endpoint itself.

The spec promises both "no KV writes on the hot path" and "live counters / recent classifications" — these are mutually exclusive on Fastly Compute today.

Fix — pick one:

  1. Pipe /summary through an external aggregator over the structured log stream (Fastly log shipping → S3/BigQuery/Datadog), and document that the debug endpoints are not live edge state.
  2. Commit to KV/Config Store reads/writes on the hot path with the trade-offs §5.1 explicitly defers (availability, eventual consistency, auth, write QPS limits).
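Option 1 can be sketched to show what "not live edge state" means in practice: each short-lived instance emits one structured line per classification and the aggregation happens entirely off-edge over the shipped log stream. This is an illustrative sketch, not the spec's schema — the event fields and JSON shape are assumptions:

```rust
// Hypothetical per-request classification event (field names illustrative).
struct ClassificationEvent<'a> {
    bot: &'a str,
    verdict: &'a str,
    signals: &'a [&'a str],
}

/// Render one JSON log line. A real implementation would use serde;
/// manual formatting keeps this sketch dependency-free.
fn to_log_line(e: &ClassificationEvent) -> String {
    let signals = e
        .signals
        .iter()
        .map(|s| format!("\"{}\"", s))
        .collect::<Vec<_>>()
        .join(",");
    format!(
        "{{\"event\":\"rsl_classification\",\"bot\":\"{}\",\"verdict\":\"{}\",\"signals\":[{}]}}",
        e.bot, e.verdict, signals
    )
}

fn main() {
    let e = ClassificationEvent {
        bot: "GPTBot",
        verdict: "honest_ai_crawler",
        signals: &["ua", "ip"],
    };
    // On Fastly Compute this line would go out via log streaming and be
    // aggregated in S3/BigQuery/Datadog — never held in instance memory.
    println!("{}", to_log_line(&e));
}
```

Under this model `/summary` and `/recent` become queries against the aggregator, which is exactly the trade-off the spec needs to state.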

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |


❓ question — How does the WASM instance actually obtain JA4?

JA4 is the only signal that catches spoofed-UA stealth crawlers; §4.5 ("Stealth Classification Example") and the entire stealth branch in the §6.7 response matrix hinge on it.

Fastly Compute does not expose ClientHello bytes or a precomputed JA4 to the WASM instance today. Without a concrete acquisition path the stealth-detection branch is unimplementable.

Please specify which mechanism is assumed:

  • VCL pre-stage that hashes ClientHello and forwards as Fastly-JA4 request header?
  • A closed-beta / experimental Fastly API?
  • Upstream computation in a different layer?
  • Phase-2 deferred until the edge platform exposes JA4?

Whichever it is, one paragraph describing it would unblock the design.


TS adds the `Link` header on every response so honest crawlers can discover
license terms on any request, not just by fetching `robots.txt` first.


❓ question — `Link: rel="license"` on every response — including non-HTML responses?

§6.1 says "TS adds the Link header on every response." §8.5 says existing integrations "continue working unchanged."

In practice every ad-server response, RTB endpoint, integration proxy response (Permutive, Lockr, Datadome, Didomi), 204 beacon, and OPTIONS preflight will gain this header. Is that intentional?

  • For top-level navigation HTML responses: yes, it's the goal.
  • For JSON RTB bid responses or analytics 204s: it's noise that competes with Link headers used for HTTP/2 push hints, preconnect, etc.

Suggest scoping to Content-Type: text/html (or top-level navigation responses) and explicitly stating the scope in §6.1.
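The suggested scoping can be sketched as a single gate applied before the header is attached. Function names and the 204 carve-out are illustrative assumptions, not the spec's design:

```rust
/// Decide whether this response should carry the RSL license Link header.
/// Hypothetical helper: scope to HTML, skip body-less beacon responses.
fn should_attach_license_link(content_type: Option<&str>, status: u16) -> bool {
    if status == 204 {
        return false; // analytics beacons / preflight-style responses
    }
    content_type
        .map(|ct| {
            ct.split(';') // drop "; charset=utf-8" etc.
                .next()
                .unwrap_or("")
                .trim()
                .eq_ignore_ascii_case("text/html")
        })
        .unwrap_or(false)
}

/// Render the header value (rel="license" per the review's suggestion).
fn license_link_header(license_url: &str) -> String {
    format!("<{}>; rel=\"license\"", license_url)
}

fn main() {
    assert!(should_attach_license_link(Some("text/html; charset=utf-8"), 200));
    assert!(!should_attach_license_link(Some("application/json"), 200));
    assert!(!should_attach_license_link(Some("text/html"), 204));
    println!("{}", license_link_header("https://example.com/license.xml"));
}
```

Whatever predicate §6.1 settles on, making it one named function keeps the RTB/integration paths provably untouched.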

```
├── enforcement.rs # verdict + terms + mode → Action
├── endpoints.rs # /license.xml, /robots.txt augmentation, debug routes
└── logging.rs # structured log emission
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ refactor — Module path inconsistent with project layout.

Spec proposes crates/trusted-server-core/src/rsl/. Every other integration in TS lives under crates/trusted-server-core/src/integrations/{datadome,permutive,lockr,didomi,testlight,…} — verified in the filesystem and CLAUDE.md.

Fix: crates/trusted-server-core/src/integrations/rsl/{mod.rs, classifier.rs, …} for consistency with IntegrationRegistration discovery and the existing module structure.

```rust
Ambiguous {
    signals: Vec<Signal>,
},
}
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌱 seedling — `purpose` likely belongs on bot identity, not request classification.

Classification::HonestAiCrawler { purpose: AiPurpose } — but a bot's declared purpose is a property of its UA (GPTBot = training, ChatGPT-User = in-conversation, OAI-SearchBot = search), not of the individual request. Consider a static BotId → AiPurpose mapping table; the classification then carries bot_identity: BotId and purpose is a derived lookup.

Not blocking — flagging now so the type design isn't locked in before implementation.
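The suggested refactor can be sketched in a few lines. Type names and the specific bot-to-purpose pairings are illustrative (drawn from the comment's GPTBot/ChatGPT-User/OAI-SearchBot examples), not a definitive mapping:

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BotId {
    GptBot,
    ChatGptUser,
    OaiSearchBot,
    ClaudeBot,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum AiPurpose {
    Training,
    UserRequest,
    Search,
}

impl BotId {
    /// Declared purpose is a static property of the bot's UA,
    /// so it is derived by lookup rather than stored per request.
    fn purpose(self) -> AiPurpose {
        match self {
            BotId::GptBot | BotId::ClaudeBot => AiPurpose::Training,
            BotId::ChatGptUser => AiPurpose::UserRequest,
            BotId::OaiSearchBot => AiPurpose::Search,
        }
    }
}

fn main() {
    // Classification would carry bot_identity: BotId; purpose is derived.
    assert_eq!(BotId::GptBot.purpose(), AiPurpose::Training);
    assert_eq!(BotId::OaiSearchBot.purpose(), AiPurpose::Search);
    println!("purpose lookups ok");
}
```

The exhaustive `match` is the payoff: adding a new `BotId` without assigning its purpose becomes a compile error instead of a silent `Ambiguous`.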

(community-maintained list).
- **ASN database:** updated via MaxMind or equivalent on publisher's own
schedule.


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📌 out of scope — Control-plane refresh job is referenced but not designed.

"Fetched by a control-plane job from each operator's published JSON endpoint" — this control plane is new infrastructure outside the WASM edge. Reasonable to defer, but should appear explicitly in §2.2 (Out of Scope) as a dependency for the "publishers always have fresh allowlists" promise. Otherwise the freshness story is implicitly "manual TS release cadence".

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 note — Verified live (200 OK) for the IP-list URLs cited:

  • https://openai.com/gptbot.json
  • https://openai.com/searchbot.json
  • https://openai.com/chatgpt-user.json

No action — recording the verification.

@@ -0,0 +1,1055 @@
# Trusted Server AI Crawler Licensing (RSL-compliant)

*April 2026*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ nitpick — format-docs CI failure is a trivial prettier whitespace diff.

  • Italic syntax: `*April 2026*` → `_April 2026_` (this line)
  • Markdown table column padding (§3.5 and §4.1)

Fix in one command:

cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md

crawlers that spoof user-agent strings.
4. **Publisher-owned config** — single `license.toml` file, version-controlled,
no lock-in to a vendor's dashboard.
5. **Open source** — publishers can audit the enforcement behavior.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ nitpick — "Trusted Server" / "TS" used interchangeably mid-paragraph throughout §1. Pick one and stick with it for readability. CLAUDE.md doesn't enforce a preference, but the project prose elsewhere is consistent.


Development

Successfully merging this pull request may close these issues.

Write Spec for RSL Support