
RSL AI crawler licensing design spec#650

Open
jevansnyc wants to merge 1 commit into main from rsl-ai-crawler-licensing-spec

Conversation

@jevansnyc

Summary

Adds design spec for Trusted Server's RSL-compliant AI crawler detection and licensing enforcement layer.

  • Edge-deployed AI crawler classification using six signals (UA, IP allowlist, JA4, ASN, H2, robots/license fetch correlation)
  • RSL 1.0 standards-compliant license publishing (/license.xml, robots.txt augmentation, Link header)
  • Public license.toml for RSL terms; private license.private.toml for enforcement secrets and commercial overrides
  • Standards-compliant 402/403 enforcement responses with inline RSL fragments
  • Permissive-by-default with per-publisher/per-route Strict override
  • Debug endpoints (/_ts/debug/rsl/summary, /_ts/debug/rsl/recent, /_ts/debug/rsl/license) and structured logging
  • Integrates with existing TS architecture; no changes needed to Edge Cookie, auction orchestrator, consent, or other integrations
  • Phase 2 preview for Open License Protocol (OLP) token-based access

Test plan

  • Review spec for accuracy against current TS infrastructure (integration hooks, JA4 signals, bot gate)
  • Verify RSL usage/payment vocabulary matches RSL 1.0 spec (https://rslstandard.org/rsl)
  • Validate onboarding flow assumptions against an existing TS publisher deployment
  • Confirm binary size estimates (~100 KB additional for IP allowlists + JA4 DB + new code)

🤖 Generated with Claude Code

Trusted Server RSL-compliant AI crawler detection and licensing
enforcement, MVP-ready. Six-signal classification (UA, IP, JA4, ASN,
H2, robots/license.xml correlation), permissive-by-default with strict
override, public license.toml + private license.private.toml split,
standards-compliant 402/403 responses, debug endpoints, structured
logging. Targets publishers already running TS. Phase 2 adds OLP
license server for programmatic token issuance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jevansnyc jevansnyc linked an issue Apr 22, 2026 that may be closed by this pull request
@aram356 aram356 assigned aram356 and jevansnyc and unassigned aram356 Apr 22, 2026
@aram356 aram356 requested a review from prk-Jr April 27, 2026 15:36

@aram356 aram356 left a comment


Summary

Design spec PR — single 1055-line markdown file in docs/superpowers/specs/, no code changes. The proposed RSL-compliant AI crawler licensing layer is well-scoped and the public/private config split is sound, but the spec misrepresents the current state of TS infrastructure and contains two architectural assumptions that don't hold on Fastly Compute. Requesting changes on those before merge.

Blocking

🔧 wrench

  • §3.5 "Existing capability" table is materially inaccurate — JA4 signal, bot gate, /robots.txt handling, and /_ts/debug/* auth pattern are all listed as existing but do not exist in crates/. Reframe as new infrastructure required.
  • §7.2 / §7.3 in-process ring buffer assumes long-lived process state — Fastly Compute WASM instances are short-lived per-request; the debug endpoints can't aggregate without KV/Config Store or external log-stream aggregation. Pick one and document the trade-off.

❓ question

  • §4.1 / §3.5 — How does the WASM instance obtain JA4? Fastly Compute does not expose ClientHello bytes today. Without a concrete acquisition path the entire stealth-detection branch is unimplementable.
  • §6.1 / §8.5 — Link: rel="license" on every response, including ad/RTB/integration responses? §8.5 says integrations are unaffected, but every response gaining a header is a change. Suggest scoping to HTML responses.

Non-blocking

♻️ refactor

  • §3.7 module path — should be crates/trusted-server-core/src/integrations/rsl/, matching every other integration in the project.
  • §3.4 IP allowlist lookup structure unspecified — naive Vec scan over thousands of CIDRs would dominate hot-path latency; specify a radix/trie structure.
  • §6.6 rendered XML drops `contact_url` — `license.toml` defines it, but the example only renders `contactEmail`.
  • §5.5 usage vocabulary missing all — RSL 1.0 defines all, ai-all, ai-train, ai-input, ai-index, search.
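The §3.4 lookup-structure point can be made concrete with a sketch. This is a hypothetical illustration, not the spec's design: it replaces a linear Vec scan with sorted inclusive ranges plus binary search, which gets O(log n) membership checks without pulling in a trie crate (all type and function names here are illustrative):

```rust
use std::net::Ipv4Addr;

/// An IPv4 CIDR block stored as an inclusive [start, end] range.
#[derive(Clone, Copy)]
struct CidrRange {
    start: u32,
    end: u32,
}

/// Build a range from dotted-quad base + prefix length (illustrative helper).
fn cidr(a: u8, b: u8, c: u8, d: u8, prefix: u32) -> CidrRange {
    let base = u32::from(Ipv4Addr::new(a, b, c, d));
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    let start = base & mask;
    CidrRange { start, end: start | !mask }
}

/// Allowlist with O(log n) membership checks instead of a linear scan.
struct Allowlist {
    ranges: Vec<CidrRange>, // sorted by start, assumed non-overlapping
}

impl Allowlist {
    fn new(mut ranges: Vec<CidrRange>) -> Self {
        ranges.sort_by_key(|r| r.start);
        Allowlist { ranges }
    }

    fn contains(&self, ip: Ipv4Addr) -> bool {
        let ip = u32::from(ip);
        // Index just past the last range whose start <= ip.
        let idx = self.ranges.partition_point(|r| r.start <= ip);
        idx > 0 && self.ranges[idx - 1].end >= ip
    }
}

fn main() {
    // Example ranges only — real lists come from the operators' published JSON.
    let list = Allowlist::new(vec![cidr(20, 15, 240, 64, 28), cidr(52, 230, 152, 0, 24)]);
    assert!(list.contains("20.15.240.70".parse().unwrap()));
    assert!(!list.contains("8.8.8.8".parse().unwrap()));
    println!("allowlist checks passed");
}
```

A radix trie is the asymptotically nicer structure, but for a few thousand static CIDRs a sorted-range binary search is typically within noise and far simpler to keep correct.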

🤔 thinking

  • §3.8 / §8.1 "no Fastly-specific dependencies in core" overstates current reality — crates/trusted-server-core/Cargo.toml already has fastly as a non-optional dep; PR #581/#609 are the in-progress abstraction work.
  • §4.1 ASN database not in the §3.9 binary-size budget — MaxMind GeoLite2-ASN is ~10 MB; reconcile with the <100 KB budget.
  • §4.7 mentions 401 but §6.7 matrix doesn't — drop or describe when 401 fires.
  • §4.3 IP-allowlist refresh cadence couples to TS release train — staleness window or KV-based refresh path worth acknowledging.
  • §6.6 RSL max-age is in days, HTTP Cache-Control: max-age in seconds — note the unit difference.
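The unit mismatch in the last bullet is the kind of thing that silently ships wrong; a minimal sketch of the explicit conversion (function name illustrative):

```rust
// RSL expresses max-age in days; HTTP Cache-Control max-age is in seconds.
// Convert explicitly at the point where the header is rendered.
const SECONDS_PER_DAY: u64 = 86_400;

fn cache_control_from_rsl_days(days: u64) -> String {
    format!("max-age={}", days * SECONDS_PER_DAY)
}

fn main() {
    // 7 RSL days → 604800 HTTP seconds
    assert_eq!(cache_control_from_rsl_days(7), "max-age=604800");
    println!("{}", cache_control_from_rsl_days(7));
}
```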

🌱 seedling

  • §4.2 purpose likely belongs on bot identity, not request classification.

📌 out of scope

  • §4.3 control-plane refresh job referenced but not designed — should appear in §2.2 if deferred.

📝 note

  • IP-list URLs (openai.com/{gptbot,searchbot,chatgpt-user}.json) verified live (200 OK).

⛏ nitpick

  • format-docs CI failure is a trivial prettier whitespace diff (asterisk italics → underscore italics, table column padding); fix with cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md.
  • "Trusted Server" / "TS" used interchangeably mid-paragraph; pick one.

CI Status

  • format-docs: FAIL (one-command prettier fix)
  • cargo fmt: PASS
  • cargo clippy: PASS
  • cargo test: PASS
  • vitest: PASS
  • browser/integration tests: PASS
  • CodeQL: PASS

| `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth |
| Structured logging (`log-fastly`) | Classification events emitted as structured log lines |
| Settings (`trusted-server.toml`) | RSL config block added to existing settings parser |



🔧 wrench — "Existing capability" table is materially inaccurate.

Four of six items in this column do not exist in the codebase today (verified by searching crates/):

  • JA4 signal from edge TLS — no JA4 code, no client_hello access, no TLS-fingerprint plumbing anywhere
  • Bot gate (H2 + JA4) — no bot gate exists
  • /robots.txt handling — no robots.txt handler in crates/trusted-server-core/
  • /_ts/debug/* auth pattern — no such route family or token-auth pattern exists

A reader walks away believing the implementation reuses four existing systems. It actually builds them all from scratch — a materially different effort estimate.

Fix: split the table in two:

  • Existing capability — IntegrationRegistration builder, Settings (trusted-server.toml), structured logging
  • New infrastructure required — JA4 acquisition path, bot gate, /robots.txt handler, /_ts/debug/* framework

### 7.3 `GET /_ts/debug/rsl/recent`

Last N classified requests, newest first. Backed by an in-process ring buffer
(no KV writes on hot path). Default 1000 entries, configurable.


🔧 wrench — In-process ring buffer assumes long-lived process state that Fastly Compute does not provide.

Fastly Compute WASM instances are short-lived per-request — there is no in-memory state shared across requests. As specified, /_ts/debug/rsl/recent and /_ts/debug/rsl/summary would only see the single classification of the request that hit the debug endpoint itself.

The spec promises both "no KV writes on the hot path" and "live counters / recent classifications" — these are mutually exclusive on Fastly Compute today.

Fix — pick one:

  1. Pipe /summary through an external aggregator over the structured log stream (Fastly log shipping → S3/BigQuery/Datadog), and document that the debug endpoints are not live edge state.
  2. Commit to KV/Config Store reads/writes on the hot path with the trade-offs §5.1 explicitly defers (availability, eventual consistency, auth, write QPS limits).
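Option 1 can be sketched to show what "not live edge state" means in practice: each short-lived instance emits one structured line per classification and the aggregation happens entirely off-edge over the shipped log stream. This is an illustrative sketch, not the spec's schema — the event fields and JSON shape are assumptions:

```rust
// Hypothetical per-request classification event (field names illustrative).
struct ClassificationEvent<'a> {
    bot: &'a str,
    verdict: &'a str,
    signals: &'a [&'a str],
}

/// Render one JSON log line. A real implementation would use serde;
/// manual formatting keeps this sketch dependency-free.
fn to_log_line(e: &ClassificationEvent) -> String {
    let signals = e
        .signals
        .iter()
        .map(|s| format!("\"{}\"", s))
        .collect::<Vec<_>>()
        .join(",");
    format!(
        "{{\"event\":\"rsl_classification\",\"bot\":\"{}\",\"verdict\":\"{}\",\"signals\":[{}]}}",
        e.bot, e.verdict, signals
    )
}

fn main() {
    let e = ClassificationEvent {
        bot: "GPTBot",
        verdict: "honest_ai_crawler",
        signals: &["ua", "ip"],
    };
    // On Fastly Compute this line would go out via log streaming and be
    // aggregated in S3/BigQuery/Datadog — never held in instance memory.
    println!("{}", to_log_line(&e));
}
```

Under this model `/summary` and `/recent` become queries against the aggregator, which is exactly the trade-off the spec needs to state.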

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |


❓ question — How does the WASM instance actually obtain JA4?

JA4 is the only signal that catches spoofed-UA stealth crawlers; §4.5 ("Stealth Classification Example") and the entire stealth branch in the §6.7 response matrix hinge on it.

Fastly Compute does not expose ClientHello bytes or a precomputed JA4 to the WASM instance today. Without a concrete acquisition path the stealth-detection branch is unimplementable.

Please specify which mechanism is assumed:

  • VCL pre-stage that hashes ClientHello and forwards as Fastly-JA4 request header?
  • A closed-beta / experimental Fastly API?
  • Upstream computation in a different layer?
  • Phase-2 deferred until the edge platform exposes JA4?

Whichever it is, one paragraph describing it would unblock the design.


TS adds the `Link` header on every response so honest crawlers can discover
license terms on any request, not just by fetching `robots.txt` first.


❓ question — `Link: rel="license"` on every response — including non-HTML responses?

§6.1 says "TS adds the Link header on every response." §8.5 says existing integrations "continue working unchanged."

In practice every ad-server response, RTB endpoint, integration proxy response (Permutive, Lockr, Datadome, Didomi), 204 beacon, and OPTIONS preflight will gain this header. Is that intentional?

  • For top-level navigation HTML responses: yes, it's the goal.
  • For JSON RTB bid responses or analytics 204s: it's noise that competes with Link headers used for HTTP/2 push hints, preconnect, etc.

Suggest scoping to Content-Type: text/html (or top-level navigation responses) and explicitly stating the scope in §6.1.
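The suggested scoping can be sketched as a single gate applied before the header is attached. Function names and the 204 carve-out are illustrative assumptions, not the spec's design:

```rust
/// Decide whether this response should carry the RSL license Link header.
/// Hypothetical helper: scope to HTML, skip body-less beacon responses.
fn should_attach_license_link(content_type: Option<&str>, status: u16) -> bool {
    if status == 204 {
        return false; // analytics beacons / preflight-style responses
    }
    content_type
        .map(|ct| {
            ct.split(';') // drop "; charset=utf-8" etc.
                .next()
                .unwrap_or("")
                .trim()
                .eq_ignore_ascii_case("text/html")
        })
        .unwrap_or(false)
}

/// Render the header value (rel="license" per the review's suggestion).
fn license_link_header(license_url: &str) -> String {
    format!("<{}>; rel=\"license\"", license_url)
}

fn main() {
    assert!(should_attach_license_link(Some("text/html; charset=utf-8"), 200));
    assert!(!should_attach_license_link(Some("application/json"), 200));
    assert!(!should_attach_license_link(Some("text/html"), 204));
    println!("{}", license_link_header("https://example.com/license.xml"));
}
```

Whatever predicate §6.1 settles on, making it one named function keeps the RTB/integration paths provably untouched.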

```
├── enforcement.rs # verdict + terms + mode → Action
├── endpoints.rs # /license.xml, /robots.txt augmentation, debug routes
└── logging.rs # structured log emission
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ refactor — Module path inconsistent with project layout.

Spec proposes crates/trusted-server-core/src/rsl/. Every other integration in TS lives under crates/trusted-server-core/src/integrations/{datadome,permutive,lockr,didomi,testlight,…} — verified in the filesystem and CLAUDE.md.

Fix: crates/trusted-server-core/src/integrations/rsl/{mod.rs, classifier.rs, …} for consistency with IntegrationRegistration discovery and the existing module structure.

```rust
Ambiguous {
    signals: Vec<Signal>,
},
}
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌱 seedling — `purpose` likely belongs on bot identity, not request classification.

Classification::HonestAiCrawler { purpose: AiPurpose } — but a bot's declared purpose is a property of its UA (GPTBot = training, ChatGPT-User = in-conversation, OAI-SearchBot = search), not of the individual request. Consider a static BotId → AiPurpose mapping table; the classification then carries bot_identity: BotId and purpose is a derived lookup.

Not blocking — flagging now so the type design isn't locked in before implementation.
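The suggested refactor can be sketched in a few lines. Type names and the specific bot-to-purpose pairings are illustrative (drawn from the comment's GPTBot/ChatGPT-User/OAI-SearchBot examples), not a definitive mapping:

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BotId {
    GptBot,
    ChatGptUser,
    OaiSearchBot,
    ClaudeBot,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum AiPurpose {
    Training,
    UserRequest,
    Search,
}

impl BotId {
    /// Declared purpose is a static property of the bot's UA,
    /// so it is derived by lookup rather than stored per request.
    fn purpose(self) -> AiPurpose {
        match self {
            BotId::GptBot | BotId::ClaudeBot => AiPurpose::Training,
            BotId::ChatGptUser => AiPurpose::UserRequest,
            BotId::OaiSearchBot => AiPurpose::Search,
        }
    }
}

fn main() {
    // Classification would carry bot_identity: BotId; purpose is derived.
    assert_eq!(BotId::GptBot.purpose(), AiPurpose::Training);
    assert_eq!(BotId::OaiSearchBot.purpose(), AiPurpose::Search);
    println!("purpose lookups ok");
}
```

The exhaustive `match` is the payoff: adding a new `BotId` without assigning its purpose becomes a compile error instead of a silent `Ambiguous`.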

(community-maintained list).
- **ASN database:** updated via MaxMind or equivalent on publisher's own
schedule.


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📌 out of scope — Control-plane refresh job is referenced but not designed.

"Fetched by a control-plane job from each operator's published JSON endpoint" — this control plane is new infrastructure outside the WASM edge. Reasonable to defer, but should appear explicitly in §2.2 (Out of Scope) as a dependency for the "publishers always have fresh allowlists" promise. Otherwise the freshness story is implicitly "manual TS release cadence".

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 note — Verified live (200 OK) for the IP-list URLs cited:

  • https://openai.com/gptbot.json
  • https://openai.com/searchbot.json
  • https://openai.com/chatgpt-user.json

No action — recording the verification.

@@ -0,0 +1,1055 @@
# Trusted Server AI Crawler Licensing (RSL-compliant)

*April 2026*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ nitpick — format-docs CI failure is a trivial prettier whitespace diff.

  • Italic syntax: `*April 2026*` → `_April 2026_` (this line)
  • Markdown table column padding (§3.5 and §4.1)

Fix in one command:

cd docs && npx prettier --write superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md

crawlers that spoof user-agent strings.
4. **Publisher-owned config** — single `license.toml` file, version-controlled,
no lock-in to a vendor's dashboard.
5. **Open source** — publishers can audit the enforcement behavior.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ nitpick — "Trusted Server" / "TS" used interchangeably mid-paragraph throughout §1. Pick one and stick with it for readability. CLAUDE.md doesn't enforce a preference, but the project prose elsewhere is consistent.


Development

Successfully merging this pull request may close these issues.

Write Spec for RSL Support