Authority discovery Phase 4 — SSRF-safe fetch, verify+license gate, agentic locator (#1993)#2000
JSv4 wants to merge 6 commits into
Part A: opencontractserver/constants/safe_http.py + opencontractserver/utils/safe_http.py
- PUBLIC_DOMAIN_SOURCE_HOSTS frozenset (9 gov hosts), ALLOWED_SCHEMES (https only), MAX_REDIRECTS, CONNECT_TIMEOUT_SECONDS, READ_TIMEOUT_SECONDS, MAX_RESPONSE_BYTES.
- SSRFValidationError(ValueError), host_on_allowlist(), _assert_public_ip() (rejects private/loopback/link-local/multicast/reserved/unspecified from ANY resolved address), validate_url(), safe_fetch_bytes() (manual redirect loop re-validating every hop, streamed size cap, timeouts), safe_fetch_text() (UTF-8 decode wrapper).
- 24 tests in opencontractserver/tests/test_safe_http.py (no DB).

Part B: frontier gate states + audit trail
- Appended blocked_license / unlocated / pending_approval to AuthorityFrontier.DISCOVERY_STATE_CHOICES (choices-only, all fit max_length=32).
- Migration 0088_authorityfrontier_gate_states (DB no-op AlterField).
- AuthorityFrontierService.mark() gains candidate_record kwarg: append-only audit trail written to candidate_sources; existing callers unaffected.
- 7 new tests in test_authority_frontier.py covering append-only semantics and the three new state values.
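The Part A checks (https-only scheme, host allowlist, rejection when ANY resolved address is non-public) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the allowlist entry `a.example.gov`, the `resolver` parameter, and the function bodies are assumptions made so the example runs without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical values for illustration; the PR's real constants are not shown here.
ALLOWED_SCHEMES = frozenset({"https"})
EXAMPLE_ALLOWLIST = frozenset({"a.example.gov"})


class SSRFValidationError(ValueError):
    """Raised when a URL fails the scheme, host, or address checks."""


def assert_public_ip(addresses):
    """Reject the whole set if ANY address is not publicly routable."""
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified):
            raise SSRFValidationError(f"non-public address: {ip}")


def resolve_all(host):
    """Return every address the system resolver reports for the host."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def validate_url(url, allowlist=EXAMPLE_ALLOWLIST, resolver=resolve_all):
    """Check scheme, then host, then every resolved address; return the URL."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise SSRFValidationError(f"scheme not allowed: {parts.scheme!r}")
    host = (parts.hostname or "").lower()
    if host not in allowlist:
        raise SSRFValidationError(f"host not on allowlist: {host!r}")
    assert_public_ip(resolver(host))
    return url
```

Passing the resolver as a parameter keeps the multi-A-record case testable: a stub that returns one public and one private address should be rejected as a whole.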
- us_code_provider: replace urllib.request.urlopen + manual chunked read in _load_title_xml with safe_fetch_bytes; remove now-redundant _HTTP_TIMEOUT and _MAX_DOWNLOAD_BYTES (safe_fetch enforces its own timeout + 500 MB cap). Import safe_fetch_bytes at module level so tests can patch it cleanly.
- cfr_provider: replace requests.get + response.content in _fetch_impl with safe_fetch_bytes; remove requests import. safe_fetch_bytes handles timeout, redirect validation, and size cap centrally.
- federal_register_provider: keep steps 1+2 on requests (step 1 reads the Location header without following it, step 2 hits a fixed trusted API host). Replace the step-3 requests.get on the externally-supplied raw_text_url with safe_fetch_text; remove the manual _FR_ALLOWED_HOST urlparse check (the allowlist is now enforced centrally; SSRFValidationError propagates into the existing broad except, degrading to abstract as before).

Tests: update HTTP-seam tests to patch safe_fetch_bytes / safe_fetch_text instead of requests.get / urllib.request.urlopen. All parse assertions unchanged. 110 tests pass; black/isort/flake8/mypy all clean.
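What the providers delegate to the central helper (a manual redirect loop that re-validates every hop, a streamed size cap, and Content-Length handling) might look like the sketch below. It is an assumption-laden illustration: `send`, `check_url`, `FetchRejected`, and the constant values are hypothetical, and the transport is injected so the loop can be exercised without HTTP.

```python
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 5          # hypothetical value
MAX_RESPONSE_BYTES = 1024  # hypothetical value, tiny for demonstration


class FetchRejected(ValueError):
    """Raised when a hop or a response violates the fetch policy."""


def check_url(url, allowlist):
    """Simplified per-hop validation: https scheme and allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme != "https" or (parts.hostname or "").lower() not in allowlist:
        raise FetchRejected(f"rejected hop: {url}")


def fetch_bytes(url, send, allowlist,
                max_redirects=MAX_REDIRECTS, max_bytes=MAX_RESPONSE_BYTES):
    """send(url) -> (status, headers, chunks); it must not follow redirects."""
    for _ in range(max_redirects + 1):
        check_url(url, allowlist)  # every hop is validated, not just the first
        status, headers, chunks = send(url)
        if status in (301, 302, 303, 307, 308):
            location = headers.get("location")
            if not location:
                raise FetchRejected("redirect without Location header")
            url = urljoin(url, location)  # resolve relative redirects
            continue
        declared = headers.get("content-length")
        if declared is not None:
            try:
                declared_len = int(declared)
            except ValueError:
                raise FetchRejected("malformed Content-Length") from None
            if declared_len > max_bytes:
                raise FetchRejected("declared length exceeds size cap")
        body = bytearray()
        for chunk in chunks:  # enforce the cap while streaming
            body.extend(chunk)
            if len(body) > max_bytes:
                raise FetchRejected("response exceeds size cap")
        return bytes(body), url
    raise FetchRejected("too many redirects")
```

Because validation happens at the top of the loop, a redirect from an allowlisted host to any other host is rejected before a request is sent to it.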
…ted)

Implements Phase 4 of authority discovery: a bounded tool-using LLM agent that locates official public-domain authority text when no deterministic provider can handle a canonical key.
- enabled=False by default (opt-in per deployment)
- priority=9999 (absolute last resort after deterministic providers)
- requires_approval=True → gate always parks at pending_approval, never auto-ingests; no Document is created without human approval
- Privacy: only citation + jurisdiction passed to agent (never doc text)
- SSRF-safe tools: fetch tool routes through safe_fetch_text; non-allowlisted hosts return [blocked:] string rather than raising
- agent.run pattern mirrors Phase 2 (llm_citation_extractor.py): resolve_model_spec → abuild_agent_model → make_pydantic_ai_agent with output_type=_LocatorOutput; result.output is the validated instance
- 17 tests (unit + integration), all mock _run_agent so no LLM/network calls

Closes #1993
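The "return a [blocked:] string rather than raising" behaviour of the fetch tool can be sketched as below. This is a hedged illustration only: `make_fetch_tool` and the `MAX_FETCH_CHARS` value are hypothetical, `asyncio.to_thread` stands in for the `sync_to_async` wrapper the PR uses, and the extra `[error: ...]` branch follows the suggestion quoted in the code review rather than the committed code.

```python
import asyncio

MAX_FETCH_CHARS = 20_000  # hypothetical cap on text handed back to the agent


class SSRFValidationError(ValueError):
    """Stand-in for the PR's exception of the same name."""


def make_fetch_tool(fetch_text):
    """Wrap a blocking fetch_text(url) -> str as an async agent tool."""

    async def fetch_tool(url: str) -> str:
        try:
            # Run the blocking fetch off the event loop.
            text = await asyncio.to_thread(fetch_text, url)
        except SSRFValidationError as exc:
            # Policy rejection: report it to the agent instead of raising.
            return f"[blocked: {exc}]"
        except OSError as exc:
            # Network-level failure: also reported as a string.
            return f"[error: {exc}]"
        return text[:MAX_FETCH_CHARS]

    return fetch_tool
```

Returning a marker string keeps one rejected URL from aborting the whole agent run; the agent sees the refusal as a tool result and can try another source.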
…FR SSRF pre-check, audit completeness, agentic sanitize; tests
- safe_http: wrap int(cl) in try/except to raise SSRFValidationError on malformed Content-Length (H1); add DNS-rebind TOCTOU docstring note (H2)
- authority_discovery_service: add httpx.HTTPError to fetch except tuple (O1); skip disabled providers in _provider_for before can_handle check (O2); build and record candidate_record in fetch except block (O3); O4 already present
- authority_gate_service: use word-boundary regex in heading fallback of _verify_key_match to prevent false positives on substring matches (G1); add clarifying comment on GATE_BLOCKED_LICENSE dual use (G2)
- federal_register_provider: call validate_url before requests.get for both step-1 and step-2 HTTP fetches (F1)
- agentic_web_locator_provider: sanitize citation and jurisdiction by stripping non-printable chars and collapsing whitespace before building instructions (A1)
- tests: add multi-A-record, malformed content-length, heading false-positive, fetch-failure audit, disabled-agentic unsupported, ingested audit, safe_fetch usage assertions, and _run_agent construction/sanitize tests
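Two of the fixes above, the prompt-field sanitizer (A1) and the word-boundary heading match (G1), are easy to illustrate. The sketch below is an assumption about their shape, not the committed code: both function names are hypothetical, non-printable characters are replaced with a space before collapsing (so adjacent tokens do not fuse), and lookarounds are used as a word-boundary variant that also works for keys starting with a non-word character.

```python
import re


def sanitize_prompt_field(value: str) -> str:
    """Drop non-printable characters and collapse runs of whitespace."""
    printable = "".join(ch if ch.isprintable() else " " for ch in value)
    return " ".join(printable.split())


def heading_mentions_key(heading: str, key: str) -> bool:
    """True only if key appears in heading as a whole token, not a substring."""
    pattern = r"(?<!\w)" + re.escape(key) + r"(?!\w)"
    return re.search(pattern, heading, flags=re.IGNORECASE) is not None


print(sanitize_prompt_field("42 U.S.C.\n\n§ 1983\x00  ignore"))
print(heading_mentions_key("§ 1983. Civil action", "1983"))
print(heading_mentions_key("§ 19830. Other matter", "1983"))
```

A plain substring test would accept the last heading, which is the false positive the word-boundary fallback is meant to prevent.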
Code Review - Phase 4: SSRF-safe fetch, verify+license gate, agentic locator

Overall this is a well-architected, high-quality PR. The SSRF protections are thorough and the design (opt-in, human-gated, privacy-preserving) is sound. A few specific issues below, ordered by severity.

Bugs / Issues to Fix

1. …

    try:
        text, _ = await sync_to_async(safe_fetch_text)(url)
        return text[:_MAX_FETCH_CHARS]
    except SSRFValidationError as exc:
        return f"[blocked: {exc}]"

If …

    except (SSRFValidationError, httpx.HTTPError, OSError) as exc:
        return f"[error: {exc}]"

2. …
The method patches …

Security Notes

3. Federal Register steps 1+2 use …

Code Quality

4. …

5. Redundant clause in …

    return any(host.endswith("." + a) or host == a for a in allowlist)

The …

    return any(host.endswith("." + a) for a in allowlist)

6. …

    timeout = httpx.Timeout(READ_TIMEOUT_SECONDS, connect=CONNECT_TIMEOUT_SECONDS)
    timeout = httpx.Timeout(connect=CONNECT_TIMEOUT_SECONDS, read=READ_TIMEOUT_SECONDS)

7. …
The field has no bounds validation. An LLM could return …

8. Noisy/dead code in …
The outer two context managers ( …

Design Notes (non-blocking)

9. …
Using …

10. Deferred …
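For context on the allowlist expression quoted in item 5, the sketch below shows what each clause of a suffix-style host check covers. It is only an illustration under assumptions: the function body and the `example.gov` entry are hypothetical, and whether the equality clause is redundant in the PR depends on surrounding code that is not visible here.

```python
def host_on_allowlist(host: str, allowlist: frozenset) -> bool:
    """Accept an allowlisted host or any subdomain of one."""
    host = host.lower()
    # "host == a" covers the bare entry; the dotted suffix covers subdomains.
    # The leading dot stops lookalikes such as "evilexample.gov" from matching.
    return any(host == a or host.endswith("." + a) for a in allowlist)


ALLOW = frozenset({"example.gov"})  # hypothetical entry
for candidate in ("example.gov", "api.example.gov",
                  "evilexample.gov", "example.gov.evil.test"):
    print(candidate, host_on_allowlist(candidate, ALLOW))
```

Taken on its own, the suffix clause does not match the bare entry, since `"example.gov".endswith(".example.gov")` is false; dropping the equality clause is therefore only safe if exact matches are handled elsewhere.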
Closes #1993. Part of the open-vocabulary authority discovery initiative (epic #1997). Stacked on #1999 (Phase 3) — review/merge that first; this PR targets the Phase 3 branch.
Summary
Makes authority ingestion safe and accountable: a centralized SSRF-hardened fetch client, a verify + license gate with visible non-silent outcomes, and an opt-in agentic web locator for authorities no deterministic provider can reach — all approval-gated and privacy-preserving.
What it adds
- SSRF-hardened fetch client (opencontractserver/utils/safe_http.py + constants/safe_http.py): scheme + public-domain .gov host allowlist + reject if ANY resolved address is private/loopback/link-local + manual redirect loop re-validating every hop + streamed size cap + timeouts. All three Phase-3 providers now fetch only through it (one seam each); the Federal Register response-supplied raw_text_url is allowlist-validated before fetch.
- Verify + license gate (enrichment/services/authority_gate_service.py), wired into the orchestrator between fetch and bootstrap. Outcomes are visible, never silent drops: new AuthorityFrontier states blocked_license / unlocated / pending_approval (migration 0088), with an append-only candidate_sources audit trail recording every attempt (provider, license, source domain, verify result, outcome). Verify uses exact canonical-key match with a word-boundary heading fallback (no substring false positives).
- AgenticWebLocatorProvider (opt-in, enabled=False, priority=9999 last-resort, requires_approval=True): a bounded tool-using LLM agent (web-search + the SSRF-safe fetch tool) that locates public-domain authority text. It is given only the normalized citation + jurisdiction, never the citing document text (privacy; citation sanitized against prompt injection). Its results always park at pending_approval; nothing it finds is ingested without a human.

Test plan
- 0088 clean; manage.py check clean (no E001); black/isort/flake8/mypy pass.
- SSRF tests cover scheme / allowlist / multi-A-record private-IP rejection / per-hop redirect re-validation / size cap; gate tests cover license / domain / verify-mismatch / heading-false-positive / pending-approval; audit trail recorded for blocked / unlocated / failed / ingested.

Notes / follow-ups
- … .gov hosts an attacker can't control DNS for, plus the all-addresses-public check. Full DNS-pinning (resolve once, connect to the pinned IP with hostname-as-SNI) is documented in the helper as a follow-up.
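The append-only candidate_sources audit trail described in this PR can be pictured with a small in-memory sketch. Everything here is an assumption for illustration: `FrontierRow` stands in for the Django model, the `"queued"` default state and the record values are made up, and only the record field names (provider, license, source domain, verify result, outcome) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FrontierRow:
    """Hypothetical stand-in for an AuthorityFrontier row."""
    canonical_key: str
    discovery_state: str = "queued"  # hypothetical initial state
    candidate_sources: list = field(default_factory=list)


def mark(row: FrontierRow, state: str,
         candidate_record: Optional[dict[str, Any]] = None) -> FrontierRow:
    """Set the state; if a record is given, append it without touching old ones."""
    row.discovery_state = state
    if candidate_record is not None:
        # Build a new list so earlier entries are never edited in place.
        row.candidate_sources = [*row.candidate_sources, dict(candidate_record)]
    return row


row = FrontierRow("usc:42:1983")
mark(row, "unlocated", {"provider": "us_code", "license": None,
                        "source_domain": None, "verify_result": None,
                        "outcome": "unlocated"})
mark(row, "pending_approval", {"provider": "agentic_web_locator",
                               "license": "public_domain",
                               "source_domain": "a.example.gov",
                               "verify_result": "match",
                               "outcome": "pending_approval"})
print(row.discovery_state, len(row.candidate_sources))
```

Callers that omit candidate_record only change the state, which matches the note that existing callers are unaffected.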