Skip to content

Authority discovery Phase 4 — SSRF-safe fetch, verify+license gate, agentic locator (#1993)#2000

Open
JSv4 wants to merge 6 commits into
feat/authority-phase3-providersfrom
feat/authority-phase4-agentic-gate
Open

Authority discovery Phase 4 — SSRF-safe fetch, verify+license gate, agentic locator (#1993)#2000
JSv4 wants to merge 6 commits into
feat/authority-phase3-providersfrom
feat/authority-phase4-agentic-gate

Conversation

@JSv4

@JSv4 JSv4 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Closes #1993. Part of the open-vocabulary authority discovery initiative (epic #1997). Stacked on #1999 (Phase 3) — review/merge that first; this PR targets the Phase 3 branch.

Summary

Makes authority ingestion safe and accountable: a centralized SSRF-hardened fetch client, a verify + license gate with visible non-silent outcomes, and an opt-in agentic web locator for authorities no deterministic provider can reach — all approval-gated and privacy-preserving.

What it adds

  • SSRF-safe HTTP (opencontractserver/utils/safe_http.py + constants/safe_http.py): scheme + public-domain .gov host allowlist + reject if ANY resolved address is private/loopback/link-local + manual redirect loop re-validating every hop + streamed size cap + timeouts. All three Phase-3 providers now fetch only through it (one seam each); the Federal Register response-supplied raw_text_url is allowlist-validated before fetch.
  • Verify + license gate (enrichment/services/authority_gate_service.py), wired into the orchestrator between fetch and bootstrap. Outcomes are visible, never silent drops — new AuthorityFrontier states blocked_license / unlocated / pending_approval (migration 0088), with an append-only candidate_sources audit trail recording every attempt (provider, license, source domain, verify result, outcome). Verify uses exact canonical-key match with a word-boundary heading fallback (no substring false positives).
  • AgenticWebLocatorProvider (opt-in, enabled=False, priority=9999 last-resort, requires_approval=True): a bounded tool-using LLM agent (web-search + the SSRF-safe fetch tool) that locates public-domain authority text. Given only the normalized citation + jurisdiction — never the citing document text (privacy; citation sanitized against prompt injection). Its results always park at pending_approval — nothing it finds is ingested without a human.

Test plan

  • CI green (full backend suite)
  • Local: 217 tests pass (safe_http, gate, agentic, providers, frontier, discovery) + regression; migration 0088 clean; manage.py check clean (no E001); black/isort/flake8/mypy pass. SSRF tests cover scheme/allowlist/multi-A-record private-IP rejection/per-hop redirect re-validation/size-cap; gate tests cover license/domain/verify-mismatch/heading-false-positive/pending-approval; audit trail recorded for blocked/unlocated/failed/ingested.

Notes / follow-ups

  • DNS-rebind TOCTOU residual: the helper validates DNS at check time while httpx re-resolves at connect time. Mitigated here because the allowlist is a fixed set of .gov hosts an attacker can't control DNS for, plus the all-addresses-public check. Full DNS-pinning (resolve once, connect to the pinned IP with hostname-as-SNI) is documented in the helper as a follow-up.
  • Fourth of four stacked PRs; Phase 5 (Authority discovery Phase 5 — bounded recursive crawl #1994) builds on this branch (the bounded recursive crawl).

JSv4 added 6 commits June 15, 2026 04:35
Part A: opencontractserver/constants/safe_http.py + opencontractserver/utils/safe_http.py
- PUBLIC_DOMAIN_SOURCE_HOSTS frozenset (9 gov hosts), ALLOWED_SCHEMES (https only),
  MAX_REDIRECTS, CONNECT_TIMEOUT_SECONDS, READ_TIMEOUT_SECONDS, MAX_RESPONSE_BYTES.
- SSRFValidationError(ValueError), host_on_allowlist(), _assert_public_ip() (rejects
  private/loopback/link-local/multicast/reserved/unspecified from ANY resolved address),
  validate_url(), safe_fetch_bytes() (manual redirect loop re-validating every hop,
  streamed size cap, timeouts), safe_fetch_text() (UTF-8 decode wrapper).
- 24 tests in opencontractserver/tests/test_safe_http.py (no DB).

Part B: frontier gate states + audit trail
- Appended blocked_license / unlocated / pending_approval to
  AuthorityFrontier.DISCOVERY_STATE_CHOICES (choices-only, all fit max_length=32).
- Migration 0088_authorityfrontier_gate_states (DB no-op AlterField).
- AuthorityFrontierService.mark() gains candidate_record kwarg: append-only audit
  trail written to candidate_sources; existing callers unaffected.
- 7 new tests in test_authority_frontier.py covering append-only semantics and the
  three new state values.
- us_code_provider: replace urllib.request.urlopen + manual chunked read in
  _load_title_xml with safe_fetch_bytes; remove now-redundant _HTTP_TIMEOUT
  and _MAX_DOWNLOAD_BYTES (safe_fetch enforces its own timeout + 500 MB cap).
  Import safe_fetch_bytes at module level so tests can patch it cleanly.

- cfr_provider: replace requests.get + response.content in _fetch_impl with
  safe_fetch_bytes; remove requests import. safe_fetch_bytes handles timeout,
  redirect validation, and size cap centrally.

- federal_register_provider: keep steps 1+2 on requests (steps 1 reads
  Location header without following, step 2 hits a fixed trusted API host).
  Replace step-3 requests.get on the externally-supplied raw_text_url with
  safe_fetch_text; remove manual _FR_ALLOWED_HOST urlparse check (the
  allowlist is now enforced centrally — SSRFValidationError propagates into
  the existing broad except, degrading to abstract as before).

Tests: update HTTP-seam tests to patch safe_fetch_bytes / safe_fetch_text
instead of requests.get / urllib.request.urlopen. All parse assertions
unchanged. 110 tests pass; black/isort/flake8/mypy all clean.
…ted)

Implements Phase 4 of authority discovery: a bounded tool-using LLM agent
that locates official public-domain authority text when no deterministic
provider can handle a canonical key.

- enabled=False by default (opt-in per deployment)
- priority=9999 (absolute last resort after deterministic providers)
- requires_approval=True → gate always parks at pending_approval, never
  auto-ingests; no Document is created without human approval
- Privacy: only citation + jurisdiction passed to agent (never doc text)
- SSRF-safe tools: fetch tool routes through safe_fetch_text; non-allowlisted
  hosts return [blocked:] string rather than raising
- agent.run pattern mirrors Phase 2 (llm_citation_extractor.py):
  resolve_model_spec → abuild_agent_model → make_pydantic_ai_agent with
  output_type=_LocatorOutput; result.output is the validated instance
- 17 tests (unit + integration), all mock _run_agent so no LLM/network calls

Closes #1993
…FR SSRF pre-check, audit completeness, agentic sanitize; tests

- safe_http: wrap int(cl) in try/except to raise SSRFValidationError on malformed
  Content-Length (H1); add DNS-rebind TOCTOU docstring note (H2)
- authority_discovery_service: add httpx.HTTPError to fetch except tuple (O1);
  skip disabled providers in _provider_for before can_handle check (O2);
  build and record candidate_record in fetch except block (O3); O4 already present
- authority_gate_service: use word-boundary regex in heading fallback of
  _verify_key_match to prevent false positives on substring matches (G1);
  add clarifying comment on GATE_BLOCKED_LICENSE dual use (G2)
- federal_register_provider: call validate_url before requests.get for both
  step-1 and step-2 HTTP fetches (F1)
- agentic_web_locator_provider: sanitize citation and jurisdiction by stripping
  non-printable chars and collapsing whitespace before building instructions (A1)
- tests: add multi-A-record, malformed content-length, heading false-positive,
  fetch-failure audit, disabled-agentic unsupported, ingested audit,
  safe_fetch usage assertions, and _run_agent construction/sanitize tests
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — Phase 4: SSRF-safe fetch, verify+license gate, agentic locator

Overall this is a well-architected, high-quality PR. The SSRF protections are thorough and the design (opt-in, human-gated, privacy-preserving) is sound. A few specific issues below, ordered by severity.


Bugs / Issues to Fix

1. _tool_fetch_allowlisted swallows only SSRFValidationError — network errors crash the agent loop

opencontractserver/pipeline/authority_source_providers/agentic_web_locator_provider.py:708-724

try:
    text, _ = await sync_to_async(safe_fetch_text)(url)
    return text[:_MAX_FETCH_CHARS]
except SSRFValidationError as exc:
    return f"[blocked: {exc}]"

If safe_fetch_text raises httpx.HTTPError, httpx.TimeoutException, or OSError (all plausible for .gov hosts), it propagates out of the tool and the pydantic-ai agent loop sees an unhandled exception. The design intent stated in the docstring ("so the agent loop is never interrupted") requires also catching these:

except (SSRFValidationError, httpx.HTTPError, OSError) as exc:
    return f"[error: {exc}]"

2. test_nonprintable_chars_stripped_from_citation has zero assertions

opencontractserver/tests/test_agentic_web_locator.py:1378-1407

The method patches _run_agent, calls _fetch_impl, and then has a comment that says "we need to call _run_agent directly" — but it never does. The test body ends with provider._fetch_impl(req) and no assertions. It passes vacuously. Either add the assertion or delete the test (the intent is covered by test_run_agent_sanitizes_citation_in_place).


Security Notes

3. Federal Register steps 1+2 use requests without a size cap

opencontractserver/pipeline/authority_source_providers/federal_register_provider.py:155-198

validate_url() pre-checks scheme/allowlist/DNS for these requests — good. But requests.get(doc_json_url, ...) has no body size cap. For the metadata JSON this is practically bounded, but it is an inconsistency with the stated "all outbound HTTP through safe_fetch_bytes" design. Worth a follow-up issue or a comment in the code explaining why the JSON responses are exempt.


Code Quality

4. asyncio.get_event_loop() is deprecated in Python 3.10+

opencontractserver/tests/test_agentic_web_locator.py:1128, 1204, 1375, 1455

asyncio.get_event_loop().run_until_complete(coro) emits a deprecation warning in Python 3.10 and will be removed in a future version. Replace with asyncio.run(coro).


5. Redundant clause in host_on_allowlist

opencontractserver/utils/safe_http.py:231

return any(host.endswith("." + a) or host == a for a in allowlist)

The host == a branch is unreachable here because if host in allowlist: return True was already checked above. The suffix-only check is sufficient:

return any(host.endswith("." + a) for a in allowlist)

6. httpx.Timeout positional arg is "total/default", not "read"

opencontractserver/utils/safe_http.py:292

timeout = httpx.Timeout(READ_TIMEOUT_SECONDS, connect=CONNECT_TIMEOUT_SECONDS)

httpx.Timeout(t) sets t as the default for all timeouts (read, write, pool) before the named overrides are applied. Functionally correct since connect=CONNECT_TIMEOUT_SECONDS overrides the connect timeout and the positional value becomes the read/pool default. But the variable name READ_TIMEOUT_SECONDS implies it's read-only, which is misleading. Consider:

timeout = httpx.Timeout(connect=CONNECT_TIMEOUT_SECONDS, read=READ_TIMEOUT_SECONDS)

7. _LocatorOutput.confidence accepts any float

opencontractserver/pipeline/authority_source_providers/agentic_web_locator_provider.py:553

The field has no bounds validation. An LLM could return -0.5 or 999.0. Consider confidence: float = Field(ge=0.0, le=1.0) or just drop confidence — it's unused downstream (the gate doesn't read it).


8. Noisy/dead code in test_ssrf_error_returns_blocked_string

opencontractserver/tests/test_agentic_web_locator.py:1131-1186

The outer two context managers (with patch("...sync_to_async", ...) and the generator-throw trick) don't do anything — the real work is _direct() which patches safe_fetch_text directly. The outer code is dead but visually suggests multiple layered protections. The test would be cleaner as just the inner _direct() path.


Design Notes (non-blocking)

9. GATE_BLOCKED_LICENSE covers two distinct failure modes

opencontractserver/enrichment/services/authority_gate_service.py:439-451

Using blocked_license for both "provider declared non-public-domain license" (check 1) and "source URL host not on allowlist" (check 3) conflates two conceptually different failures. The reason field in candidate_sources preserves the distinction, but the discovery_state alone can't distinguish them. The inline comment acknowledges this is intentional — but it's worth a follow-up issue if the admin UI ever needs to filter by failure reason.

10. Deferred from django.utils import timezone imports

opencontractserver/enrichment/services/authority_discovery_service.py:127, 145

timezone is imported twice inside the method body (once in the except block, once in the gate section). Moving from django.utils import timezone to the module top would be cleaner and consistent with the rest of the file. (The deferred pattern is sometimes used to avoid circular imports — if that's the reason, a comment would help.)


Summary

  • ✅ SSRF protections are thorough and correctly implemented (multi-A-record check, per-hop redirect re-validation, size cap on both headers and bytes)
  • ✅ Gate design (ordered checks, frozen dataclass, non-silent verdicts, append-only audit trail) is clean and well-tested
  • ✅ Agentic provider opt-in/approval design is correct; privacy isolation looks solid
  • ✅ Test coverage is comprehensive (24 SSRF tests + gate unit tests + integration tests)
  • 🔴 Fix _tool_fetch_allowlisted to also catch httpx.HTTPError/OSError (agent loop robustness)
  • 🔴 Remove or fix the zero-assertion test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant