Skip to content

Authority discovery Phase 5 — bounded recursive crawl + governance-graph regimes (#1994)#2001

Open
JSv4 wants to merge 9 commits into
feat/authority-phase4-agentic-gatefrom
feat/authority-phase5-recursive-crawl
Open

Authority discovery Phase 5 — bounded recursive crawl + governance-graph regimes (#1994)#2001
JSv4 wants to merge 9 commits into
feat/authority-phase4-agentic-gatefrom
feat/authority-phase5-recursive-crawl

Conversation

@JSv4

@JSv4 JSv4 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Closes #1994. Final phase of the open-vocabulary authority discovery initiative (epic #1997). Stacked on #2000 (Phase 4) — review/merge that first; this PR targets the Phase 4 branch.

Summary

Makes the reference web self-expanding: a bounded breadth-first crawl that discovers the authorities a corpus cites, then the authorities those cite, reusing the Phase 3/4 provider→gate→bootstrap→relink pipeline — under hard, surfaced bounds that guarantee termination.

What it adds

  • CrawlAuthoritiesService (enrichment/services/crawl_authorities_service.py): seeds the AuthorityFrontier from wanted_authorities (depth 0), then loops — dequeue highest-demand queued row → discover_and_bootstrap → on ingested, re-extract the new authority document's OWN outbound citations and seed them at depth+1. Provably terminating: every iteration marks a row terminal, parks a per-jurisdiction-capped row at deferred_cap (so dequeue_queued never re-returns it), or hits a hard cap. Bounds (all surfaced in the summary, never silent): max_depth, max_authorities, token_budget, min_demand floor, per_jurisdiction_cap. Idempotent: re-crawl creates zero duplicate authorities (get_or_create on the unique canonical_key).
  • crawl_authorities Celery task (@corpus_analyzer_task, registered) + crawl_authorities agent tool (approval-gated, write-permission, exposes all five bounds).
  • Governance-graph regime rendering: doc + ghost nodes now carry jurisdiction / authority_type, and ghost nodes carry the live discovery_state (N+1-free batch join to the frontier) — the "weird ones" (queued / blocked / unlocated) are visually distinct. New nullable GraphQL fields, non-breaking.
  • Bug fix: candidate_keys was truncating dotted/hyphenated whole sections (cfr-40:261.4cfr-40:261, usc-15:80a-1usc-15:80a, sec-rule:10b-5sec-rule:10b), breaking CFR/USC resolution. Now it strips only trailing parenthetical subsection groups. Validated across all callers (find_authority_target, wanted_authorities, relink, graph rollup).

Test plan

  • CI green (full backend suite)
  • Local: 125 tests pass (frontier, crawl engine, tool, governance graph, candidate_keys) + regression; migration 0089 clean; black/isort/flake8/mypy pass. Bounds-termination tests prove each cap sets the right stop_reason and the loop ends; idempotency + dedup + dotted-key tests included.

Notes

JSv4 added 9 commits June 15, 2026 05:45
… child seeding

- constants.py: six Phase-5 crawl bounds (CRAWL_DEFAULT_*) near WANTED_AUTHORITIES_TOP_KEYS
- annotations/models.py: append deferred_cap to DISCOVERY_STATE_CHOICES (parks cap-blocked
  rows so dequeue_queued cannot loop on them)
- migration 0089: AlterField records the new choice value
- authority_frontier_service.py: dequeue_queued (provider-agnostic, demand-ranked,
  depth/demand-bounded) + seed_child_keys (rolls subsection keys to section roots via
  candidate_keys, idempotent get_or_create on unique canonical_key)
- test_authority_frontier.py: DequeueQueuedTests (7 cases) + SeedChildKeysTests (10 cases)
  cover ordering, filtering, depth rollup, state preservation, and no-duplicate invariant;
  AuthorityFrontierGateStateTests extended with deferred_cap round-trip

46 tests, 0 failures
- Add crawl_authorities / acrawl_authorities tool functions (core_tools/corpus_references.py)
  that delegate to CrawlAuthoritiesService.crawl; export both from core_tools/__init__.py.
- Register crawl_authorities in tool_registry.py at all four sites: AVAILABLE_TOOLS
  ToolDefinition (requires_corpus, requires_approval, requires_write_permission),
  lazy import in _populate(), and FUNCTION_MAP entry.
- Extend GovernanceGraphService.build to derive jurisdiction/authority_type for
  doc_nodes (via classify_prefix) and all three fields for ghost_nodes via a single
  batch AuthorityFrontier queryset (eliminates N+1).
- Add jurisdiction, authority_type, and discovery_state nullable string fields to
  GovernanceGraphNodeType; map them in both node branches of resolve_governance_graph.
- Tests: 5 new CrawlAuthoritiesToolRegistryTests (all four registration sites) and
  3 new GovernanceGraphRegimeFieldTests (statute doc_node fields, ghost without
  frontier row, ghost with frontier row carrying discovery_state).

Closes #1994
…up/cap/dotted-key tests

- F1: Import and export crawl_authorities from opencontractserver/tasks/__init__.py
  so Celery auto-discovery and the analyzer auto-sync can find the task.
- F2: Add per_jurisdiction_cap and token_budget params to crawl_authorities /
  acrawl_authorities agent tools; forward to CrawlAuthoritiesService.crawl().
  Defaults fall back to C.CRAWL_DEFAULT_* constants (no magic numbers). Add
  both params to the crawl_authorities ToolDefinition in tool_registry.py.
- F3: Update GovernanceGraphNodeType.discovery_state description to enumerate
  the full current state set including deferred_cap, discovered, resolved,
  unsupported (matches migration choices).
- F4: Add one-line clarifying comments in governance_graph_service.py (doc
  nodes have no discovery_state) and crawl_authorities_service.py (apply
  scan is bounded because authority corpora hold one small doc per section).
- F5: Add three new tests to test_crawl_authorities.py:
    test_extracted_child_reuses_existing_frontier_row — seed_child_keys skips
      a key that already has a frontier row at any state; row count stays 1
      and the existing state is not reset.
    test_deferred_cap_rows_not_re_dequeued — 5 queued us-de rows with
      per_jurisdiction_cap=2 → exactly 2 ingested, 3 parked at deferred_cap,
      discover_and_bootstrap called only twice (loop terminates, no hang).
    test_crawl_with_dotted_section_child_key — cfr-40:261.4 (dot-suffix)
      stored as-is after crawl at parent.depth+1 (candidate_keys only strips
      parenthesised subsection suffixes, not dots).
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — Phase 5: Bounded Recursive Authority Crawl (#1994)

Overall: This is a solid, well-thought-out implementation. The BFS loop is provably terminating, the test suite is comprehensive, and the governance-graph N+1 fix is clean. A few issues worth addressing before merge.


Bugs / Correctness

1. Tool function uses hardcoded defaults instead of constants
opencontractserver/llms/tools/core_tools/corpus_references.py:

def crawl_authorities(
    *,
    max_depth: int = 2,       # should be C.CRAWL_DEFAULT_MAX_DEPTH
    min_demand: int = 2,       # should be C.CRAWL_DEFAULT_MIN_DEMAND
    max_authorities: int = 50, # should be C.CRAWL_DEFAULT_MAX_AUTHORITIES
    ...

These magic numbers duplicate the constants added in enrichment/constants.py. If the defaults are ever tuned, the tool function won't pick up the new values. Per CLAUDE.md's "No magic numbers" rule, all three should reference C.CRAWL_DEFAULT_*. (The service's own signature already does this correctly.)

2. CRAWL_DEFAULT_DOLLAR_BUDGET is dead code
opencontractserver/enrichment/constants.py adds:

CRAWL_DEFAULT_DOLLAR_BUDGET = 0.0  # 0 == unbounded; LLM-tier extraction is opt-in

This constant is never referenced anywhere in the diff — not in CrawlAuthoritiesService, the Celery task, or the tool function. CLAUDE.md is explicit: "No dead code — when deprecating or replacing code, always try to fully replace older code and, once it's no longer in use, delete it." Either wire it up or remove it.

3. Token budget semantics: token_budget=0 skips budget checks
The loop guard is:

if token_budget and tokens_spent >= token_budget:

Using if token_budget (truthiness) means token_budget=0 disables the check entirely. This is documented in the docstring ("0 or negative = unbounded") and is probably intentional, but the Celery task schema sets "minimum": 0 for token_budget, which makes 0 a valid user input with a possibly surprising "disable" semantic. Consider using token_budget > 0 and documenting "pass 0 to disable" more prominently in the schema description.


Code Quality

4. Celery task repeats the None-guarding fallback pattern five times
corpus_analysis_tasks.py does:

max_depth=max_depth if max_depth is not None else C.CRAWL_DEFAULT_MAX_DEPTH,
min_demand=min_demand if min_demand is not None else C.CRAWL_DEFAULT_MIN_DEMAND,
...

five times. CrawlAuthoritiesService.crawl() already has these constants as default parameter values. The task could set its own defaults inline (the current approach) or just forward None and rely on the service defaults if the service accepted Optional[int]. The repetition is mild but inconsistent with DRY. Not a blocker, just worth noting.

5. _park_for_cap is a one-liner wrapper with a docstring longer than its body

@classmethod
def _park_for_cap(cls, row: AuthorityFrontier) -> None:
    """Park a jurisdiction-cap-blocked row at ``deferred_cap``..."""
    AuthorityFrontierService.mark(row, "deferred_cap")

The docstring explains the termination guarantee but duplicates comments that already appear at the call site. Inline the call and keep one comment explaining the termination guarantee. Per CLAUDE.md: "Don't explain WHAT the code does."

6. blocked_by_bound["min_demand_or_depth"] semantics are ambiguous
When dequeue_queued() returns nothing, the code counts all remaining queued rows as "blocked by min_demand or depth":

blocked_by_bound["min_demand_or_depth"] = (
    AuthorityFrontier.objects.filter(discovery_state="queued").count()
)

But some of those rows may have been excluded by depth, not demand, or vice versa. The key name implies both filters caused the block, but the actual count is a union. A comment or split into blocked_by_min_demand vs. blocked_by_max_depth would make the summary more actionable for callers.


Performance

7. _estimate_tokens performs per-authority file I/O in the hot loop

for doc in CorpusDocumentService.get_corpus_documents(user, corpus):
    try:
        total += len(read_field_file_text(doc.txt_extract_file) or "")

For authority corpora (one small document each) this is bounded as documented. But read_field_file_text opens and reads the file on every call — if storage is remote (S3/GCS), this is a network round-trip per authority per crawl iteration. Consider summing doc.txt_extract_file.size (an already-stored DB field) instead of reading the file contents, since the estimate only needs to be conservative, not exact.

8. _state_census() does a full-table GROUP BY after every crawl run

AuthorityFrontier.objects.values("discovery_state").annotate(n=Count("id"))

At current scale this is fine, but as the frontier grows this could be slow if called frequently. Low priority, just flag for when frontier reaches millions of rows.


Testing

The test coverage is excellent — bounds-termination proofs, idempotency, child seeding edge cases, and governance-graph regime fields are all covered. A few observations:

  • test_token_budget_halts asserts assertLessEqual(summary["authorities_ingested"], 1) — the actual value should always be exactly 1 given the mock (1000 tokens patched, 500 budget). Consider assertEqual(1) for a tighter assertion.

  • GovernanceGraphRegimeFieldTests calls EnrichmentService().apply(...) and AuthorityCorpusBootstrapper().bootstrap(...) in setUp, which are heavyweight operations. If these run synchronously in the test process they may be slow. Worth checking if they need @pytest.mark.slow or similar.


Migration

Migration 0089 is clean — a single AlterField on discovery_state to add deferred_cap. It depends on 0088_authorityfrontier_gate_states, which is in the stacked #2000. Merge order matters: #2000 must land first as described.


Summary

Category Items
Must fix before merge #1 (hardcoded defaults), #2 (dead code constant)
Should fix #3 (token_budget=0 semantic clarity), #6 (ambiguous blocked_by_bound key)
Nice to have #4, #5, #7, #8

The architecture is sound, the termination guarantees are well-argued, and the governance-graph N+1 fix is a clear improvement. Once the magic-number defaults and dead-code constant are addressed, this is ready to land.

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant