Authority discovery Phase 5 — bounded recursive crawl + governance-graph regimes (#1994) #2001
… child seeding
- constants.py: six Phase-5 crawl bounds (CRAWL_DEFAULT_*) near WANTED_AUTHORITIES_TOP_KEYS
- annotations/models.py: append deferred_cap to DISCOVERY_STATE_CHOICES (parks cap-blocked rows so dequeue_queued cannot loop on them)
- migration 0089: AlterField records the new choice value
- authority_frontier_service.py: dequeue_queued (provider-agnostic, demand-ranked, depth/demand-bounded) + seed_child_keys (rolls subsection keys to section roots via candidate_keys, idempotent get_or_create on unique canonical_key)
- test_authority_frontier.py: DequeueQueuedTests (7 cases) + SeedChildKeysTests (10 cases) cover ordering, filtering, depth rollup, state preservation, and no-duplicate invariant; AuthorityFrontierGateStateTests extended with deferred_cap round-trip
46 tests, 0 failures
- Add crawl_authorities / acrawl_authorities tool functions (core_tools/corpus_references.py) that delegate to CrawlAuthoritiesService.crawl; export both from core_tools/__init__.py.
- Register crawl_authorities in tool_registry.py at all four sites: AVAILABLE_TOOLS ToolDefinition (requires_corpus, requires_approval, requires_write_permission), lazy import in _populate(), and FUNCTION_MAP entry.
- Extend GovernanceGraphService.build to derive jurisdiction/authority_type for doc_nodes (via classify_prefix) and all three fields for ghost_nodes via a single batch AuthorityFrontier queryset (eliminates N+1).
- Add jurisdiction, authority_type, and discovery_state nullable string fields to GovernanceGraphNodeType; map them in both node branches of resolve_governance_graph.
- Tests: 5 new CrawlAuthoritiesToolRegistryTests (all four registration sites) and 3 new GovernanceGraphRegimeFieldTests (statute doc_node fields, ghost without frontier row, ghost with frontier row carrying discovery_state).
Closes #1994
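The N+1 elimination for ghost nodes mentioned in this commit can be illustrated without Django. The following is a minimal sketch, not the repository's code: `FrontierRow`, `annotate_ghost_nodes`, and the `fetch_rows` callable are hypothetical stand-ins, and it assumes the frontier can be queried for many canonical keys in one round trip.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class FrontierRow:
    # Hypothetical stand-in for a frontier row; field names follow the PR text.
    canonical_key: str
    jurisdiction: Optional[str]
    authority_type: Optional[str]
    discovery_state: str


def annotate_ghost_nodes(
    ghost_keys: list[str],
    fetch_rows: Callable[[set[str]], Iterable[FrontierRow]],
) -> list[dict]:
    """Attach regime fields to ghost nodes using one batched fetch.

    ``fetch_rows`` is an assumed callable that returns every frontier row
    matching the given keys in a single call, instead of one call per node.
    """
    # One lookup for all keys, indexed in memory for O(1) access per node.
    rows_by_key = {row.canonical_key: row for row in fetch_rows(set(ghost_keys))}
    nodes = []
    for key in ghost_keys:
        row = rows_by_key.get(key)  # None when the ghost has no frontier row
        nodes.append(
            {
                "key": key,
                "jurisdiction": row.jurisdiction if row else None,
                "authority_type": row.authority_type if row else None,
                "discovery_state": row.discovery_state if row else None,
            }
        )
    return nodes
```

A ghost node without a frontier row simply gets `None` for all three fields, which matches the nullable GraphQL fields described above.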
…up/cap/dotted-key tests
- F1: Import and export crawl_authorities from opencontractserver/tasks/__init__.py
  so Celery auto-discovery and the analyzer auto-sync can find the task.
- F2: Add per_jurisdiction_cap and token_budget params to crawl_authorities /
  acrawl_authorities agent tools; forward to CrawlAuthoritiesService.crawl().
  Defaults fall back to C.CRAWL_DEFAULT_* constants (no magic numbers). Add
  both params to the crawl_authorities ToolDefinition in tool_registry.py.
- F3: Update GovernanceGraphNodeType.discovery_state description to enumerate
  the full current state set including deferred_cap, discovered, resolved,
  unsupported (matches migration choices).
- F4: Add one-line clarifying comments in governance_graph_service.py (doc
  nodes have no discovery_state) and crawl_authorities_service.py (apply
  scan is bounded because authority corpora hold one small doc per section).
- F5: Add three new tests to test_crawl_authorities.py:
  - test_extracted_child_reuses_existing_frontier_row — seed_child_keys skips
    a key that already has a frontier row at any state; row count stays 1
    and the existing state is not reset.
  - test_deferred_cap_rows_not_re_dequeued — 5 queued us-de rows with
    per_jurisdiction_cap=2 → exactly 2 ingested, 3 parked at deferred_cap,
    discover_and_bootstrap called only twice (loop terminates, no hang).
  - test_crawl_with_dotted_section_child_key — cfr-40:261.4 (dot-suffix)
    stored as-is after crawl at parent.depth+1 (candidate_keys only strips
    parenthesised subsection suffixes, not dots).
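The key-rollup rule that the dotted-key test relies on (strip trailing parenthesised subsection groups, leave dots and hyphens alone) can be sketched as a single regular expression. This is an illustration only: `strip_subsection_suffix` is a hypothetical helper, and the real `candidate_keys` may return several candidates or handle cases not shown here.

```python
import re

# One or more parenthesised groups anchored at the end of the key, e.g. "(a)(1)".
_TRAILING_PARENS = re.compile(r"(?:\([^()]*\))+$")


def strip_subsection_suffix(key: str) -> str:
    """Roll a subsection key up to its section root.

    Only trailing parenthetical groups are removed. Dotted or hyphenated
    section numbers such as "261.4" or "80a-1" are whole sections and are
    returned unchanged.
    """
    return _TRAILING_PARENS.sub("", key)
```

Under this rule `cfr-40:261.4(a)(1)` rolls up to `cfr-40:261.4`, while `cfr-40:261.4`, `usc-15:80a-1`, and `sec-rule:10b-5` pass through untouched.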
Code Review — Phase 5: Bounded Recursive Authority Crawl (#1994)

Overall: This is a solid, well-thought-out implementation. The BFS loop is provably terminating, the test suite is comprehensive, and the governance-graph N+1 fix is clean. A few issues worth addressing before merge.

Bugs / Correctness

1. Tool function uses hardcoded defaults instead of constants

    def crawl_authorities(
        *,
        max_depth: int = 2,  # should be C.CRAWL_DEFAULT_MAX_DEPTH
        min_demand: int = 2,  # should be C.CRAWL_DEFAULT_MIN_DEMAND
        max_authorities: int = 50,  # should be C.CRAWL_DEFAULT_MAX_AUTHORITIES
        ...

These magic numbers duplicate the constants added in …

2. …

    CRAWL_DEFAULT_DOLLAR_BUDGET = 0.0  # 0 == unbounded; LLM-tier extraction is opt-in

This constant is never referenced anywhere in the diff — not in …

3. Token budget semantics: …

    if token_budget and tokens_spent >= token_budget:

Using …

Code Quality

4. Celery task repeats the …

    max_depth=max_depth if max_depth is not None else C.CRAWL_DEFAULT_MAX_DEPTH,
    min_demand=min_demand if min_demand is not None else C.CRAWL_DEFAULT_MIN_DEMAND,
    ...

… five times.

5. …

    @classmethod
    def _park_for_cap(cls, row: AuthorityFrontier) -> None:
        """Park a jurisdiction-cap-blocked row at ``deferred_cap``..."""
        AuthorityFrontierService.mark(row, "deferred_cap")

The docstring explains the termination guarantee but duplicates comments that already appear at the call site. Inline the call and keep one comment explaining the termination guarantee. Per CLAUDE.md: "Don't explain WHAT the code does."

6. …

    blocked_by_bound["min_demand_or_depth"] = (
        AuthorityFrontier.objects.filter(discovery_state="queued").count()
    )

But some of those rows may have been excluded by depth, not demand, or vice versa. The key name implies both filters caused the block, but the actual count is a union. A comment or split into …

Performance

7. …

    for doc in CorpusDocumentService.get_corpus_documents(user, corpus):
        try:
            total += len(read_field_file_text(doc.txt_extract_file) or "")

For authority corpora (one small document each) this is bounded as documented. But …

8. …

    AuthorityFrontier.objects.values("discovery_state").annotate(n=Count("id"))

At current scale this is fine, but as the frontier grows this could be slow if called frequently. Low priority, just flag for when frontier reaches millions of rows.

Testing

The test coverage is excellent — bounds-termination proofs, idempotency, child seeding edge cases, and governance-graph regime fields are all covered. A few observations: …

Migration

Migration …

Summary

The architecture is sound, the termination guarantees are well-argued, and the governance-graph N+1 fix is a clear improvement. Once the magic-number defaults and dead-code constant are addressed, this is ready to land.
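Two of the review points (hardcoded defaults, and the repeated `x if x is not None else default` expression) could be addressed by resolving all bounds in one place. The sketch below is an assumption about how that might look, not the PR's code: `CrawlBounds`, `DEFAULTS`, `resolve_bounds`, and `budget_exhausted` are hypothetical names, and the default values are placeholders standing in for the `C.CRAWL_DEFAULT_*` constants.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CrawlBounds:
    # Hypothetical container; the real bounds live in constants.py.
    max_depth: int
    min_demand: int
    max_authorities: int


# Placeholder values for illustration; in the project these would be read
# from the CRAWL_DEFAULT_* constants rather than written out again.
DEFAULTS = CrawlBounds(max_depth=2, min_demand=2, max_authorities=50)


def resolve_bounds(defaults: CrawlBounds, **overrides: Optional[int]) -> CrawlBounds:
    """Apply caller overrides, treating None as "use the default".

    An explicit 0 is kept, because only None means "not supplied".
    """
    supplied = {name: value for name, value in overrides.items() if value is not None}
    return replace(defaults, **supplied)


def budget_exhausted(spent: int, budget: Optional[int]) -> bool:
    """Mirror the truthiness check quoted in the review.

    With ``if token_budget and ...``, a budget of None or 0 never trips
    the bound, so both behave as "unbounded".
    """
    return bool(budget) and spent >= budget
```

For example, `resolve_bounds(DEFAULTS, max_depth=None, min_demand=0)` keeps the default depth but honours the explicit zero demand floor, and `budget_exhausted(100, 0)` is `False`. Whether 0 should mean "unbounded" or "no tokens allowed" is a design decision this sketch makes visible rather than settles.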
Codecov Report
✅ All modified and coverable lines are covered by tests.
Closes #1994. Final phase of the open-vocabulary authority discovery initiative (epic #1997). Stacked on #2000 (Phase 4) — review/merge that first; this PR targets the Phase 4 branch.
Summary
Makes the reference web self-expanding: a bounded breadth-first crawl that discovers the authorities a corpus cites, then the authorities those cite, reusing the Phase 3/4 provider→gate→bootstrap→relink pipeline — under hard, surfaced bounds that guarantee termination.
What it adds
- CrawlAuthoritiesService (enrichment/services/crawl_authorities_service.py): seeds the AuthorityFrontier from wanted_authorities (depth 0), then loops — dequeue highest-demand queued row → discover_and_bootstrap → on ingested, re-extract the new authority document's OWN outbound citations and seed them at depth+1. Provably terminating: every iteration marks a row terminal, parks a per-jurisdiction-capped row at deferred_cap (so dequeue_queued never re-returns it), or hits a hard cap. Bounds (all surfaced in the summary, never silent): max_depth, max_authorities, token_budget, min_demand floor, per_jurisdiction_cap. Idempotent: re-crawl creates zero duplicate authorities (get_or_create on the unique canonical_key).
- crawl_authorities Celery task (@corpus_analyzer_task, registered) + crawl_authorities agent tool (approval-gated, write-permission, exposes all five bounds).
- … jurisdiction/authority_type, and ghost nodes carry the live discovery_state (N+1-free batch join to the frontier) — the "weird ones" (queued / blocked / unlocated) are visually distinct. New nullable GraphQL fields, non-breaking.
- candidate_keys was truncating dotted/hyphenated whole sections (cfr-40:261.4 → cfr-40:261, usc-15:80a-1 → usc-15:80a, sec-rule:10b-5 → sec-rule:10b), breaking CFR/USC resolution. Now it strips only trailing parenthetical subsection groups. Validated across all callers (find_authority_target, wanted_authorities, relink, graph rollup).
Test plan
… 0089 clean; black/isort/flake8/mypy pass. Bounds-termination tests prove each cap sets the right stop_reason and the loop ends; idempotency + dedup + dotted-key tests included.
Notes
… manage.py sync_doc_analyzers (standard for @corpus_analyzer_task). Auto-installing the crawl as an ADD_DOCUMENT CorpusAction (so the graph grows as documents arrive) is a documented follow-up — the tool + task are the primary surfaces here.
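The termination argument for the crawl loop described under "What it adds" can be made concrete with a small in-memory model. This is a sketch under stated assumptions, not the service itself: `Row`, `crawl`, `fetch_children`, and the `stop_reason` strings are hypothetical, a plain dict stands in for the frontier table, and the token budget is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Row:
    # Hypothetical in-memory stand-in for a frontier row.
    key: str
    jurisdiction: str
    demand: int
    depth: int
    state: str = "queued"


def crawl(
    frontier: dict[str, Row],
    fetch_children: Callable[[Row], Optional[list[tuple[str, str, int]]]],
    *,
    max_depth: int,
    min_demand: int,
    max_authorities: int,
    per_jurisdiction_cap: int,
) -> dict:
    """Bounded breadth-first crawl over an in-memory frontier.

    ``fetch_children`` is an assumed callable: it returns the child
    (key, jurisdiction, demand) tuples cited by an ingested authority,
    or None when the authority could not be ingested.
    """
    ingested = 0
    per_jurisdiction: dict[str, int] = {}
    stop_reason = "frontier_exhausted"
    while True:
        if ingested >= max_authorities:
            stop_reason = "max_authorities"
            break
        eligible = [
            r for r in frontier.values()
            if r.state == "queued" and r.depth <= max_depth and r.demand >= min_demand
        ]
        if not eligible:
            break
        row = max(eligible, key=lambda r: r.demand)  # highest demand first
        if per_jurisdiction.get(row.jurisdiction, 0) >= per_jurisdiction_cap:
            row.state = "deferred_cap"  # parked, so it is never dequeued again
            continue
        children = fetch_children(row)
        if children is None:
            row.state = "unsupported"  # terminal state, row leaves the queue
            continue
        row.state = "ingested"
        ingested += 1
        per_jurisdiction[row.jurisdiction] = per_jurisdiction.get(row.jurisdiction, 0) + 1
        for key, jurisdiction, demand in children:
            # Idempotent seeding: an existing row is kept and its state is not reset.
            frontier.setdefault(key, Row(key, jurisdiction, demand, row.depth + 1))
    return {"ingested": ingested, "stop_reason": stop_reason}
```

Every pass through the loop either stops or moves one queued row out of the `queued` state, and new rows are only added after an ingest, which `max_authorities` caps. That is the shape of the termination guarantee the PR describes. With five queued rows in one jurisdiction and `per_jurisdiction_cap=2`, this model ingests two rows, parks three at `deferred_cap`, and calls `fetch_children` twice.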