Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991)#1998
JSv4 wants to merge 5 commits into
Implements Phase 2 of authority discovery: a sliding-window LLM pass that uses pydantic-ai structured output to extract citation spans, then verifies each span against the source text to reject hallucinations before returning Candidate objects with detection_tier="llm".

- constants.py: add LLM_CONFIDENCE_FLOOR, LLM_CHUNK_WINDOW, LLM_CHUNK_OVERLAP, LLM_STRUCTURED_RETRIES for the Phase 2 tier
- llm_citation_extractor.py: CitationCandidate/ChunkCitationExtraction pydantic schemas; verify_and_place() with exact-match fast path and drift-recovery fallback; jurisdiction/authority-type normalisation maps; LLMCitationExtractor class with async aextract() sliding-window loop and dedup on (start,end,key)
- test_llm_citation_extractor.py: 22 tests covering pure helpers and async integration (TransactionTestCase + pydantic-ai TestModel)
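
The offset-preserving sliding window described in this commit can be sketched as below. This is a minimal illustration, not the PR's implementation: `iter_chunks` and `to_absolute` are hypothetical names, the sketch assumes `overlap >= 0`, and the up-front `window` check is one possible way to avoid a zero step.

```python
from typing import Iterator, Tuple


def iter_chunks(text: str, window: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (absolute_offset, chunk) pairs covering ``text``.

    Hypothetical helper: name and signature are illustrative only.
    """
    if window <= 0:
        raise ValueError(f"window must be positive, got {window!r}")
    # If the overlap would stall progress, fall back to non-overlapping chunks.
    step = window - overlap
    if step <= 0:
        step = window
    for offset in range(0, len(text), step):
        yield offset, text[offset : offset + window]
        if offset + window >= len(text):
            break  # this window already reached the end of the text


def to_absolute(chunk_offset: int, local_start: int, local_end: int) -> Tuple[int, int]:
    """Translate a chunk-local span into document-level offsets."""
    return chunk_offset + local_start, chunk_offset + local_end
```

Carrying the chunk's starting offset alongside its text is what lets a span reported relative to a chunk be mapped back to a position in the full document.
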
…ities tool

- EnrichmentService._resolutions(): initialise LLMCitationExtractor outside the per-doc loop when DETECTION_TIER_LLM is in active_tiers; call async_to_sync(extractor.aextract)(text) per document and reconcile with the grammar/registry results.
- EnrichmentService.discover(): add use_llm=False param; when True, include DETECTION_TIER_LLM in the extra_tiers passed to _resolutions(). Split the result list into main_resolutions (needs_review=False) and review_resolutions (needs_review=True); roll up only main_resolutions into by_key / by_jurisdiction / by_authority_type; expose review_resolutions as a new review_candidates key in the return dict. total_candidates still counts all.
- EnrichmentService.apply(): strip needs_review candidates from resolutions before passing to EnrichmentWriter so low-confidence LLM detections are never persisted as CorpusReference rows without human review.
- corpus_references.py: add use_llm=False to discover_authorities() and adiscover_authorities(); forward to EnrichmentService.discover().
- tool_registry.py: add use_llm parameter tuple to the discover_authorities ToolDefinition.
- opencontractserver/tests/test_enrichment_llm_integration.py: four sync TransactionTestCase integration tests covering high-confidence LLM citations appearing in by_key, low-confidence ones going to review_candidates only, LLM not called when use_llm=False, and apply() not persisting low-confidence LLM citations.
- changelog.d/authority-llm-detection.added.md: fragment for the new feature.
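
The split between the main roll-up and the review bucket can be sketched as follows. This is a simplified illustration under stated assumptions: `Resolution`, `summarize`, and `persistable` are hypothetical stand-ins, only the `by_key` roll-up is shown, and the real service returns richer objects.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Resolution:
    """Hypothetical stand-in for one resolved candidate."""

    canonical_key: str
    needs_review: bool = False


def summarize(resolutions: List[Resolution]) -> Dict[str, object]:
    """Roll up trusted resolutions; expose review-bucket items separately."""
    main = [r for r in resolutions if not r.needs_review]
    review = [r for r in resolutions if r.needs_review]
    return {
        "total_candidates": len(resolutions),  # counts both buckets
        "by_key": dict(Counter(r.canonical_key for r in main)),
        "review_candidates": [r.canonical_key for r in review],
    }


def persistable(resolutions: List[Resolution]) -> List[Resolution]:
    """Filter applied before writing: review-bucket items are never persisted."""
    return [r for r in resolutions if not r.needs_review]
```

Keeping `total_candidates` as a count of both buckets while rolling up only the trusted ones means a caller can see that something was held back for review without it affecting the per-key counts.
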
…ence recovery, batched bridge, privacy notes, tests

- Force temperature=0 for all LLM providers in _one_shot_structured so the same citation yields a stable key across overlapping chunks
- Change cross-chunk dedup key from (start, end, canonical_key) to (start, end) so nondeterministic keys cannot produce duplicate Candidates for the same span
- verify_and_place drift recovery now collects all occurrences of raw_text in the chunk and picks the one nearest the LLM's claimed offset; rejects empty/whitespace raw_text before searching
- Add ge=0 to CitationCandidate.start and .end Field definitions
- Add privacy notices to module docstring and discover_authorities docstring
- Batch all per-document LLM extraction in _resolutions into one async_to_sync call instead of one event loop per document
- New tests: nearest-occurrence, same-span-two-keys dedup, multi-chunk absolute offsets, confidence floor boundary, unknown jurisdiction/type passthrough, and grammar-beats-LLM on overlap
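
The verification and dedup behaviour described in this commit can be sketched roughly as below. This is an assumption-laden illustration, not the PR's code: the signature of `verify_and_place` here is a guess, candidates are modelled as plain `(start, end, key)` tuples, and `dedup_by_span` is a hypothetical name.

```python
from typing import List, Optional, Set, Tuple


def verify_and_place(
    chunk: str, raw_text: str, start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Return a verified (start, end) span in ``chunk``, or None if hallucinated."""
    if not raw_text.strip():
        return None  # reject empty or whitespace-only spans before searching
    if chunk[start:end] == raw_text:
        return start, end  # exact-match fast path
    # Drift recovery: collect every occurrence, then keep the one nearest
    # to the offset the model claimed.
    positions = []
    pos = chunk.find(raw_text)
    while pos != -1:
        positions.append(pos)
        pos = chunk.find(raw_text, pos + 1)
    if not positions:
        return None  # the text never appears in the chunk
    best = min(positions, key=lambda p: abs(p - start))
    return best, best + len(raw_text)


def dedup_by_span(
    candidates: List[Tuple[int, int, str]]
) -> List[Tuple[int, int, str]]:
    """Keep the first candidate seen for each (start, end) span."""
    seen: Set[Tuple[int, int]] = set()
    unique = []
    for start, end, key in candidates:
        if (start, end) in seen:
            continue
        seen.add((start, end))
        unique.append((start, end, key))
    return unique
```

Deduplicating on the span alone means two chunks that assign different keys to the same text still collapse to one candidate, which is the motivation the commit message gives for dropping the key from the dedup tuple.
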
Code Review — Authority Discovery Phase 2: Tier-2b LLM Citation Detection

Overall: This is well-architected work. The hallucination rejection, confidence-floor review bucket, reconcile precedence, and privacy warnings are all handled correctly. Test coverage is solid (27 unit + 5 integration). The issues below are a mix of one real bug, several polish items, and a couple of minor design flags.

Potential Bug
```python
step = self._window - self._overlap
if step <= 0:
    step = self._window  # still 0 if window=0
```

If a caller passes …

```python
if self._window <= 0:
    raise ValueError(f"window must be positive, got {self._window!r}")
```

Magic Number in Output

```python
REVIEW_CANDIDATE_RAW_TEXT_MAX_LEN = 120
```

Sequential Processing Comment Mismatch
```python
# Run all documents sequentially in a single async_to_sync bridge.
# Sequential is intentional: avoids LLM-provider rate limit bursts
# and keeps per-document error isolation simple.
```

Jurisdiction Map — "Washington" Ambiguity

Jurisdiction Map Coverage Gap
```python
def _normalize_jurisdiction(s: str) -> str | None:
    key = s.strip().lower()
    if not key:
        return None
    result = _JURISDICTION_MAP.get(key)
    if result is None:
        logger.debug("Unknown jurisdiction %r — no canonical code", s)
    return result
```

Test Robustness

Hardcoded confidence boundary ( …

```python
results_below = await _run_with_confidence(0.69)
```

This should use …

Fragile chunk-offset detection in dedup test ( …

```python
chunk_start_offset = text.index(chunk_text[:20])
```

This breaks if the first 20 characters of a later chunk happen to appear earlier in the text. In this specific test the text is …

Module-level patching vs …

Several tests do:

```python
original = mod.abuild_agent_model
mod.abuild_agent_model = fake_build
try:
    ...
finally:
    mod.abuild_agent_model = original
```
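
One common alternative to the manual save-and-restore pattern above is `unittest.mock.patch.object`, which restores the attribute automatically when the `with` block exits, even on failure. The review heading is truncated, so whether this is the alternative the reviewer had in mind is an assumption. In this sketch, `mod` is a `SimpleNamespace` standing in for the imported module, and `real_build` / `fake_build` are hypothetical coroutines.

```python
import asyncio
from types import SimpleNamespace
from unittest import mock


async def real_build():
    return "real-model"


async def fake_build():
    return "fake-model"


# Stand-in for the imported module object (``mod`` in the tests above).
mod = SimpleNamespace(abuild_agent_model=real_build)


async def run_with_fake():
    # patch.object swaps the attribute for the duration of the block and
    # restores the original on exit, including when an exception is raised.
    with mock.patch.object(mod, "abuild_agent_model", fake_build):
        return await mod.abuild_agent_model()


def demo():
    return asyncio.run(run_with_fake())
```

After `demo()` returns, `mod.abuild_agent_model` is the original coroutine function again, with no `try`/`finally` in the test body.
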
Minor API Inconsistency
What's Done Well
Summary: One real bug (infinite loop on …
Codecov Report

✅ All modified and coverable lines are covered by tests.
Closes #1991. Part of the open-vocabulary authority discovery initiative (epic #1997); builds on Phase 0+1 (#1990, merged).
Summary
Adds an opt-in LLM detection tier that catches citations to legal authorities the deterministic grammars miss, such as prose references and obscure regimes ("the Guam Administrative Adjudication Law"). The trusted registry + grammar tiers are unchanged; the LLM tier is off by default and composes via the existing `reconcile()` precedence (registry > grammar > llm).

What it does
- `opencontractserver/enrichment/llm_citation_extractor.py`, `LLMCitationExtractor.aextract(text)`: offset-preserving sliding-window chunking → one-shot structured LLM call per chunk (via the sanctioned `make_pydantic_ai_agent` chokepoint, `instructions=`, `output_type=`, `temperature=0`) → every returned span is verified to exist in the source text (hallucinations are rejected; drifted offsets are re-anchored to the nearest occurrence of `raw_text`) → emits `Candidate`s tagged `detection_tier="llm"`. Cross-chunk dedup is on span `(start, end)` so an overlap-zone citation can't survive twice.
- Low-confidence candidates (`< LLM_CONFIDENCE_FLOOR`) are flagged `needs_review`: surfaced under `discover()["review_candidates"]`, never auto-promoted to persistent mentions (`apply()` filters them out).
- `EnrichmentService._resolutions` runs the LLM tier (batched in a single `async_to_sync` over all docs) when `DETECTION_TIER_LLM` is in `extra_tiers`; `discover(use_llm=False)` and the `discover_authorities` tool gain a `use_llm` flag.
- `use_llm=True` transmits document text to the configured external LLM provider.

Test plan
- … `TransactionTestCase` (run with `coroutine-never-awaited` promoted to errors).
- No migration (no schema change); `manage.py check` clean (no E001); black/isort/flake8/mypy pass.
- … `apply()` never auto-promoting review-bucket items.

Notes / follow-ups
- `abuild_agent_model` uses `sync_to_async` with the default `thread_sensitive=True` (less defensive than `_db_sync_to_async`'s `thread_sensitive=False`). The bridge is functionally safe on the production path (agent tool → `_db_sync_to_async` thread → `async_to_sync`), so this pre-existing shared-infra nit is left out of scope for this feature PR.