Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991) by JSv4 · Pull Request #1998 · Open-Source-Legal/OpenContracts

JSv4 · 2026-06-15T07:57:34Z

Closes #1991. Part of the open-vocabulary authority discovery initiative (epic #1997); builds on Phase 0+1 (#1990, merged).

Summary

Adds an opt-in LLM detection tier that catches citations to legal authorities the deterministic grammars miss — prose references and obscure regimes ("the Guam Administrative Adjudication Law"). The trusted registry + grammar tiers are unchanged; the LLM tier is off by default and composes via the existing reconcile() precedence (registry > grammar > llm).

What it does

opencontractserver/enrichment/llm_citation_extractor.py — LLMCitationExtractor.aextract(text): offset-preserving sliding-window chunking → one-shot structured LLM call per chunk (via the sanctioned make_pydantic_ai_agent chokepoint, instructions=, output_type=, temperature=0) → every returned span is verified to exist in the source text (hallucinations are rejected; drifted offsets are re-anchored to the nearest occurrence of raw_text) → emits Candidates tagged detection_tier="llm". Cross-chunk dedup is on span (start, end) so an overlap-zone citation can't survive twice.
Review bucket — verified-but-low-confidence spans (< LLM_CONFIDENCE_FLOOR) are flagged needs_review: surfaced under discover()["review_candidates"], never auto-promoted to persistent mentions (apply() filters them out).
Wiring — EnrichmentService._resolutions runs the LLM tier (batched in a single async_to_sync over all docs) when DETECTION_TIER_LLM is in extra_tiers; discover(use_llm=False) and the discover_authorities tool gain a use_llm flag.
Privacy — module + tool docstrings explicitly note that use_llm=True transmits document text to the configured external LLM provider.

Test plan

CI green (full backend suite)
Local: 32 Phase-2 tests (27 extractor unit + 5 service integration) + 107 comprehensive enrichment regression — all pass. Async tests verified to genuinely execute under Django 5.2 TransactionTestCase (run with coroutine-never-awaited promoted to errors). No migration (no schema change); manage.py check clean (no E001); black/isort/flake8/mypy pass.
Covered: hallucination rejection, nearest-occurrence offset recovery, confidence floor (incl. exact-boundary 0.7), multi-chunk + overlap dedup, same-span/different-key dedup, reconcile(grammar beats llm), unknown jurisdiction/type, apply() never auto-promoting review-bucket items.

Notes / follow-ups

Adversarial review noted that the shared abuild_agent_model uses sync_to_async with the default thread_sensitive=True (less defensive than _db_sync_to_async's thread_sensitive=False). The bridge is functionally safe on the production path (agent tool → _db_sync_to_async thread → async_to_sync), so this pre-existing shared-infra nit is left out of scope for this feature PR.
This is the second of four stacked PRs for the initiative; Phase 3 (Authority discovery Phase 3 — frontier + source-provider registry #1992) builds on this branch.

Implements Phase 2 of authority discovery: a sliding-window LLM pass that uses pydantic-ai structured output to extract citation spans, then verifies each span against the source text to reject hallucinations before returning Candidate objects with detection_tier="llm". - constants.py: add LLM_CONFIDENCE_FLOOR, LLM_CHUNK_WINDOW, LLM_CHUNK_OVERLAP, LLM_STRUCTURED_RETRIES for the Phase 2 tier - llm_citation_extractor.py: CitationCandidate/ChunkCitationExtraction pydantic schemas; verify_and_place() with exact-match fast path and drift-recovery fallback; jurisdiction/authority-type normalisation maps; LLMCitationExtractor class with async aextract() sliding-window loop and dedup on (start,end,key) - test_llm_citation_extractor.py: 22 tests covering pure helpers and async integration (TransactionTestCase + pydantic-ai TestModel)

…ities tool - EnrichmentService._resolutions(): initialise LLMCitationExtractor outside the per-doc loop when DETECTION_TIER_LLM is in active_tiers; call async_to_sync(extractor.aextract)(text) per document and reconcile with the grammar/registry results. - EnrichmentService.discover(): add use_llm=False param; when True, include DETECTION_TIER_LLM in the extra_tiers passed to _resolutions(). Split the result list into main_resolutions (needs_review=False) and review_resolutions (needs_review=True); roll up only main_resolutions into by_key / by_jurisdiction / by_authority_type; expose review_resolutions as a new review_candidates key in the return dict. total_candidates still counts all. - EnrichmentService.apply(): strip needs_review candidates from resolutions before passing to EnrichmentWriter so low-confidence LLM detections are never persisted as CorpusReference rows without human review. - corpus_references.py: add use_llm=False to discover_authorities() and adiscover_authorities(); forward to EnrichmentService.discover(). - tool_registry.py: add use_llm parameter tuple to the discover_authorities ToolDefinition. - opencontractserver/tests/test_enrichment_llm_integration.py: four sync TransactionTestCase integration tests covering high-confidence LLM citations appearing in by_key, low-confidence ones going to review_candidates only, LLM not called when use_llm=False, and apply() not persisting low-confidence LLM citations. - changelog.d/authority-llm-detection.added.md: fragment for the new feature.

…ence recovery, batched bridge, privacy notes, tests - Force temperature=0 for all LLM providers in _one_shot_structured so the same citation yields a stable key across overlapping chunks - Change cross-chunk dedup key from (start, end, canonical_key) to (start, end) so nondeterministic keys cannot produce duplicate Candidates for the same span - verify_and_place drift recovery now collects all occurrences of raw_text in the chunk and picks the one nearest the LLM's claimed offset; rejects empty/whitespace raw_text before searching - Add ge=0 to CitationCandidate.start and .end Field definitions - Add privacy notices to module docstring and discover_authorities docstring - Batch all per-document LLM extraction in _resolutions into one async_to_sync call instead of one event loop per document - New tests: nearest-occurrence, same-span-two-keys dedup, multi-chunk absolute offsets, confidence floor boundary, unknown jurisdiction/type passthrough, and grammar-beats-LLM on overlap

…xtractor

claude · 2026-06-15T08:00:20Z

Code Review — Authority Discovery Phase 2: Tier-2b LLM Citation Detection

Overall: This is well-architected work. The hallucination rejection, confidence-floor review bucket, reconcile precedence, and privacy warnings are all handled correctly. Test coverage is solid (27 unit + 5 integration). The issues below are a mix of one real bug, several polish items, and a couple of minor design flags.

Potential Bug

window=0 causes an infinite loop (llm_citation_extractor.py:354-358)

step = self._window - self._overlap
if step <= 0:
    step = self._window   # still 0 if window=0

If a caller passes window=0, step is set to self._window which is also 0, and while pos < len(text) never terminates. The production constants are sane (2000/400) and the public constructor uses those defaults, but a guard would prevent a future caller from discovering this the hard way:

if self._window <= 0:
    raise ValueError(f"window must be positive, got {self._window!r}")

Magic Number in Output

services/enrichment_service.py:569 — raw_text[:120] in the review_candidates serialisation is a magic number. Given the project rule on constants, this should live in constants.py:

REVIEW_CANDIDATE_RAW_TEXT_MAX_LEN = 120

Sequential Processing Comment Mismatch

enrichment_service.py:472-483 — The comment says "Batch the LLM tier in ONE event loop (avoids per-document loop churn)", but _extract_all is a sequential for loop, not a batch. The real benefit is consolidating the async_to_sync bridge to one call rather than one per document — that's worth saying explicitly so the next reader doesn't wonder why asyncio.gather wasn't used:

# Run all documents sequentially in a single async_to_sync bridge.
# Sequential is intentional: avoids LLM-provider rate limit bursts
# and keeps per-document error isolation simple.

Jurisdiction Map — "Washington" Ambiguity

llm_citation_extractor.py:110 maps "washington" to "us-wa". The LLM will sometimes produce "Washington" for Washington D.C. (a distinct jurisdiction from Washington State). Consider mapping "washington dc" / "district of columbia" explicitly, or at minimum add a comment that "washington" is state-only and D.C. references will fall through to None.

Jurisdiction Map Coverage Gap

_JURISDICTION_MAP covers only 9 of 50 US states. That is expected for Phase 2, but unknown US states (Ohio, Georgia, Virginia, etc.) silently produce jurisdiction=None rather than any fallback. A warning-level log line on the miss path would make this visible during testing with real documents:

def _normalize_jurisdiction(s: str) -> str | None:
    key = s.strip().lower()
    if not key:
        return None
    result = _JURISDICTION_MAP.get(key)
    if result is None:
        logger.debug("Unknown jurisdiction %r — no canonical code", s)
    return result

Test Robustness

Hardcoded confidence boundary (test_llm_citation_extractor.py:1512)

results_below = await _run_with_confidence(0.69)

This should use C.LLM_CONFIDENCE_FLOOR - 0.01 so the test stays correct if the constant is ever tuned.

Fragile chunk-offset detection in dedup test (test_llm_citation_extractor.py:1356)

chunk_start_offset = text.index(chunk_text[:20])

This breaks if the first 20 characters of a later chunk happen to appear earlier in the text. In this specific test the text is "See " + citation + " " + "A"*200 so it's safe in practice, but it's brittle enough to flag. Using the actual pos counter (or just hard-coding the expected values from the known window/overlap sizes) would be more reliable.

Module-level patching vs unittest.mock.patch

Several tests do:

original = mod.abuild_agent_model
mod.abuild_agent_model = fake_build
try:
    ...
finally:
    mod.abuild_agent_model = original

unittest.mock.patch (or pytest-mock's mocker.patch.object) does the same thing but is exception-safe and produces cleaner test code. The manual pattern is fine but the finally block in test_discover_no_llm restores via original which was set to should_not_be_called, not the real function — actually wait, original is captured before the replacement, so that's fine. No bug, just a style note.

Minor API Inconsistency

discover() gains a use_llm=True flag, but apply() requires callers to pass extra_tiers=[C.DETECTION_TIER_GRAMMAR, C.DETECTION_TIER_LLM] explicitly. This is a deliberate design choice (more explicit for the write path), but it's surprising since discover() hides the tier list from callers. A brief comment in apply()'s docstring noting this asymmetry would help future callers.

What's Done Well

Hallucination rejection with nearest-occurrence drift recovery is solid and well-tested.
The needs_review review bucket is a clean safety valve — low-confidence items surface but cannot be auto-promoted.
Reconcile precedence (registry > grammar > LLM) is preserved and tested (test_reconcile_grammar_beats_llm_on_overlap).
instructions= (not system_prompt=) is correct per CLAUDE.md pitfall Bump coverage from 6.2 to 6.5.0 #14.
temperature: 0 for deterministic output in the overlap zone is a good call.
Privacy note is explicit in both the module docstring and the tool's docstring.
async_to_sync bridge is placed correctly — one call wrapping all documents rather than one per document.
Changelog fragment follows the project convention; no migration required.

Summary: One real bug (infinite loop on window=0), one constant to extract, and a handful of polish items. Nothing blocking ship from a correctness standpoint, but the window=0 guard is worth landing before merge.

+    async def test_aextract_empty_text(self):
+        """Empty text → empty list, no model call."""
+        extractor = LLMCitationExtractor()
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+        """One valid citation → one Candidate with correct abs offsets, tier=llm."""
+        from pydantic_ai.models.test import TestModel
+
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+        """Low confidence (< LLM_CONFIDENCE_FLOOR) → Candidate present but needs_review=True."""
+        from pydantic_ai.models.test import TestModel
+
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+        """Citation raw_text absent from text → dropped."""
+        from pydantic_ai.models.test import TestModel
+
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+        """
+        from pydantic_ai.models.test import TestModel
+
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+
+    async def test_aextract_multi_chunk(self):
+        """Text longer than window → chunked; Candidate offsets are absolute."""
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+
+    async def test_aextract_confidence_boundary(self):
+        """Confidence exactly at LLM_CONFIDENCE_FLOOR → not needs_review; just below → needs_review."""
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+
+    async def test_aextract_unknown_jurisdiction_type(self):
+        """Unknown jurisdiction and authority_type produce a Candidate with None for both."""
+        import opencontractserver.enrichment.llm_citation_extractor as mod


+_TEXT = "This entity is governed by the Guam Administrative Adjudication Law in all respects."
+
+# Canonical key the LLM will return
+_CANON_KEY = "act:guam-administrative-adjudication-law"


codecov · 2026-06-15T08:21:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

JSv4 added 5 commits June 15, 2026 01:56

Fix flake8 line length in LLM integration test

5b407b6

Restore make_pydantic_ai_agent chokepoint and fix mypy types in LLM e…

bc877a5

…xtractor

github-code-quality Bot found potential problems Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991)#1998

Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991)#1998
JSv4 wants to merge 5 commits into
mainfrom
feat/authority-phase2-llm-detection

JSv4 commented Jun 15, 2026

Uh oh!

claude Bot commented Jun 15, 2026

Uh oh!

codecov Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

JSv4 commented Jun 15, 2026

Summary

What it does

Test plan

Notes / follow-ups

Uh oh!

claude Bot commented Jun 15, 2026

Code Review — Authority Discovery Phase 2: Tier-2b LLM Citation Detection

Potential Bug

Magic Number in Output

Sequential Processing Comment Mismatch

Jurisdiction Map — "Washington" Ambiguity

Jurisdiction Map Coverage Gap

Test Robustness

Minor API Inconsistency

What's Done Well

Uh oh!

codecov Bot commented Jun 15, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant