Skip to content

Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991)#1998

Open
JSv4 wants to merge 5 commits into
mainfrom
feat/authority-phase2-llm-detection
Open

Authority discovery Phase 2 — Tier-2b LLM citation detection (#1991)#1998
JSv4 wants to merge 5 commits into
mainfrom
feat/authority-phase2-llm-detection

Conversation

@JSv4

@JSv4 JSv4 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Closes #1991. Part of the open-vocabulary authority discovery initiative (epic #1997); builds on Phase 0+1 (#1990, merged).

Summary

Adds an opt-in LLM detection tier that catches citations to legal authorities the deterministic grammars miss — prose references and obscure regimes ("the Guam Administrative Adjudication Law"). The trusted registry + grammar tiers are unchanged; the LLM tier is off by default and composes via the existing reconcile() precedence (registry > grammar > llm).

What it does

  • opencontractserver/enrichment/llm_citation_extractor.pyLLMCitationExtractor.aextract(text): offset-preserving sliding-window chunking → one-shot structured LLM call per chunk (via the sanctioned make_pydantic_ai_agent chokepoint, instructions=, output_type=, temperature=0) → every returned span is verified to exist in the source text (hallucinations are rejected; drifted offsets are re-anchored to the nearest occurrence of raw_text) → emits Candidates tagged detection_tier="llm". Cross-chunk dedup is on span (start, end) so an overlap-zone citation can't survive twice.
  • Review bucket — verified-but-low-confidence spans (< LLM_CONFIDENCE_FLOOR) are flagged needs_review: surfaced under discover()["review_candidates"], never auto-promoted to persistent mentions (apply() filters them out).
  • WiringEnrichmentService._resolutions runs the LLM tier (batched in a single async_to_sync over all docs) when DETECTION_TIER_LLM is in extra_tiers; discover(use_llm=False) and the discover_authorities tool gain a use_llm flag.
  • Privacy — module + tool docstrings explicitly note that use_llm=True transmits document text to the configured external LLM provider.

Test plan

  • CI green (full backend suite)
  • Local: 32 Phase-2 tests (27 extractor unit + 5 service integration) + 107 comprehensive enrichment regression — all pass. Async tests verified to genuinely execute under Django 5.2 TransactionTestCase (run with coroutine-never-awaited promoted to errors). No migration (no schema change); manage.py check clean (no E001); black/isort/flake8/mypy pass.
  • Covered: hallucination rejection, nearest-occurrence offset recovery, confidence floor (incl. exact-boundary 0.7), multi-chunk + overlap dedup, same-span/different-key dedup, reconcile(grammar beats llm), unknown jurisdiction/type, apply() never auto-promoting review-bucket items.

Notes / follow-ups

  • Adversarial review noted that the shared abuild_agent_model uses sync_to_async with the default thread_sensitive=True (less defensive than _db_sync_to_async's thread_sensitive=False). The bridge is functionally safe on the production path (agent tool → _db_sync_to_async thread → async_to_sync), so this pre-existing shared-infra nit is left out of scope for this feature PR.
  • This is the second of four stacked PRs for the initiative; Phase 3 (Authority discovery Phase 3 — frontier + source-provider registry #1992) builds on this branch.

JSv4 added 5 commits June 15, 2026 01:56
Implements Phase 2 of authority discovery: a sliding-window LLM pass that
uses pydantic-ai structured output to extract citation spans, then verifies
each span against the source text to reject hallucinations before returning
Candidate objects with detection_tier="llm".

- constants.py: add LLM_CONFIDENCE_FLOOR, LLM_CHUNK_WINDOW, LLM_CHUNK_OVERLAP,
  LLM_STRUCTURED_RETRIES for the Phase 2 tier
- llm_citation_extractor.py: CitationCandidate/ChunkCitationExtraction pydantic
  schemas; verify_and_place() with exact-match fast path and drift-recovery
  fallback; jurisdiction/authority-type normalisation maps; LLMCitationExtractor
  class with async aextract() sliding-window loop and dedup on (start,end,key)
- test_llm_citation_extractor.py: 22 tests covering pure helpers and async
  integration (TransactionTestCase + pydantic-ai TestModel)
…ities tool

- EnrichmentService._resolutions(): initialise LLMCitationExtractor outside the
  per-doc loop when DETECTION_TIER_LLM is in active_tiers; call
  async_to_sync(extractor.aextract)(text) per document and reconcile with
  the grammar/registry results.

- EnrichmentService.discover(): add use_llm=False param; when True, include
  DETECTION_TIER_LLM in the extra_tiers passed to _resolutions(). Split the
  result list into main_resolutions (needs_review=False) and review_resolutions
  (needs_review=True); roll up only main_resolutions into by_key /
  by_jurisdiction / by_authority_type; expose review_resolutions as a new
  review_candidates key in the return dict. total_candidates still counts all.

- EnrichmentService.apply(): strip needs_review candidates from resolutions
  before passing to EnrichmentWriter so low-confidence LLM detections are
  never persisted as CorpusReference rows without human review.

- corpus_references.py: add use_llm=False to discover_authorities() and
  adiscover_authorities(); forward to EnrichmentService.discover().

- tool_registry.py: add use_llm parameter tuple to the discover_authorities
  ToolDefinition.

- opencontractserver/tests/test_enrichment_llm_integration.py: four sync
  TransactionTestCase integration tests covering high-confidence LLM
  citations appearing in by_key, low-confidence ones going to
  review_candidates only, LLM not called when use_llm=False, and apply()
  not persisting low-confidence LLM citations.

- changelog.d/authority-llm-detection.added.md: fragment for the new feature.
…ence recovery, batched bridge, privacy notes, tests

- Force temperature=0 for all LLM providers in _one_shot_structured so the
  same citation yields a stable key across overlapping chunks
- Change cross-chunk dedup key from (start, end, canonical_key) to (start, end)
  so nondeterministic keys cannot produce duplicate Candidates for the same span
- verify_and_place drift recovery now collects all occurrences of raw_text in
  the chunk and picks the one nearest the LLM's claimed offset; rejects
  empty/whitespace raw_text before searching
- Add ge=0 to CitationCandidate.start and .end Field definitions
- Add privacy notices to module docstring and discover_authorities docstring
- Batch all per-document LLM extraction in _resolutions into one async_to_sync
  call instead of one event loop per document
- New tests: nearest-occurrence, same-span-two-keys dedup, multi-chunk absolute
  offsets, confidence floor boundary, unknown jurisdiction/type passthrough, and
  grammar-beats-LLM on overlap
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — Authority Discovery Phase 2: Tier-2b LLM Citation Detection

Overall: This is well-architected work. The hallucination rejection, confidence-floor review bucket, reconcile precedence, and privacy warnings are all handled correctly. Test coverage is solid (27 unit + 5 integration). The issues below are a mix of one real bug, several polish items, and a couple of minor design flags.


Potential Bug

window=0 causes an infinite loop (llm_citation_extractor.py:354-358)

step = self._window - self._overlap
if step <= 0:
    step = self._window   # still 0 if window=0

If a caller passes window=0, step is set to self._window which is also 0, and while pos < len(text) never terminates. The production constants are sane (2000/400) and the public constructor uses those defaults, but a guard would prevent a future caller from discovering this the hard way:

if self._window <= 0:
    raise ValueError(f"window must be positive, got {self._window!r}")

Magic Number in Output

services/enrichment_service.py:569raw_text[:120] in the review_candidates serialisation is a magic number. Given the project rule on constants, this should live in constants.py:

REVIEW_CANDIDATE_RAW_TEXT_MAX_LEN = 120

Sequential Processing Comment Mismatch

enrichment_service.py:472-483 — The comment says "Batch the LLM tier in ONE event loop (avoids per-document loop churn)", but _extract_all is a sequential for loop, not a batch. The real benefit is consolidating the async_to_sync bridge to one call rather than one per document — that's worth saying explicitly so the next reader doesn't wonder why asyncio.gather wasn't used:

# Run all documents sequentially in a single async_to_sync bridge.
# Sequential is intentional: avoids LLM-provider rate limit bursts
# and keeps per-document error isolation simple.

Jurisdiction Map — "Washington" Ambiguity

llm_citation_extractor.py:110 maps "washington" to "us-wa". The LLM will sometimes produce "Washington" for Washington D.C. (a distinct jurisdiction from Washington State). Consider mapping "washington dc" / "district of columbia" explicitly, or at minimum add a comment that "washington" is state-only and D.C. references will fall through to None.


Jurisdiction Map Coverage Gap

_JURISDICTION_MAP covers only 9 of 50 US states. That is expected for Phase 2, but unknown US states (Ohio, Georgia, Virginia, etc.) silently produce jurisdiction=None rather than any fallback. A warning-level log line on the miss path would make this visible during testing with real documents:

def _normalize_jurisdiction(s: str) -> str | None:
    key = s.strip().lower()
    if not key:
        return None
    result = _JURISDICTION_MAP.get(key)
    if result is None:
        logger.debug("Unknown jurisdiction %r — no canonical code", s)
    return result

Test Robustness

Hardcoded confidence boundary (test_llm_citation_extractor.py:1512)

results_below = await _run_with_confidence(0.69)

This should use C.LLM_CONFIDENCE_FLOOR - 0.01 so the test stays correct if the constant is ever tuned.

Fragile chunk-offset detection in dedup test (test_llm_citation_extractor.py:1356)

chunk_start_offset = text.index(chunk_text[:20])

This breaks if the first 20 characters of a later chunk happen to appear earlier in the text. In this specific test the text is "See " + citation + " " + "A"*200 so it's safe in practice, but it's brittle enough to flag. Using the actual pos counter (or just hard-coding the expected values from the known window/overlap sizes) would be more reliable.

Module-level patching vs unittest.mock.patch

Several tests do:

original = mod.abuild_agent_model
mod.abuild_agent_model = fake_build
try:
    ...
finally:
    mod.abuild_agent_model = original

unittest.mock.patch (or pytest-mock's mocker.patch.object) does the same thing but is exception-safe and produces cleaner test code. The manual pattern is fine but the finally block in test_discover_no_llm restores via original which was set to should_not_be_called, not the real function — actually wait, original is captured before the replacement, so that's fine. No bug, just a style note.


Minor API Inconsistency

discover() gains a use_llm=True flag, but apply() requires callers to pass extra_tiers=[C.DETECTION_TIER_GRAMMAR, C.DETECTION_TIER_LLM] explicitly. This is a deliberate design choice (more explicit for the write path), but it's surprising since discover() hides the tier list from callers. A brief comment in apply()'s docstring noting this asymmetry would help future callers.


What's Done Well

  • Hallucination rejection with nearest-occurrence drift recovery is solid and well-tested.
  • The needs_review review bucket is a clean safety valve — low-confidence items surface but cannot be auto-promoted.
  • Reconcile precedence (registry > grammar > LLM) is preserved and tested (test_reconcile_grammar_beats_llm_on_overlap).
  • instructions= (not system_prompt=) is correct per CLAUDE.md pitfall Bump coverage from 6.2 to 6.5.0 #14.
  • temperature: 0 for deterministic output in the overlap zone is a good call.
  • Privacy note is explicit in both the module docstring and the tool's docstring.
  • async_to_sync bridge is placed correctly — one call wrapping all documents rather than one per document.
  • Changelog fragment follows the project convention; no migration required.

Summary: One real bug (infinite loop on window=0), one constant to extract, and a handful of polish items. Nothing blocking ship from a correctness standpoint, but the window=0 guard is worth landing before merge.

async def test_aextract_empty_text(self):
"""Empty text → empty list, no model call."""
extractor = LLMCitationExtractor()
import opencontractserver.enrichment.llm_citation_extractor as mod
"""One valid citation → one Candidate with correct abs offsets, tier=llm."""
from pydantic_ai.models.test import TestModel

import opencontractserver.enrichment.llm_citation_extractor as mod
"""Low confidence (< LLM_CONFIDENCE_FLOOR) → Candidate present but needs_review=True."""
from pydantic_ai.models.test import TestModel

import opencontractserver.enrichment.llm_citation_extractor as mod
"""Citation raw_text absent from text → dropped."""
from pydantic_ai.models.test import TestModel

import opencontractserver.enrichment.llm_citation_extractor as mod
"""
from pydantic_ai.models.test import TestModel

import opencontractserver.enrichment.llm_citation_extractor as mod

async def test_aextract_multi_chunk(self):
"""Text longer than window → chunked; Candidate offsets are absolute."""
import opencontractserver.enrichment.llm_citation_extractor as mod

async def test_aextract_confidence_boundary(self):
"""Confidence exactly at LLM_CONFIDENCE_FLOOR → not needs_review; just below → needs_review."""
import opencontractserver.enrichment.llm_citation_extractor as mod

async def test_aextract_unknown_jurisdiction_type(self):
"""Unknown jurisdiction and authority_type produce a Candidate with None for both."""
import opencontractserver.enrichment.llm_citation_extractor as mod
_TEXT = "This entity is governed by the Guam Administrative Adjudication Law in all respects."

# Canonical key the LLM will return
_CANON_KEY = "act:guam-administrative-adjudication-law"
@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Authority discovery Phase 2 — Tier-2b LLM citation detection

1 participant