feat: OCR fallback and page router #5

Merged
sid732 merged 7 commits into main from feat/ocr-router on Jun 18, 2026

@sid732 sid732 commented Jun 18, 2026

Owner

Completes the routing core: classify each PDF page, then either keep its embedded text or render it and run on-device OCR. End to end, a PDF now becomes per-page text tagged with its source.
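The per-page decision described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Source`, `PageRoute`, and `RouteResult` are names from this PR, but their fields, the enum values, and the `route_pages` helper with its injected `needs_ocr` and `ocr_page` callables are assumptions made so the example runs without a PDF library or the OCR binary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Source(Enum):
    TEXT = "text"  # embedded text was kept (enum values are assumed)
    OCR = "ocr"    # page was rendered and OCR'd


@dataclass(frozen=True)
class PageRoute:
    # Field layout is an assumption; the PR only names the type.
    index: int
    source: Source
    text: str


@dataclass(frozen=True)
class RouteResult:
    pages: tuple[PageRoute, ...]

    def text(self, separator: str = "\n\n") -> str:
        """Join the per-page text in page order."""
        return separator.join(page.text for page in self.pages)


def route_pages(
    embedded_texts: Sequence[str],
    needs_ocr: Callable[[str], bool],
    ocr_page: Callable[[int], str],
) -> RouteResult:
    """Hypothetical stand-in for route_pdf that works on already-extracted text.

    needs_ocr plays the role of the page classifier and ocr_page the role of
    "render the page, then OCR it"; both are injected so nothing here touches
    a real PDF or the lcr-ocr binary.
    """
    routes = []
    for index, embedded in enumerate(embedded_texts):
        if needs_ocr(embedded):
            routes.append(PageRoute(index, Source.OCR, ocr_page(index)))
        else:
            routes.append(PageRoute(index, Source.TEXT, embedded))
    return RouteResult(tuple(routes))
```

For instance, `route_pages(["Hello", "   "], lambda t: not t.strip(), lambda i: f"ocr-{i}")` tags the first page `Source.TEXT` and the second `Source.OCR`. The blank-text check is only a placeholder for whatever classifier the project actually uses.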

What

  • route_pdf(path) → RouteResult. Digital pages keep extracted text (Source.TEXT); scanned/garbled pages are rendered and OCR'd (Source.OCR).
  • OCR bridge (ocr.py): locates lcr-ocr (LCR_OCR_BIN → PATH → bundled dev build), invokes it, parses JSON into OcrLine. ocr_png_text joins lines above a confidence threshold.
  • Rendering (Pdf.render_page_png): rasterizes a page via pypdfium2 and encodes PNG with Pillow.
  • Types: BoundingBox, OcrLine, Source, PageRoute, RouteResult.
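As a rough sketch of the OCR bridge items above, the lookup order and the confidence filter might look like this. The JSON schema (`{"lines": [{"text", "confidence"}]}`), the `0.5` default threshold, the reduced `OcrLine` fields (no `BoundingBox`), and the helper names `locate_ocr_binary`, `parse_ocr_json`, and `join_confident_lines` are all assumptions for illustration; the PR does not spell them out.

```python
import json
import os
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass(frozen=True)
class OcrLine:
    # Reduced, assumed shape: the real type likely also carries a BoundingBox.
    text: str
    confidence: float


def locate_ocr_binary(bundled: Optional[Path] = None) -> Optional[Path]:
    """Try LCR_OCR_BIN, then PATH, then a bundled dev build, in that order."""
    override = os.environ.get("LCR_OCR_BIN")
    if override:
        return Path(override)
    on_path = shutil.which("lcr-ocr")
    if on_path:
        return Path(on_path)
    if bundled is not None and bundled.is_file():
        return bundled
    return None


def parse_ocr_json(payload: str) -> list[OcrLine]:
    """Parse OCR output, assuming {"lines": [{"text": ..., "confidence": ...}]}."""
    data = json.loads(payload)
    return [
        OcrLine(text=str(item["text"]), confidence=float(item["confidence"]))
        for item in data.get("lines", [])
    ]


def join_confident_lines(lines: Iterable[OcrLine], min_confidence: float = 0.5) -> str:
    """Keep lines at or above the threshold and join them with newlines."""
    return "\n".join(line.text for line in lines if line.confidence >= min_confidence)
```

Keeping the lookup and the parsing as separate functions is what lets them be unit-tested without the binary: the parser only needs a JSON string, and the lookup only needs an environment variable or a fake path.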

Tests

  • Unit (no binary): JSON parsing, binary location/override, digital-page routing.
  • Integration (@pytest.mark.integration, skips if the binary is absent): real OCR of a rendered image, and a scanned image-only PDF routed through OCR. Fixtures build text and image-only PDFs in-process.
  • CI gains an integration job that builds lcr-ocr and runs the marked tests against it.
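One way the "skips if the binary is absent" behaviour could be wired up is sketched below. It assumes the `integration` marker is registered in the project's pytest configuration, and it checks only `PATH` for brevity. The `requires_ocr_binary` name and the test body are hypothetical.

```python
import shutil

import pytest

# Assumption: only PATH is consulted here. The PR's own lookup also honours
# LCR_OCR_BIN and a bundled dev build.
OCR_BINARY = shutil.which("lcr-ocr")

# Reusable marker: the test is skipped, not failed, when the binary is missing.
requires_ocr_binary = pytest.mark.skipif(
    OCR_BINARY is None,
    reason="lcr-ocr binary not found",
)


@pytest.mark.integration
@requires_ocr_binary
def test_scanned_page_goes_through_ocr() -> None:
    # Hypothetical body: a real test would build an image-only PDF, route it,
    # and assert that the page came back tagged as OCR.
    assert OCR_BINARY is not None
```

With this layout, `pytest -m "not integration"` runs only the unit tests, while `pytest -m integration` selects the binary-backed ones.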

Deps

Adds pillow (PNG encoding of rendered pages).

Verified locally: ruff, ruff format, mypy (strict), and pytest (22 tests, including 2 real-OCR integration tests) all pass.

sid732 added 7 commits June 18, 2026 13:53

  • Pillow encodes rendered pages to PNG for OCR. Register the 'integration' pytest marker for tests that invoke the built lcr-ocr binary.
  • Add BoundingBox, OcrLine, the Source enum (text vs ocr), PageRoute, and RouteResult (with joined-text helper).
  • Add Pdf.render_page_png, which rasterizes a page via pypdfium2 and encodes it to PNG bytes for the OCR fallback.
  • Locate the binary (LCR_OCR_BIN, then PATH, then the bundled dev build), invoke it, and parse its JSON into OcrLine. ocr_png_text OCRs PNG bytes and joins the lines above a confidence threshold.
  • route_pdf classifies each page and either keeps its extracted text or renders and OCRs it, returning a per-page RouteResult.
  • Unit-test JSON parsing, binary location, and digital routing; add binary-backed integration tests for OCR and scanned-page routing, with shared PDF-builder fixtures.
  • Build lcr-ocr and run the integration-marked tests against it on macOS.

@sid732 sid732 merged commit 27ccd29 into main Jun 18, 2026
6 checks passed
@sid732 sid732 deleted the feat/ocr-router branch June 23, 2026 03:12