feat: OCR fallback and page router by sid732 · Pull Request #5 · sid732/LocalContextRouter

sid732 · 2026-06-18T17:54:28Z

Completes the routing core: classify each PDF page, then either keep its embedded text or render it and run on-device OCR. End to end, a PDF now becomes per-page text tagged with its source.

What

route_pdf(path) → RouteResult. Digital pages keep extracted text (Source.TEXT); scanned/garbled pages are rendered and OCR'd (Source.OCR).
OCR bridge (ocr.py): locates lcr-ocr (LCR_OCR_BIN → PATH → bundled dev build), invokes it, parses JSON into OcrLine. ocr_png_text joins lines above a confidence threshold.
Rendering (Pdf.render_page_png): rasterizes a page via pypdfium2 and encodes PNG with Pillow.
Types: BoundingBox, OcrLine, Source, PageRoute, RouteResult.
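The parse-and-filter step of the OCR bridge could look roughly like the sketch below. The JSON shape (a top-level "lines" array of objects with "text" and "confidence"), the 0.5 default threshold, the inclusive comparison, and the helper names parse_ocr_json and join_confident_lines are all assumptions for illustration; the PR does not show the lcr-ocr output schema, and the OcrLine here is trimmed down (no bounding box).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrLine:
    text: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def parse_ocr_json(payload: str) -> list[OcrLine]:
    """Parse OCR output into lines.

    Assumed shape: {"lines": [{"text": "...", "confidence": 0.9}, ...]}
    """
    data = json.loads(payload)
    return [
        OcrLine(text=str(item["text"]), confidence=float(item["confidence"]))
        for item in data["lines"]
    ]


def join_confident_lines(lines: list[OcrLine], min_confidence: float = 0.5) -> str:
    """Join the text of lines that meet the threshold, one per output line."""
    return "\n".join(
        line.text for line in lines if line.confidence >= min_confidence
    )
```

Keeping parsing separate from filtering means the parser can be unit-tested on canned JSON without the binary, which matches the unit/integration split described under Tests.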

Tests

Unit (no binary): JSON parsing, binary location/override, digital-page routing.
Integration (@pytest.mark.integration, skips if the binary is absent): real OCR of a rendered image, and a scanned image-only PDF routed through OCR. Fixtures build text and image-only PDFs in-process.
CI gains an integration job that builds lcr-ocr and runs the marked tests against it.
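For reference, a binary-gated integration test might be shaped like this. It is a sketch only: the lookup here checks just LCR_OCR_BIN and PATH (not the bundled dev build), and the names OCR_BIN, requires_ocr, and the test function are hypothetical, not taken from the PR.

```python
import os
import shutil

import pytest

# Assumption: the binary is discoverable through LCR_OCR_BIN or PATH.
OCR_BIN = os.environ.get("LCR_OCR_BIN") or shutil.which("lcr-ocr")

# Skip (rather than fail) when the binary has not been built.
requires_ocr = pytest.mark.skipif(
    OCR_BIN is None, reason="lcr-ocr binary not found"
)


@pytest.mark.integration
@requires_ocr
def test_ocr_binary_is_executable() -> None:
    # Smoke check only; the real tests OCR a rendered image and a scanned PDF.
    assert OCR_BIN is not None
    assert os.access(OCR_BIN, os.X_OK)
```

With the marker registered, `pytest -m integration` selects only these tests, and `pytest -m "not integration"` leaves them out.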

Deps

Adds pillow (PNG encoding of rendered pages).

Verified locally: ruff, ruff format, mypy (strict), and pytest (22 tests, including 2 real-OCR integration tests) all pass.

sid732 added 7 commits June 18, 2026 13:53

build: add pillow dependency and integration test marker

ef6919b

Pillow encodes rendered pages to PNG for OCR. Register the 'integration' pytest marker for tests that invoke the built lcr-ocr binary.

feat(core): add OCR and routing data types

1fc425d

Add BoundingBox, OcrLine, the Source enum (text vs ocr), PageRoute, and RouteResult (with joined-text helper).
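A minimal sketch of how these types might fit together. Only the type names and the TEXT/OCR members come from this PR; every field name, the enum values, and the joined_text signature are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    TEXT = "text"  # embedded text was kept as-is
    OCR = "ocr"    # page was rendered and OCR'd


@dataclass(frozen=True)
class BoundingBox:
    # Field names and units are assumptions; the PR does not list them.
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class OcrLine:
    text: str
    confidence: float
    box: BoundingBox


@dataclass(frozen=True)
class PageRoute:
    index: int      # zero-based page index (assumed)
    source: Source  # where this page's text came from
    text: str


@dataclass(frozen=True)
class RouteResult:
    pages: tuple[PageRoute, ...]

    def joined_text(self, separator: str = "\n\n") -> str:
        """Concatenate page texts in page order."""
        return separator.join(page.text for page in self.pages)
```

Frozen dataclasses are one reasonable choice for result types like these, since they are hashable and cannot be mutated after routing.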

feat(core): render PDF pages to PNG

b20fff9

Add Pdf.render_page_png, which rasterizes a page via pypdfium2 and encodes it to PNG bytes for the OCR fallback.
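The render step can be sketched as a free function. This assumes the pypdfium2 4.x API (PdfDocument, page.render(scale=...), bitmap.to_pil()) and is not the actual Pdf.render_page_png method; the parameter names and the default scale are illustrative.

```python
import io


def render_page_png(pdf_path: str, page_index: int, scale: float = 2.0) -> bytes:
    """Rasterize one PDF page and return it as PNG bytes."""
    # Imported lazily so this module still imports where the rendering
    # stack (pypdfium2 + Pillow) is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        page = pdf[page_index]
        # scale=1.0 corresponds to 72 DPI, so 2.0 renders at about 144 DPI.
        image = page.render(scale=scale).to_pil()
        # Encode while the document is still open, then hand back plain bytes.
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        return buffer.getvalue()
    finally:
        pdf.close()
```

Returning bytes rather than a PIL image keeps the OCR side independent of Pillow: it only needs something it can write to disk or pipe to the binary.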

feat(core): bridge to the lcr-ocr binary

d76e480

Locate the binary (LCR_OCR_BIN, then PATH, then the bundled dev build), invoke it, and parse its JSON into OcrLine. ocr_png_text OCRs PNG bytes and joins the lines above a confidence threshold.
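The lookup order could be implemented along these lines. The function name, the dev_build parameter, and the choice to trust the LCR_OCR_BIN override without checking that the file exists are assumptions; invocation is omitted because the PR does not show the lcr-ocr command line.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def locate_ocr_binary(dev_build: Optional[Path] = None) -> Optional[Path]:
    """Find lcr-ocr: env override first, then PATH, then a dev build."""
    # 1. Explicit override wins, so tests and CI can pin a specific build.
    override = os.environ.get("LCR_OCR_BIN")
    if override:
        return Path(override)

    # 2. An installed binary on PATH.
    found = shutil.which("lcr-ocr")
    if found:
        return Path(found)

    # 3. Fall back to a locally built binary, if the caller knows where it is.
    if dev_build is not None and dev_build.is_file():
        return dev_build

    return None
```

Returning None instead of raising lets callers decide what a missing binary means: an integration test can skip, while the router can raise a clear error.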

feat(core): route pages between text and OCR

bddb3a8

route_pdf classifies each page and either keeps its extracted text or renders and OCRs it, returning a per-page RouteResult.
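The routing decision might be structured as below. The classifier is a stand-in: the PR does not describe how pages are classified, so looks_digital, its thresholds, and the route_pages signature (which takes pre-extracted page texts and an injected OCR callable instead of a path) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Source(Enum):
    TEXT = "text"
    OCR = "ocr"


@dataclass(frozen=True)
class PageRoute:
    index: int
    source: Source
    text: str


def looks_digital(text: str, min_chars: int = 20, min_clean_ratio: float = 0.8) -> bool:
    """Hypothetical classifier: enough text, made of mostly ordinary characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # little or no embedded text: likely a scan
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-" for ch in stripped
    )
    return clean / len(stripped) >= min_clean_ratio  # low ratio: likely garbled


def route_pages(
    page_texts: Sequence[str], ocr_page: Callable[[int], str]
) -> list[PageRoute]:
    """Keep embedded text for digital pages; OCR everything else."""
    routes = []
    for index, text in enumerate(page_texts):
        if looks_digital(text):
            routes.append(PageRoute(index, Source.TEXT, text))
        else:
            # Only pages that fail the check pay the render + OCR cost.
            routes.append(PageRoute(index, Source.OCR, ocr_page(index)))
    return routes
```

Injecting ocr_page as a callable is what makes digital-page routing testable without the binary: a unit test can pass a stub and assert it was never called.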

test(core): cover the OCR bridge and router

da07c89

Unit-test JSON parsing, binary location, and digital routing; add binary-backed integration tests for OCR and scanned-page routing, with shared PDF-builder fixtures.

ci: exercise the OCR pipeline in an integration job

05ae2db

Build lcr-ocr and run the integration-marked tests against it on macOS.

sid732 merged commit 27ccd29 into main Jun 18, 2026
6 checks passed

sid732 deleted the feat/ocr-router branch June 23, 2026 03:12
