feat: OCR fallback and page router #5
Merged
Pillow encodes rendered pages to PNG for OCR. Register the 'integration' pytest marker for tests that invoke the built lcr-ocr binary.
Add BoundingBox, OcrLine, the Source enum (text vs ocr), PageRoute, and RouteResult (with joined-text helper).
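For orientation, the types named in this commit could look roughly like the sketch below. Only the type names, the text/ocr split of Source, and the existence of a joined-text helper come from the commit message; every field name, the `joined_text` method name, and the separator default are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    TEXT = "text"  # text extracted directly from the PDF
    OCR = "ocr"    # text recognized from a rendered page image


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout: origin plus size; the real fields and units are not stated.
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class OcrLine:
    # Hypothetical fields for one recognized line.
    text: str
    confidence: float
    box: BoundingBox


@dataclass(frozen=True)
class PageRoute:
    # Hypothetical fields: which page, where its text came from, and the text itself.
    page_index: int
    source: Source
    text: str


@dataclass(frozen=True)
class RouteResult:
    pages: tuple[PageRoute, ...] = ()

    def joined_text(self, separator: str = "\n\n") -> str:
        """Join the per-page text in page order (hypothetical helper name)."""
        return separator.join(page.text for page in self.pages)
```

Frozen dataclasses are used here only so that a result cannot be mutated after routing; the actual models may make a different choice.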
Add Pdf.render_page_png, which rasterizes a page via pypdfium2 and encodes it to PNG bytes for the OCR fallback.
Locate the binary (LCR_OCR_BIN, then PATH, then the bundled dev build), invoke it, and parse its JSON into OcrLine. ocr_png_text OCRs PNG bytes and joins the lines above a confidence threshold.
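The lookup order and the threshold join described in this commit can be sketched as follows. This is not the code from ocr.py: the dev-build path, the JSON shape (`{"lines": [{"text", "confidence"}]}`), the function names, and the 0.5 default threshold are all assumptions, and the sketch does not invoke the binary at all.

```python
import json
import os
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical location of the bundled dev build; the real path is not given in this PR.
DEV_BUILD = Path("ocr/.build/debug/lcr-ocr")


@dataclass(frozen=True)
class OcrLine:
    # Reduced, hypothetical version of the OcrLine model (no bounding box).
    text: str
    confidence: float


def locate_binary(dev_build: Path = DEV_BUILD) -> Optional[Path]:
    """Resolve lcr-ocr in the order LCR_OCR_BIN, PATH, dev build; None if not found."""
    override = os.environ.get("LCR_OCR_BIN")
    if override:
        # Assumption: an explicit override is trusted as given.
        return Path(override)
    on_path = shutil.which("lcr-ocr")
    if on_path:
        return Path(on_path)
    return dev_build if dev_build.is_file() else None


def parse_lines(payload: str) -> list[OcrLine]:
    """Parse the assumed JSON shape into OcrLine objects."""
    data = json.loads(payload)
    return [
        OcrLine(text=item["text"], confidence=float(item["confidence"]))
        for item in data["lines"]
    ]


def join_confident(lines: list[OcrLine], min_confidence: float = 0.5) -> str:
    """Join the text of lines at or above the threshold, one per line."""
    return "\n".join(line.text for line in lines if line.confidence >= min_confidence)
```

Whether the real threshold comparison is strict or inclusive is not stated in the PR; the sketch uses an inclusive one.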
route_pdf classifies each page and either keeps its extracted text or renders and OCRs it, returning a per-page RouteResult.
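The per-page decision can be illustrated with a small stand-in that takes already-extracted page text and an OCR callback. The classifier below is a guess at what "digital" versus "scanned/garbled" might mean (enough characters, mostly alphanumeric or whitespace); the PR does not describe the real heuristic, and `route_pages`, `looks_digital`, and their parameters are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Source(Enum):
    TEXT = "text"
    OCR = "ocr"


@dataclass(frozen=True)
class PageRoute:
    # Hypothetical fields; the real PageRoute may carry more.
    page_index: int
    source: Source
    text: str


def looks_digital(text: str, min_chars: int = 20, min_readable_ratio: float = 0.8) -> bool:
    """Hypothetical classifier: enough text, and most of it alphanumeric or whitespace."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) >= min_readable_ratio


def route_pages(extracted: Sequence[str], ocr_page: Callable[[int], str]) -> list[PageRoute]:
    """Keep extracted text for digital pages; call ocr_page(index) for the rest."""
    routes = []
    for index, text in enumerate(extracted):
        if looks_digital(text):
            routes.append(PageRoute(index, Source.TEXT, text))
        else:
            routes.append(PageRoute(index, Source.OCR, ocr_page(index)))
    return routes
```

Passing the OCR step in as a callback keeps the sketch runnable without the binary and means OCR only runs for pages that fail the classifier.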
Unit-test JSON parsing, binary location, and digital routing; add binary-backed integration tests for OCR and scanned-page routing, with shared PDF-builder fixtures.
Build lcr-ocr and run the integration-marked tests against it on macOS.
Completes the routing core: classify each PDF page, then either keep its embedded text or render it and run on-device OCR. End to end, a PDF now becomes per-page text tagged with its source.
What
- route_pdf(path) → RouteResult. Digital pages keep extracted text (Source.TEXT); scanned/garbled pages are rendered and OCR'd (Source.OCR).
- ocr.py: locates lcr-ocr (LCR_OCR_BIN → PATH → bundled dev build), invokes it, parses JSON into OcrLine. ocr_png_text joins lines above a confidence threshold.
- Pdf.render_page_png: rasterizes a page via pypdfium2 and encodes PNG with Pillow.
- New types: BoundingBox, OcrLine, Source, PageRoute, RouteResult.
Tests
- Integration tests (@pytest.mark.integration, skipped if the binary is absent): real OCR of a rendered image, and a scanned image-only PDF routed through OCR. Fixtures build text and image-only PDFs in-process.
- Builds lcr-ocr and runs the marked tests against it.
Deps
Adds pillow (PNG encoding of rendered pages).
Verified locally: ruff, ruff format, mypy (strict), and pytest (22 tests, including 2 real-OCR integration tests) all pass.