feat: PDF text extraction and per-page classification by sid732 · Pull Request #4 · sid732/LocalContextRouter

sid732 · 2026-06-18T16:12:25Z

Wires the page classifier to real PDFs by extracting each page's embedded text layer and classifying it.

What

Pdf: a context-managed handle over a pypdfium2 document (len, page_text, page_texts) that releases native resources on exit.
classify_pdf(path): extracts every page's text and returns one Classification per page.
Both exported from the package root.
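
To illustrate the shape of these two additions, here is a minimal sketch, not the code from this PR. The method names len, page_text, and page_texts and the DIGITAL/SCANNED values come from the description above. The Classification enum layout, the classify_text helper, the rule that any non-whitespace text means DIGITAL, and the lazy import are assumptions made for the example. The pypdfium2 calls used (PdfDocument, get_textpage, get_text_range, close) are from its v4 API.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class Classification(Enum):
    """Stand-in for the project's per-page result (layout is assumed)."""

    DIGITAL = "digital"
    SCANNED = "scanned"


def classify_text(text: str) -> Classification:
    # Assumed rule: a page with any non-whitespace text layer is DIGITAL.
    return Classification.DIGITAL if text.strip() else Classification.SCANNED


class Pdf:
    """Context-managed handle over a pypdfium2 document (illustrative)."""

    def __init__(self, path: str | Path) -> None:
        # Imported lazily so the native PDFium library loads only when needed.
        import pypdfium2 as pdfium

        self._doc = pdfium.PdfDocument(str(path))

    def __enter__(self) -> Pdf:
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def close(self) -> None:
        # Release the native document handle.
        self._doc.close()

    def __len__(self) -> int:
        return len(self._doc)

    def page_text(self, index: int) -> str:
        # Pages and text pages hold native resources too, so close both.
        page = self._doc[index]
        try:
            textpage = page.get_textpage()
            try:
                return textpage.get_text_range()
            finally:
                textpage.close()
        finally:
            page.close()

    def page_texts(self) -> list[str]:
        return [self.page_text(i) for i in range(len(self))]


def classify_pdf(path: str | Path) -> list[Classification]:
    # One Classification per page, in page order.
    with Pdf(path) as pdf:
        return [classify_text(text) for text in pdf.page_texts()]
```

Closing the page and text page in finally blocks keeps native handles from outliving the call even when extraction raises, which is the same guarantee the with block gives for the document itself.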

Why pypdfium2

pypdfium2 is permissively licensed and ships its own native PDFium library, so users do not need a system-wide poppler install. It also renders pages, which the OCR-fallback step will reuse.

Tests

Fixtures are generated in-process with fpdf2 (dev dependency). The tests cover text extraction, multi-page iteration, text pages classified DIGITAL, and a blank page classified SCANNED.
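
A fixture builder along these lines could produce such PDFs. This is a sketch under assumptions: the make_pdf name and its signature are hypothetical, and the actual test helpers in this PR may differ. It uses fpdf2's FPDF, add_page, set_font, cell, and output.

```python
from __future__ import annotations

from pathlib import Path


def make_pdf(path: str | Path, pages: list[str]) -> None:
    """Write a PDF with one page per entry; an empty string gives a blank page."""
    # Imported lazily because fpdf2 is only a dev dependency.
    from fpdf import FPDF

    pdf = FPDF()
    for text in pages:
        pdf.add_page()
        if text:
            # Core font, so no font file is needed on disk.
            pdf.set_font("Helvetica", size=12)
            pdf.cell(0, 10, text)
    pdf.output(str(path))
```

A test would then call something like make_pdf(tmp_path / "doc.pdf", ["Hello", ""]) and assert that the first page classifies as DIGITAL and the second as SCANNED.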

Notes

Added a mypy override for pypdfium2 (no bundled type stubs).

Verified locally: ruff, ruff format, mypy (strict), pytest (15) all pass.

sid732 added 3 commits June 18, 2026 12:12

build(deps): add pypdfium2 for PDF text extraction

2fe6c88

Use pypdfium2 (permissively licensed, ships its own native library) so there is no system poppler dependency. Add fpdf2 as a dev dependency for test fixtures and a mypy override for the stub-less binding.

feat(core): extract and classify PDF page text

e48c0d6

Add the Pdf handle (context-managed pypdfium2 document) and classify_pdf, which pulls each page's embedded text layer and runs it through the page classifier.

test(core): cover PDF extraction and classification

9f5a5ed

Build PDFs with fpdf2 and assert text extraction, page iteration, digital text pages, and a blank page classified as scanned.

sid732 merged commit 16c920c into main Jun 18, 2026
5 checks passed

sid732 deleted the feat/pdf-extraction branch June 23, 2026 03:12
