diff --git a/.claude/skills/local-context-router/SKILL.md b/.claude/skills/local-context-router/SKILL.md index e7d47b8..8845632 100644 --- a/.claude/skills/local-context-router/SKILL.md +++ b/.claude/skills/local-context-router/SKILL.md @@ -2,55 +2,57 @@ name: local-context-router description: >- Preflight a PDF, scan, or screenshot locally before sending it to the model. - Extracts the embedded text layer for free, OCRs image-only pages on-device - with Apple Vision, and flags only genuinely visual pages (tables, charts, - diagrams) for the vision model — cutting vision-token cost. Use whenever the - user shares a PDF or image to read, summarize, or extract from. + Extracts the embedded text layer, OCRs image-only pages on-device with Apple + Vision, and flags genuinely visual pages (tables, charts, diagrams) for the + vision model, which cuts vision-token cost. Use whenever the user shares a PDF + or image to read, summarize, or extract from. --- # Local Context Router -Multimodal models read a PDF by extracting its text *and* rendering every page -to an image, billing for both. For text-heavy pages that is a 2–10× token tax -for no added signal. This skill spends cheap local compute first and only pays -for vision when a page's meaning actually lives in its pixels. +A multimodal model reads a PDF by extracting its text and rendering every page to +an image, then paying for both. On a page that is mostly prose, the image is +wasted spend. Run this preflight first and send the model only what each page +needs. ## When to use -Use this **before** attaching a PDF, scan, or screenshot to the conversation — -whenever the user wants you to read, summarize, or extract from a document. +Before reading, summarizing, or extracting from a PDF, scan, or screenshot the +user has shared. -## How to run +## Requirements -Run the preflight script on the file. 
It picks the cheapest faithful source per -page and prints the result as JSON: +The `localcontextrouter` package must be installed (`pip install localcontextrouter`, +macOS). It provides the `localctx` command used below. + +## Run + +Route the document and read the JSON, rendering any visual pages into a folder: ```sh -python "${CLAUDE_SKILL_DIR}/scripts/preflight.py" --json --vision-dir "${CLAUDE_SKILL_DIR}/.cache" +localctx --json --vision-dir ./lcr-pages ``` -- `` is the PDF or image to analyze. -- `--vision-dir` is where rendered images of visual pages are written. +If `localctx` is not on the PATH, run the bundled script by its path inside this +skill folder instead: + +```sh +python scripts/preflight.py --json --vision-dir ./lcr-pages +``` -## How to use the result +## Use the result -The JSON has a `pages` array and a `tokens_saved` total. For each page: +The JSON has `tokens_saved` and a `pages` array. Each page carries `source`, +`text`, `text_tokens`, `image_tokens`, and `image`: -- **`source: "text"`** — use the page's `text` directly. Do **not** attach the - image; it adds cost without information. -- **`source: "ocr"`** — the page was image-only and has been OCR'd on-device; - use the returned `text`. -- **`source: "vision"`** — the page is a table, chart, or diagram whose meaning - is visual. Attach the rendered image at `image` to the conversation so the - vision model can read it. The `text` is a rough fallback only. +- `source: "text"`: use `text` directly; do not attach the image. +- `source: "ocr"`: the page was image-only and has been OCR'd on-device; use `text`. +- `source: "vision"`: the page is a table, chart, or diagram; attach the image at + `image` so the model can read it. The `text` is a rough fallback only. -Assemble the per-page text in order for the parts you can read as text, and -attach images only for the `vision` pages. Mention `tokens_saved` if the user -cares about cost. 
+Assemble the text and OCR pages in reading order, attach images only for the +vision pages, and mention `tokens_saved` if the user cares about cost. ## Notes -- Everything runs locally and offline; no document leaves the machine during - preflight. -- Requires macOS (on-device OCR uses Apple Vision) and the `localcontextrouter` - package importable by the Python interpreter. +Everything runs locally and offline; the document does not leave the machine. diff --git a/.gitignore b/.gitignore index 8e80295..f35ccae 100644 --- a/.gitignore +++ b/.gitignore @@ -38,5 +38,5 @@ src/localcontextrouter/_bin/ /tmp/ *.log -# Claude Code local (user-specific) settings +# Local agent settings (user-specific) .claude/settings.local.json diff --git a/CHANGELOG.md b/CHANGELOG.md index 30ecd0b..bf048fb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -21,15 +21,17 @@ First release. estimate, following each provider's documented tokenization. - `route_pdf`, which routes each page to text, OCR, or vision and reports the tokens saved versus sending every page as an image. -- Routed text is normalized — stray control characters (e.g. PDF discretionary - hyphens) are stripped and line endings collapsed — while classification still +- Routed text is normalized: stray control characters (such as PDF discretionary + hyphens) are stripped and line endings collapsed, while classification still runs on the raw text layer. - `localctx` command-line interface. - A `local-context-router` Agent Skill for Claude Code and Codex. ### Notes -- macOS only; OCR uses the Apple Vision framework. +- macOS only; OCR uses the Apple Vision framework and needs a normal macOS + graphics environment, so it will not run inside a headless sandbox that lacks + one. - The macOS wheel is a `universal2` platform wheel that bundles the `lcr-ocr` binary, so OCR works out of the box. `LCR_OCR_BIN` overrides the bundled copy. 
diff --git a/README.md b/README.md index d64ee92..7297713 100644 --- a/README.md +++ b/README.md @@ -1,28 +1,31 @@ # LocalContextRouter -> Stop paying the vision-token tax. Decide locally — text, OCR, or vision — *before* a document ever reaches a multimodal LLM. +Decide locally how each page of a document should reach a multimodal model: +as extracted text, on-device OCR, or a rendered image. That keeps you from +paying for vision tokens on pages that are only text. -LocalContextRouter is a **preflight layer** for document-heavy LLM workflows. Given a -PDF or image, it inspects the content on your machine and decides the cheapest path -that still preserves accuracy: +A multimodal model reads a PDF by pulling its text *and* rendering every page to +an image, then billing for both. On a text page that image runs roughly +1,300 to 4,800 tokens while the same page as plain text is 400 to 800. For a +text-dominant document that is several times the cost for nothing extra. +LocalContextRouter does the cheap work on your machine first and tells you what +each page actually needs. -- **Text-layer PDF** → extract text locally (near-free). -- **Scanned / image-only page** → OCR on-device with Apple Vision. -- **Chart / table / diagram / layout-heavy page** → keep the page as an image for the vision model, where the pixels actually carry meaning. +It does not call a model. It returns a per-page decision and the text to send; +your application still makes the call. -It never calls an LLM itself. It prepares the cheapest faithful context and hands you -back a routing decision plus a token-savings estimate. Your application still owns the -model call. +## How it decides -## Why +For each page: -Multimodal models read a PDF by extracting its text *and* rendering every page to an -image, then billing for both. A text-heavy page sent as an image can cost -**1,300–4,800 tokens**; the same page as extracted text costs **400–800**. 
For -text-dominant documents that is a 2–10× tax for zero added signal. +- A usable text layer that is mostly prose: use the extracted text. +- A text layer dominated by a table, chart, or diagram: send the page as an + image, where the layout carries the meaning. +- No usable text, such as a scan or a photo: recognize it on-device with + Apple's Vision framework. -LocalContextRouter spends cheap local compute to avoid that tax — and only escalates -to vision when the page genuinely needs it. +The result also reports how many tokens you saved against sending every page as +an image. ## Install @@ -30,48 +33,79 @@ to vision when the page genuinely needs it. pip install localcontextrouter ``` -The macOS wheel bundles the on-device OCR binary (`lcr-ocr`, a universal2 build), -so OCR works out of the box — no extra setup. To override it (e.g. a locally built -binary), set `LCR_OCR_BIN` to its path. +macOS only. The wheel bundles a universal (Apple Silicon and Intel) OCR binary, +so text recognition works with no extra setup. -## Use +## Command line -There is no server and no background process — everything runs on demand and exits. +```sh +localctx invoice.pdf +localctx invoice.pdf --json +localctx scan.png +``` -### Command line +`localctx invoice.pdf` prints each page, the source chosen for it, and the +tokens saved: -```sh -localctx report.pdf # human summary + tokens saved -localctx report.pdf --json # machine-readable -localctx report.pdf --vision-dir ./out # render visual pages to ./out +``` +Document: invoice.pdf (3 pages) +Tokens saved vs sending every page as an image: 3085 + +Page 1 [text] +ACME Corp, Invoice #4471 ... + +Page 2 [vision] +Quarterly results by segment ... + +Page 3 [ocr] +SCANNED RECEIPT TOTAL 42.00 ``` -### Library +Add `--vision-dir DIR` to render the pages that should go to the model as images +into `DIR`; their paths are then listed in the output and the JSON. 
+ +## In code ```python from localcontextrouter import route_pdf, Source -result = route_pdf("report.pdf") +result = route_pdf("invoice.pdf") for page in result.pages: if page.source is Source.VISION: - ... # send the rendered page image to the model + send_image(page.index) # the page's meaning is visual else: - ... # use page.text (extracted or OCR'd) + send_text(page.text) # extracted or recognized text -print(result.text) # all text-routable pages joined -print(result.tokens_saved) # tokens avoided vs sending every page as an image +print(result.tokens_saved) ``` -### Agent Skill +Every page also carries an estimate of its cost both ways, as +`page.tokens.text_tokens` and `page.tokens.image_tokens`. + +## As an agent skill + +`local-context-router` is an Agent Skill in the open `SKILL.md` format, so it +works in Claude Code and other compatible agents. It lives in this repository +under `.claude/skills/local-context-router`; copy that folder into your agent's +skills directory: + +```sh +cp -r .claude/skills/local-context-router ~/.claude/skills/ +``` -The `local-context-router` skill (in `.claude/skills/`) runs the same preflight -inside Claude Code or Codex — copy it into your `.claude/skills/` (or `~/.claude/skills/`). +With the package installed, the agent runs the preflight on any PDF or image you +share, then uses the text for the cheap pages and attaches images only for the +visual ones. -## Requirements +## Requirements and scope -- macOS 10.15+ (on-device OCR uses the Apple Vision framework) -- Python 3.10+ +- macOS 11 or newer. Recognition uses the Apple Vision framework and needs a + normal macOS graphics environment; it will not run inside a headless sandbox + that lacks one. +- Python 3.10 or newer. +- The scope is per-page routing, on-device OCR, and a token estimate. Retrieval + over very large documents is out of scope. ## License -[MIT](LICENSE) © 2026 Siddharth Nashikkar +MIT. See [LICENSE](LICENSE). 
diff --git a/ocr/README.md b/ocr/README.md index 6a5804f..ad1312b 100644 --- a/ocr/README.md +++ b/ocr/README.md @@ -1,7 +1,7 @@ # lcr-ocr On-device OCR binary used by LocalContextRouter. Wraps the Apple Vision -framework — fully offline, no network, no entitlements, and no Screen Recording +framework, fully offline, no network, no entitlements, and no Screen Recording permission (it reads image files you pass in, it does not capture the screen). ## Build @@ -59,9 +59,9 @@ Follows the `sysexits.h` convention so callers can branch on failure mode: ## Layout -- `Sources/LCROCR` — reusable library: image loading, the Vision engine, and the result models. -- `Sources/lcr-ocr` — thin CLI over the library. -- `Tests/LCROCRTests` — engine tests that render text in-process (no binary fixtures). +- `Sources/LCROCR`, reusable library: image loading, the Vision engine, and the result models. +- `Sources/lcr-ocr`, thin CLI over the library. +- `Tests/LCROCRTests`, engine tests that render text in-process (no binary fixtures). ## Requirements diff --git a/ocr/Sources/LCROCR/ImageLoading.swift b/ocr/Sources/LCROCR/ImageLoading.swift index c8f0cee..df8ab60 100644 --- a/ocr/Sources/LCROCR/ImageLoading.swift +++ b/ocr/Sources/LCROCR/ImageLoading.swift @@ -20,7 +20,7 @@ public enum ImageLoadError: Error, CustomStringConvertible { /// Loads bitmaps from disk into `CGImage` using ImageIO. /// /// ImageIO is used instead of AppKit so the binary runs headless (no window -/// server) — important for CI and for invocation from a CLI. +/// server), important for CI and for invocation from a CLI. public enum ImageLoader { /// Decode the first image in the file at `path`. 
public static func loadCGImage(atPath path: String) throws -> CGImage { diff --git a/pyproject.toml b/pyproject.toml index 7aaf2c0..61b12ac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "localcontextrouter" -description = "Preflight router that picks the cheapest faithful path — text, OCR, or vision — before a document reaches a multimodal LLM." +description = "Preflight router that picks the cheapest faithful path (text, OCR, or vision) before a document reaches a multimodal LLM." readme = "README.md" requires-python = ">=3.10" license = "MIT" diff --git a/src/localcontextrouter/__init__.py b/src/localcontextrouter/__init__.py index 0e1d024..091284e 100644 --- a/src/localcontextrouter/__init__.py +++ b/src/localcontextrouter/__init__.py @@ -1,4 +1,4 @@ -"""LocalContextRouter — cheapest faithful path for documents bound for a multimodal LLM.""" +"""LocalContextRouter, cheapest faithful path for documents bound for a multimodal LLM.""" from .classify import classify_text, compute_signals from .detect import is_vision_worthy diff --git a/src/localcontextrouter/classify.py b/src/localcontextrouter/classify.py index b66be3a..63f3cfc 100644 --- a/src/localcontextrouter/classify.py +++ b/src/localcontextrouter/classify.py @@ -5,7 +5,7 @@ absent (:class:`PageClass.SCANNED`), or present but broken (:class:`PageClass.GARBLED`). The two latter cases route to OCR downstream. -Thresholds are deliberately conservative — when in doubt the page is sent to +Thresholds are deliberately conservative, when in doubt the page is sent to OCR, since a wrong "digital" verdict silently feeds garbage to the model. 
""" diff --git a/src/localcontextrouter/cli.py b/src/localcontextrouter/cli.py index 3744142..6e0d6d0 100644 --- a/src/localcontextrouter/cli.py +++ b/src/localcontextrouter/cli.py @@ -1,4 +1,4 @@ -"""``localctx`` — route a document and report the cheapest faithful source per page.""" +"""``localctx``, route a document and report the cheapest faithful source per page.""" from __future__ import annotations diff --git a/src/localcontextrouter/detect.py b/src/localcontextrouter/detect.py index 76014c9..85ac8f4 100644 --- a/src/localcontextrouter/detect.py +++ b/src/localcontextrouter/detect.py @@ -3,7 +3,7 @@ Some pages carry a perfectly good text layer yet still lose their meaning when flattened to text: tables, charts, diagrams, and figure-heavy layouts. Those are worth the vision-token cost. This module decides that from cheap layout features -(:class:`~.models.PageFeatures`) — no rendering and no ML. +(:class:`~.models.PageFeatures`), no rendering and no ML. """ from __future__ import annotations diff --git a/src/localcontextrouter/models.py b/src/localcontextrouter/models.py index daa9dd2..718383a 100644 --- a/src/localcontextrouter/models.py +++ b/src/localcontextrouter/models.py @@ -10,13 +10,13 @@ class PageClass(str, Enum): """How a PDF page should be sourced before it reaches an LLM.""" DIGITAL = "digital" - """A usable embedded text layer is present — extract the text directly.""" + """A usable embedded text layer is present, extract the text directly.""" SCANNED = "scanned" - """Little or no text layer — the page is image-only and needs OCR.""" + """Little or no text layer, the page is image-only and needs OCR.""" GARBLED = "garbled" - """A text layer exists but is broken (unmapped glyphs) — OCR is safer.""" + """A text layer exists but is broken (unmapped glyphs), OCR is safer.""" @dataclass(frozen=True) @@ -74,7 +74,7 @@ class Source(str, Enum): """Produced by on-device OCR after rendering the page.""" VISION = "vision" - """Send the page to a vision 
model — its meaning lives in the visuals.""" + """Send the page to a vision model, its meaning lives in the visuals.""" @dataclass(frozen=True) diff --git a/src/localcontextrouter/ocr.py b/src/localcontextrouter/ocr.py index 45ffdc6..445d21d 100644 --- a/src/localcontextrouter/ocr.py +++ b/src/localcontextrouter/ocr.py @@ -120,7 +120,7 @@ def ocr_png_text( ) -> str: """OCR a PNG given as bytes; return the recognized lines joined by newlines. - Lines below ``min_confidence`` are dropped — useful for filtering the + Lines below ``min_confidence`` are dropped, useful for filtering the low-confidence glyphs that icons and logos tend to produce. """ with tempfile.NamedTemporaryFile(suffix=".png") as tmp: diff --git a/src/localcontextrouter/router.py b/src/localcontextrouter/router.py index 1869bdf..4761eda 100644 --- a/src/localcontextrouter/router.py +++ b/src/localcontextrouter/router.py @@ -1,7 +1,7 @@ """Route each PDF page to the cheapest faithful source: text, OCR, or vision. - Digital pages keep their extracted text, unless their meaning lives in visuals - (tables, charts, diagrams) — those go to a vision model. + (tables, charts, diagrams), those go to a vision model. - Scanned or garbled pages are rendered and sent to OCR. Every page carries a token estimate so the savings of avoiding the image path diff --git a/src/localcontextrouter/text.py b/src/localcontextrouter/text.py index ebf8a95..fc8295c 100644 --- a/src/localcontextrouter/text.py +++ b/src/localcontextrouter/text.py @@ -1,6 +1,6 @@ """Text normalization for routed output. -Applied to the text a page contributes to the model — not before +Applied to the text a page contributes to the model, not before classification, which relies on seeing control and replacement characters to spot a broken text layer. """
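
The `SKILL.md` hunk above describes the JSON contract: a `tokens_saved` total and a `pages` array whose entries carry `source`, `text`, `text_tokens`, `image_tokens`, and `image`. A minimal sketch of how a caller might consume that output follows. Only the field names and the three `source` values come from the diff. The sample values, the `lcr-pages/page-2.png` path, the `null` image on non-vision pages, and the helper name `split_context` are assumptions made for illustration, not the package's API.

```python
import json

# Hypothetical sample shaped like the JSON the SKILL.md hunk describes.
# Field names come from the diff; every value here is invented.
SAMPLE = """
{
  "tokens_saved": 3085,
  "pages": [
    {"source": "text", "text": "ACME Corp, Invoice #4471",
     "text_tokens": 6, "image_tokens": 1300, "image": null},
    {"source": "vision", "text": "Quarterly results by segment",
     "text_tokens": 5, "image_tokens": 1300, "image": "lcr-pages/page-2.png"},
    {"source": "ocr", "text": "SCANNED RECEIPT TOTAL 42.00",
     "text_tokens": 7, "image_tokens": 1300, "image": null}
  ]
}
"""


def split_context(result: dict) -> tuple[str, list[str]]:
    """Return (joined text of text/OCR pages, image paths of vision pages)."""
    texts: list[str] = []
    images: list[str] = []
    for page in result["pages"]:  # assumed to already be in reading order
        if page["source"] == "vision" and page.get("image"):
            # The page's meaning is visual: attach the rendered image.
            images.append(page["image"])
        else:
            # Extracted or OCR'd text; also the rough fallback for a vision
            # page that has no rendered image (no --vision-dir given).
            texts.append(page["text"])
    return "\n\n".join(texts), images


if __name__ == "__main__":
    result = json.loads(SAMPLE)
    text, images = split_context(result)
    print(text)
    print("attach:", images)
    print("tokens saved:", result["tokens_saved"])
```

Falling back to `text` when a vision page has no rendered image mirrors the note in the skill that the text is "a rough fallback only"; a stricter caller might prefer to raise instead.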