fix: strip control characters from routed page text by sid732 · Pull Request #9 · sid732/LocalContextRouter

sid732 · 2026-06-23T00:40:13Z

Some PDFs encode discretionary hyphens (and similar artifacts) as control characters such as U+0002. They survived text extraction and leaked into the model's input — e.g. a resume produced observ�ability and credential�abuse.

Fix

Add clean_text: strips control characters (keeping newlines and tabs) and collapses \r\n/\r to \n.
The router applies it to the text each page contributes. Classification still runs on the raw text layer, so garbled-page detection (which depends on seeing control and replacement characters) is unaffected.
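A minimal sketch of how the two steps above could fit together. Only `clean_text` is named in this PR, and its body here is a guess at the described behavior, not the merged code. It assumes "control character" means Unicode general category Cc. `looks_garbled`, `RoutedPage`, `route_page`, and the 5% threshold are hypothetical, for illustration only.

```python
import unicodedata
from dataclasses import dataclass

KEEP = {"\n", "\t"}  # control characters that must survive cleaning


def clean_text(text: str) -> str:
    """Normalize line endings, then drop control characters except newline and tab."""
    # Replace CRLF before lone CR so "\r\n" becomes one newline, not two.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: "control character" means Unicode general category Cc.
    return "".join(ch for ch in text if ch in KEEP or unicodedata.category(ch) != "Cc")


def looks_garbled(raw: str, threshold: float = 0.05) -> bool:
    """Hypothetical detector: share of control/replacement characters in the raw layer."""
    if not raw:
        return False
    suspicious = sum(
        1
        for ch in raw
        if ch == "\ufffd"
        or (ch not in KEEP and ch != "\r" and unicodedata.category(ch) == "Cc")
    )
    return suspicious / len(raw) > threshold


@dataclass(frozen=True)
class RoutedPage:
    garbled: bool
    text: str


def route_page(raw: str) -> RoutedPage:
    # Classify on the raw layer first, then emit the cleaned text.
    return RoutedPage(garbled=looks_garbled(raw), text=clean_text(raw))
```

The ordering is the point of the sketch: if cleaning ran before classification, a detector like this would never see the control characters it counts.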

Tests

Unit tests cover control-character stripping and line-ending normalization, and confirm that already-clean text and Unicode punctuation (bullets, accents) survive unchanged; a router test asserts that the emitted text equals clean_text of the raw layer.
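The unit tests described above might look roughly like the following pytest-style sketch. The test names and sample strings are illustrative, not taken from the PR, and the `clean_text` body is a stand-in (assuming control characters means Unicode category Cc) so the tests run in isolation.

```python
import unicodedata


def clean_text(text: str) -> str:
    # Stand-in for the PR's clean_text so these tests run in isolation.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return "".join(c for c in text if c in "\n\t" or unicodedata.category(c) != "Cc")


def test_strips_control_characters() -> None:
    assert clean_text("observ\x02ability") == "observability"


def test_keeps_newlines_and_tabs() -> None:
    assert clean_text("a\tb\nc") == "a\tb\nc"


def test_normalizes_line_endings() -> None:
    assert clean_text("one\r\ntwo\rthree\n") == "one\ntwo\nthree\n"


def test_preserves_clean_text_and_unicode_punctuation() -> None:
    text = "• résumé: naïve café"
    assert clean_text(text) == text
```

The router test mentioned above is omitted here because it depends on the router's own API, which this page does not show.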

Verified on a real resume: observ�ability -> observability, no CR or control characters in the output.

Verified locally: ruff, ruff format, mypy (strict), pytest (55) all pass.

Some PDFs encode discretionary hyphens and similar artifacts as control characters (e.g. U+0002) that leaked into the model's input. Normalize the text each page contributes — drop control characters, keep newlines and tabs, collapse CR/CRLF — while classification still runs on the raw layer so garbled detection is unaffected.

sid732 added 3 commits June 22, 2026 20:39

test(core): cover text normalization

1a8cefb

Unit-test control-character stripping, line-ending normalization, and that clean text and Unicode punctuation survive; assert the router emits cleaned text while classifying the raw layer.

docs: note text normalization in the changelog

496996f

sid732 merged commit 52f4828 into main Jun 23, 2026
6 checks passed

sid732 deleted the fix/clean-text branch June 23, 2026 03:12
