fix: strip control characters from routed page text #9

Merged
sid732 merged 3 commits into main from fix/clean-text on Jun 23, 2026

Conversation

@sid732 sid732 commented Jun 23, 2026

Owner

Some PDFs encode discretionary hyphens (and similar artifacts) as control characters such as U+0002. They survived text extraction and leaked into the model's input — e.g. a resume produced observ�ability and credential�abuse.
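For anyone reproducing this, a quick way to spot such characters in extracted text is to scan by Unicode general category. This is only a diagnostic sketch: `find_control_chars` is a hypothetical helper, not part of this PR, and it assumes "control character" means category `Cc` other than newline, tab, and carriage return.

```python
import unicodedata


def find_control_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for each control character in text.

    Hypothetical diagnostic helper. Newline, tab, and carriage return are
    skipped because they are ordinary whitespace in extracted text.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t\r"
    ]


# A discretionary hyphen encoded as U+0002 in the middle of a word.
print(find_control_chars("observ\x02ability"))  # [(6, 'U+0002')]
```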

Fix

  • Add clean_text: strips control characters (keeping newlines and tabs) and collapses \r\n/\r to \n.
  • The router applies it to the text each page contributes. Classification still runs on the raw text layer, so garbled-page detection (which depends on seeing control and replacement characters) is unaffected.
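The two points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: only the name `clean_text` comes from the description, while `RoutedPage`, `looks_garbled`, `route_page`, the `Cc`-category definition of "control character", and the 10% threshold are all hypothetical.

```python
import unicodedata
from dataclasses import dataclass

_KEEP = {"\n", "\t"}  # whitespace controls that must survive cleaning


def clean_text(text: str) -> str:
    """Drop control characters (keeping newlines and tabs); collapse CRLF/CR to LF."""
    # Normalize line endings first so a bare "\r" becomes "\n" instead of vanishing.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: "control character" means Unicode general category Cc.
    return "".join(
        ch for ch in text if ch in _KEEP or unicodedata.category(ch) != "Cc"
    )


@dataclass
class RoutedPage:  # hypothetical result type
    label: str
    text: str


def looks_garbled(raw: str) -> bool:
    """Hypothetical heuristic: a high share of control or U+FFFD characters."""
    if not raw:
        return False
    bad = sum(
        1
        for ch in raw
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\n\t\r")
    )
    return bad / len(raw) > 0.1  # threshold chosen only for illustration


def route_page(raw: str) -> RoutedPage:
    # Classify on the raw layer so the detector still sees the artifacts...
    label = "garbled" if looks_garbled(raw) else "text"
    # ...but emit only the cleaned text downstream.
    return RoutedPage(label=label, text=clean_text(raw))
```

Ordering matters in this sketch: cleaning before classification would hide exactly the characters the garbled-page check looks for.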

Tests

Unit tests for control-character stripping, line-ending normalization, and that clean text and Unicode punctuation (bullets, accents) survive; a router test asserts the emitted text equals clean_text of the raw layer.
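The shape of those tests, as a pytest-style sketch. The test names and inputs are illustrative rather than copied from the suite, and the regex-based `clean_text` and `emit_page_text` here are stand-ins (assumptions) so the snippet runs on its own.

```python
import re

# Stand-in: C0/C1 controls and DEL, excluding tab (\x09) and newline (\x0a).
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def clean_text(text: str) -> str:
    """Stand-in for the real clean_text, written only to make the tests runnable."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL.sub("", text)


def emit_page_text(raw: str) -> str:
    """Hypothetical stand-in for the text the router emits for a page."""
    return clean_text(raw)


def test_strips_control_characters() -> None:
    assert clean_text("observ\x02ability") == "observability"


def test_normalizes_line_endings() -> None:
    assert clean_text("a\r\nb\rc") == "a\nb\nc"


def test_keeps_newlines_and_tabs() -> None:
    assert clean_text("col1\tcol2\nrow") == "col1\tcol2\nrow"


def test_clean_and_unicode_text_survive() -> None:
    text = "\u2022 r\u00e9sum\u00e9 \u2013 na\u00efve"
    assert clean_text(text) == text


def test_router_emits_clean_text_of_raw_layer() -> None:
    raw = "credential\x02abuse\r\n"
    assert emit_page_text(raw) == clean_text(raw) == "credentialabuse\n"
```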

Verified on a real resume: observ�ability -> observability, no CR or control characters in the output.

Verified locally: ruff, ruff format, mypy (strict), pytest (55) all pass.

sid732 added 3 commits June 22, 2026 20:39
Some PDFs encode discretionary hyphens and similar artifacts as control
characters (e.g. U+0002) that leaked into the model's input. Normalize the
text each page contributes — drop control characters, keep newlines and tabs,
collapse CR/CRLF — while classification still runs on the raw layer so garbled
detection is unaffected.

Unit-test control-character stripping, line-ending normalization, and that
clean text and Unicode punctuation survive; assert the router emits cleaned
text while classifying the raw layer.
@sid732 sid732 merged commit 52f4828 into main Jun 23, 2026
6 checks passed
@sid732 sid732 deleted the fix/clean-text branch June 23, 2026 03:12