fix: strip control characters from routed page text #9
Merged
Some PDFs encode discretionary hyphens and similar artifacts as control characters (e.g. U+0002) that leaked into the model's input. Normalize the text each page contributes — drop control characters, keep newlines and tabs, collapse CR/CRLF — while classification still runs on the raw layer so garbled detection is unaffected.
Unit-test control-character stripping, line-ending normalization, and that clean text and Unicode punctuation survive; assert the router emits cleaned text while classifying the raw layer.
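The normalization described above can be sketched as follows. This is a minimal illustration, not the merged implementation: the name `clean_text` comes from this PR, but the body is an assumption, in particular the choice to treat the Unicode category `Cc` as "control characters".

```python
import unicodedata


def clean_text(text: str) -> str:
    """Sketch of the described normalization (assumed behavior, not the PR's code)."""
    # Collapse CRLF and bare CR to LF first, so a carriage return becomes a
    # line break instead of being dropped with the other control characters.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters (Unicode category Cc), keeping newlines and tabs.
    return "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


if __name__ == "__main__":
    print(clean_text("observ\x02ability"))  # observability
    print(repr(clean_text("a\r\nb\rc")))  # 'a\nb\nc'
    print(clean_text("• résumé\tok"))  # bullets, accents and tabs survive
```

Filtering on the `Cc` category leaves punctuation such as bullets and accented letters alone, since those fall in other Unicode categories.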
Some PDFs encode discretionary hyphens (and similar artifacts) as control characters such as U+0002. They survived text extraction and leaked into the model's input; for example, a resume produced observ�ability and credential�abuse.

Fix
clean_text: strips control characters (keeping newlines and tabs) and collapses \r\n / \r to \n.

Tests
Unit tests for control-character stripping, line-ending normalization, and that clean text and Unicode punctuation (bullets, accents) survive; a router test asserts the emitted text equals clean_text of the raw layer.

Verified on a real resume: observ�ability -> observability, no CR or control characters in the output.

Verified locally: ruff, ruff format, mypy (strict), pytest (55) all pass.
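The router test described above might look roughly like the sketch below. Everything except the name `clean_text` is hypothetical: `RoutedPage`, `route_page`, the injected `classify` callable, and the stub label are illustrative stand-ins, and the inline `clean_text` is an assumed implementation included only to keep the example self-contained.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable


def clean_text(text: str) -> str:
    # Assumed implementation: normalize line endings, then drop control
    # characters (category Cc) other than newline and tab.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


@dataclass(frozen=True)
class RoutedPage:  # hypothetical result type
    text: str  # cleaned text emitted downstream
    label: str  # classification computed from the raw layer


def route_page(raw: str, classify: Callable[[str], str]) -> RoutedPage:
    # Hypothetical router: classify the raw layer, emit the cleaned text.
    label = classify(raw)
    return RoutedPage(text=clean_text(raw), label=label)


def test_router_emits_clean_text_but_classifies_raw() -> None:
    seen: list[str] = []

    def fake_classify(text: str) -> str:
        seen.append(text)  # record what the classifier was given
        return "stub-label"

    raw = "observ\x02ability\r\n"
    page = route_page(raw, fake_classify)

    assert seen == [raw]  # classifier saw the untouched raw layer
    assert page.text == clean_text(raw) == "observability\n"
    assert page.label == "stub-label"


if __name__ == "__main__":
    test_router_emits_clean_text_but_classifies_raw()
```

Injecting the classifier makes it easy to assert which string it received, which is the property the PR cares about: garbled detection keeps seeing the control characters even though the emitted text no longer contains them.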