fix: crash on inputs containing runs of 4 spaces by lukors · Pull Request #8 · admk/sembr

lukors · 2026-04-20T09:52:14Z

fixes: #7

The processors register their replace_tokens values with the tokenizer via tokenizer.add_tokens(). The 4-space string used for the "\t" -> spaces substitution appears to have no row in the sembr2023 model's embedding matrix, so any input with four or more consecutive spaces produces an out-of-range token ID and crashes with:

IndexError: index out of range in self

Reproducer:

printf 'hello    world' | uvx sembr
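
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not sembr code: ToyTokenizer, embed, and the plain-list "embedding matrix" are hypothetical stand-ins, and the only assumption carried over from the description is that registering a token grows the vocabulary without adding a matching embedding row.

```python
# Toy illustration only: ToyTokenizer and the plain-list "embedding matrix"
# are hypothetical stand-ins, not sembr or tokenizer-library classes.

class ToyTokenizer:
    """Maps known tokens to integer IDs; new tokens get the next free ID."""

    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def add_tokens(self, tokens):
        """Register unseen tokens and return how many were added."""
        added = 0
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        return added

    def encode(self, tokens):
        return [self.vocab[tok] for tok in tokens]


def embed(matrix, token_ids):
    """Look up one embedding row per token ID."""
    return [matrix[i] for i in token_ids]


tokenizer = ToyTokenizer(["hello", "world", "\n"])
# One row per token the toy model was "trained" with.
embedding_matrix = [[0.0, 0.0] for _ in range(3)]

# Registering an extra token grows the vocabulary but not the matrix.
tokenizer.add_tokens([" " * 4])
ids = tokenizer.encode(["hello", " " * 4, "world"])
print(ids)  # [0, 3, 1]: ID 3 has no matching row

try:
    embed(embedding_matrix, ids)
except IndexError as exc:
    print("lookup failed:", exc)
```

In this toy version the lookup fails on a Python list. In the real model the equivalent lookup happens inside the embedding layer, which is where the "index out of range in self" message comes from.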

The fix in this PR is to stop registering the 4-space string with the tokenizer. It was registered only because it was a value in replace_tokens, so I removed it from the dict and handle the tab-to-spaces substitution separately:

- MarkdownProcessor, LaTeXProcessor, PlainTextProcessor: drop the "\t": " " * self.spaces entry from _get_replace_tokens.
- MarkdownProcessor.parse_text: add an explicit text.replace("\t", " " * self.spaces). The other two processors already did this substitution explicitly.
- BaseProcessor: drop the now-dead "if k != '\t'" filter from reverse_replace_tokens.
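
The shape of these changes can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual diff: the class and method names follow the list above, but the method bodies, the "[newline]" entry, and the tokens_to_register helper are hypothetical and stand in for whatever the real processors do.

```python
# Simplified sketch: class and method names follow the PR description, but the
# bodies are assumptions and omit everything else the real processors do.

class BaseProcessor:
    def __init__(self, spaces=4):
        self.spaces = spaces
        self.replace_tokens = self._get_replace_tokens()
        # The dict no longer has a "\t" key, so no special-case filter is needed.
        self.reverse_replace_tokens = {
            v: k for k, v in self.replace_tokens.items()
        }

    def _get_replace_tokens(self):
        # Hypothetical entry; the real mapping is not shown in the PR text.
        return {"\n": "[newline]"}

    def tokens_to_register(self):
        # Hypothetical stand-in for the values passed to tokenizer.add_tokens().
        return list(self.replace_tokens.values())


class MarkdownProcessor(BaseProcessor):
    def parse_text(self, text):
        # Tabs are expanded directly instead of going through replace_tokens.
        text = text.replace("\t", " " * self.spaces)
        for src, dst in self.replace_tokens.items():
            text = text.replace(src, dst)
        return text


proc = MarkdownProcessor(spaces=4)
print(repr(proc.parse_text("a\tb\n")))
print(" " * 4 in proc.tokens_to_register())
```

The point of the design is that tab expansion still happens during parsing, while the 4-space string never reaches the list of tokens registered with the tokenizer.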


admk · 2026-05-21T15:58:52Z

Thanks, merged. I have not tested it, though; I hope it fixes the bug and doesn't introduce regressions. I plan to add tests later.

admk merged commit 98f97fb into admk:main May 21, 2026

admk merged 1 commit into admk:main from lukors:lukors/fix_spaces_crash
