fix(patterns): anchor MP2 regex to prevent catastrophic backtracking#154
fix(patterns): anchor MP2 regex to prevent catastrophic backtracking#154mimran-khan wants to merge 3 commits into
Conversation
keshprad
left a comment
There was a problem hiding this comment.
Requesting changes for the lint failure and the MP2 repeated-token false negative.
| import signal | ||
| import sys |
There was a problem hiding this comment.
make lint currently fails here because signal and sys are imported but unused (ruff F401). Please remove these imports before merge.
There was a problem hiding this comment.
Pls also run make lint to verify any other fixes needed
| for pattern, confidence in MP2_PATTERNS: | ||
| for match in re.finditer(pattern, content, re.IGNORECASE | re.MULTILINE): | ||
| captured = match.group(1) if match.lastindex else match.group(0) | ||
| if len(set(captured.strip())) <= 1: |
There was a problem hiding this comment.
This filter seems broader than the intended markdown-separator suppression. Because it calls captured.strip() before counting unique characters, repeated whitespace-bearing stuffing like "x " * 30 is skipped as a single-character repetition, creating a false negative for MP2. Could we keep suppressing separators such as "=" * 80 without dropping repeated tokens that include spaces?
rng1995
left a comment
There was a problem hiding this comment.
Re-review
The ReDoS fix itself is correct — replacing ((\S)(?!\2).{1,19}?)\1{20,} with (.{2,20}?)\1{20,} removes the catastrophic-backtracking vector while preserving repeated-substring detection, and the bounded-time tests are a good addition.
However, the two items @keshprad raised are still open — no new commit has been pushed since that review (head is still d74083c2), so I'm keeping this at request changes:
- Lint break (ruff F401).
tests/nodes/analyzers/test_mp2_regex_backtracking.pystill importssignalandsys(top of file) but never uses them —make lintwill fail. Please remove them. - MP2 false negative from the new suppression filter. In
static_patterns_memory_poisoning.py:Callingif len(set(captured.strip())) <= 1: continue
.strip()before counting unique characters means space-bearing stuffing such as"x " * 30reduces to a single unique char ({"x"}) and is skipped — a genuine context-stuffing payload would be missed. Consider suppressing only true single-character separators (e.g."=" * 80) without stripping interior whitespace, so repeated multi-char tokens that include spaces are still caught.
rng1995
left a comment
There was a problem hiding this comment.
Re-review — both items from @keshprad's review are still open
I took a fresh look. The head is unchanged since @keshprad's review (d74083c2) and no new commit has been pushed to address the two blocking points, so I'm re-requesting changes to keep them tracked. Full credit to @keshprad for the original findings.
The core ReDoS fix is good — anchoring to (.{2,20}?)\1{20,} removes the nested negative-lookahead-inside-a-backreference-repeat and gives linear-time matching with equivalent detection. No concern there.
1. Lint failure (ruff F401). tests/nodes/analyzers/test_mp2_regex_backtracking.py still has unused imports:
import signal
import sysNeither is referenced anywhere in the file, so make lint fails. Please remove them (and run make lint to catch anything else).
2. MP2 false negative from the new suppression filter. In static_patterns_memory_poisoning.py:
captured = match.group(1) if match.lastindex else match.group(0)
if len(set(captured.strip())) <= 1:
continueCalling .strip() before counting unique characters means space-bearing stuffing like "x " * 30 collapses to a single unique character ({"x"}) and is skipped — a genuine context-window-stuffing payload is then missed. The intent (suppressing separators like "=" * 80) is reasonable, but the filter shouldn't discard repeated tokens that legitimately include whitespace. Consider counting uniqueness on the captured unit without stripping interior spaces.
Happy to re-approve once the lint break and the strip-based false negative are addressed.
The first MP2 pattern used a nested lazy quantifier with backreference
`((\S)(?!\2).{1,19}?)\1{20,}` which causes exponential backtracking on
non-matching inputs. Replace with a simpler `(.{2,20}?)\1{20,}` that
detects repeated substrings without the expensive negative lookahead
inside a repeated group.
Fixes NVIDIA#151
… false positives
The MP2 regex (.{2,20}?)\1{20,} matches single-character repetitions
like "=" * 80 (markdown separators). Add a post-match filter that skips
findings where the captured group contains only one unique character,
as these are formatting separators not context-window stuffing.
Also update test to use multi-char content ("ABCD") instead of
single-char repetition ("AAAA") which is correctly excluded now.
- Remove unused `signal` and `sys` imports (ruff F401) - Replace `.strip()` before unique-char counting with whitespace-aware check: single-char separators like "=" * 80 are still suppressed, but repeated tokens containing interior whitespace (e.g. "x " * 30) are now correctly detected as context-stuffing payloads - Add regression tests for both separator suppression and whitespace- bearing stuffing detection Addresses review feedback from @keshprad and @rng1995. Signed-off-by: mimran-khan <mohammed_imran.khan@outlook.com>
d74083c to
c34f442
Compare
|
Both issues addressed:
Branch rebased on latest |
Summary
The first MP2 pattern (Context Window Stuffing) uses a nested structure
((\S)(?!\2).{1,19}?)\1{20,}that is vulnerable to catastrophic backtracking (ReDoS). On non-matching inputs of moderate size (>500 chars), the regex engine explores exponential backtracking paths due to the lazy.{1,19}?quantifier inside a backreference-repeated group.Changes
static_patterns_memory_poisoning.py: Replace the vulnerable pattern with(.{2,20}?)\1{20,}which detects repeated substrings without the expensive negative lookahead inside a repeated group.tests/nodes/analyzers/test_mp2_regex_backtracking.py— validates both correct detection of repeated content and bounded execution time on adversarial inputs.Before / After
Testing
test_llm_utils.py)Fixes #151