Skip to content

fix(patterns): anchor MP2 regex to prevent catastrophic backtracking#154

Open
mimran-khan wants to merge 3 commits into
NVIDIA:mainfrom
mimran-khan:fix/mp2-regex-backtracking
Open

fix(patterns): anchor MP2 regex to prevent catastrophic backtracking#154
mimran-khan wants to merge 3 commits into
NVIDIA:mainfrom
mimran-khan:fix/mp2-regex-backtracking

Conversation

@mimran-khan

Copy link
Copy Markdown
Contributor

Summary

The first MP2 pattern (Context Window Stuffing) uses a nested structure ((\S)(?!\2).{1,19}?)\1{20,} that is vulnerable to catastrophic backtracking (ReDoS). On non-matching inputs of moderate size (>500 chars), the regex engine explores exponential backtracking paths due to the lazy .{1,19}? quantifier inside a backreference-repeated group.

Changes

  • static_patterns_memory_poisoning.py: Replace the vulnerable pattern with (.{2,20}?)\1{20,} which detects repeated substrings without the expensive negative lookahead inside a repeated group.
  • New test file: tests/nodes/analyzers/test_mp2_regex_backtracking.py — validates both correct detection of repeated content and bounded execution time on adversarial inputs.

Before / After

# Before — exponential backtracking on non-matching input
(r"((\S)(?!\2).{1,19}?)\1{20,}", 0.8)

# After — linear-time matching, same detection capability
(r"(.{2,20}?)\1{20,}", 0.8)

Testing

  • 6 new tests: 3 for correct detection, 3 for bounded execution time
  • Full test suite passes (775 tests, excluding pre-existing upstream issue in test_llm_utils.py)

Fixes #151

@keshprad keshprad left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for the lint failure and the MP2 repeated-token false negative.

Comment on lines +20 to +21
import signal
import sys

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make lint currently fails here because signal and sys are imported but unused (ruff F401). Please remove these imports before merge.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls also run make lint to verify any other fixes needed

for pattern, confidence in MP2_PATTERNS:
for match in re.finditer(pattern, content, re.IGNORECASE | re.MULTILINE):
captured = match.group(1) if match.lastindex else match.group(0)
if len(set(captured.strip())) <= 1:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filter seems broader than the intended markdown-separator suppression. Because it calls captured.strip() before counting unique characters, repeated whitespace-bearing stuffing like "x " * 30 is skipped as a single-character repetition, creating a false negative for MP2. Could we keep suppressing separators such as "=" * 80 without dropping repeated tokens that include spaces?

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review

The ReDoS fix itself is correct — replacing ((\S)(?!\2).{1,19}?)\1{20,} with (.{2,20}?)\1{20,} removes the catastrophic-backtracking vector while preserving repeated-substring detection, and the bounded-time tests are a good addition.

However, the two items @keshprad raised are still open — no new commit has been pushed since that review (head is still d74083c2), so I'm keeping this at request changes:

  1. Lint break (ruff F401). tests/nodes/analyzers/test_mp2_regex_backtracking.py still imports signal and sys (top of file) but never uses them — make lint will fail. Please remove them.
  2. MP2 false negative from the new suppression filter. In static_patterns_memory_poisoning.py:
    if len(set(captured.strip())) <= 1:
        continue
    Calling .strip() before counting unique characters means space-bearing stuffing such as "x " * 30 reduces to a single unique char ({"x"}) and is skipped — a genuine context-stuffing payload would be missed. Consider suppressing only true single-character separators (e.g. "=" * 80) without stripping interior whitespace, so repeated multi-char tokens that include spaces are still caught.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — both items from @keshprad's review are still open

I took a fresh look. The head is unchanged since @keshprad's review (d74083c2) and no new commit has been pushed to address the two blocking points, so I'm re-requesting changes to keep them tracked. Full credit to @keshprad for the original findings.

The core ReDoS fix is good — anchoring to (.{2,20}?)\1{20,} removes the nested negative-lookahead-inside-a-backreference-repeat and gives linear-time matching with equivalent detection. No concern there.

1. Lint failure (ruff F401). tests/nodes/analyzers/test_mp2_regex_backtracking.py still has unused imports:

import signal
import sys

Neither is referenced anywhere in the file, so make lint fails. Please remove them (and run make lint to catch anything else).

2. MP2 false negative from the new suppression filter. In static_patterns_memory_poisoning.py:

captured = match.group(1) if match.lastindex else match.group(0)
if len(set(captured.strip())) <= 1:
    continue

Calling .strip() before counting unique characters means space-bearing stuffing like "x " * 30 collapses to a single unique character ({"x"}) and is skipped — a genuine context-window-stuffing payload is then missed. The intent (suppressing separators like "=" * 80) is reasonable, but the filter shouldn't discard repeated tokens that legitimately include whitespace. Consider counting uniqueness on the captured unit without stripping interior spaces.

Happy to re-approve once the lint break and the strip-based false negative are addressed.

The first MP2 pattern used a nested lazy quantifier with backreference
`((\S)(?!\2).{1,19}?)\1{20,}` which causes exponential backtracking on
non-matching inputs. Replace with a simpler `(.{2,20}?)\1{20,}` that
detects repeated substrings without the expensive negative lookahead
inside a repeated group.

Fixes NVIDIA#151
… false positives

The MP2 regex (.{2,20}?)\1{20,} matches single-character repetitions
like "=" * 80 (markdown separators). Add a post-match filter that skips
findings where the captured group contains only one unique character,
as these are formatting separators not context-window stuffing.

Also update test to use multi-char content ("ABCD") instead of
single-char repetition ("AAAA") which is correctly excluded now.
- Remove unused `signal` and `sys` imports (ruff F401)
- Replace `.strip()` before unique-char counting with whitespace-aware
  check: single-char separators like "=" * 80 are still suppressed, but
  repeated tokens containing interior whitespace (e.g. "x " * 30) are
  now correctly detected as context-stuffing payloads
- Add regression tests for both separator suppression and whitespace-
  bearing stuffing detection

Addresses review feedback from @keshprad and @rng1995.

Signed-off-by: mimran-khan <mohammed_imran.khan@outlook.com>
@mimran-khan mimran-khan force-pushed the fix/mp2-regex-backtracking branch from d74083c to c34f442 Compare June 23, 2026 09:38
@mimran-khan

Copy link
Copy Markdown
Contributor Author

Both issues addressed:

  1. Lint fixed — removed unused signal and sys imports from test_mp2_regex_backtracking.py. ruff check passes clean.

  2. Whitespace-bearing stuffing false negative fixed — replaced len(set(captured.strip())) <= 1 with a whitespace-aware check: we now compute unique non-whitespace chars and only suppress when both (a) there's at most one unique non-whitespace char AND (b) no interior whitespace is present. This means:

    • "=" * 80 → suppressed ✓ (single non-ws char, no spaces)
    • "x " * 30 → detected ✓ (single non-ws char, but has interior whitespace = genuine stuffing)
    • "ab" * 30 → detected ✓ (multiple non-ws chars)
  3. Tests added: test_separator_line_not_detected (separators suppressed) and test_whitespace_bearing_stuffing_detected (space-bearing payloads not suppressed).

Branch rebased on latest main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] MP2 regex vulnerable to catastrophic backtracking (ReDoS)

3 participants