Skip to content

fix(findings): deduplicate cross-analyzer findings before scoring#142

Open
mimran-khan wants to merge 2 commits into
NVIDIA:mainfrom
mimran-khan:fix/cross-analyzer-deduplication-137
Open

fix(findings): deduplicate cross-analyzer findings before scoring#142
mimran-khan wants to merge 2 commits into
NVIDIA:mainfrom
mimran-khan:fix/cross-analyzer-deduplication-137

Conversation

@mimran-khan

Copy link
Copy Markdown
Contributor

Summary

When multiple analyzers detect the same pattern in multiple files of a skill, each occurrence was treated as a completely independent finding. A skill that calls subprocess.run(cmd, shell=True) in 5 modules got 5 TM1 findings, each contributing full points to the risk score — even though it represents the same architectural decision observed in different places.

Changes

  • nodes/deduplicate.py: New module with a pure deduplicate() function implementing two-pass deduplication:
    1. Same-file dedup: identical (rule_id, file, matched_text) → keep highest confidence
    2. Cross-file dedup: identical (rule_id, matched_text) across files → keep highest confidence representative
  • nodes/report.py: Calls deduplicate() on findings before scoring and SARIF generation

Design Decisions

Decision Rationale
Dedup key uses first 100 chars of matched_text Prevents pathologically long matches from consuming memory while maintaining specificity
No cross-file dedup for empty matched_text Without matched text, there's no reliable identity signal
Dedup before scoring (in report node) Preserves raw findings for meta_analyzer LLM analysis, only deduplicates for final output
Sort output by severity then file Stable, predictable report ordering

Behavioral Impact

Scenario Before After
shell=True in 5 files, same pattern 5 × TM1 HIGH findings 1 × TM1 HIGH finding
Different subprocess calls in same file Multiple findings Multiple (different matched_text)
Same rule_id, different patterns All kept All kept (different dedup keys)
Findings without matched_text All kept All kept (no cross-file dedup)

Test Plan

  • 17 unit tests covering same-file dedup, cross-file dedup, no-matched-text preservation, edge cases (empty list, sorting, whitespace normalization, truncation, mixed scenarios, real-world simulation)
  • Full test suite: 693 passed, 0 failures, 0 regressions
  • ruff check and ruff format clean

Fixes #137

@mimran-khan

Copy link
Copy Markdown
Contributor Author

PS: These issues were found while running bunch of skill evals using SkillSpector and hence raising all the issues so that i can use these with correct reference to Upstream code

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: deduplicate cross-analyzer findings before scoring

The deduplicate function itself is correct and very well tested (highest-confidence retention, whitespace normalization, no-matched-text passthrough, stable sort). My concern is how it's wired into report(), so I'm requesting changes.

Blocking: cross-file dedup removes findings from the report, not just from scoring

In report() the deduped list is assigned to filtered_findings and fed to both _compute_risk_score and _build_sarif:

findings = deduplicate(raw_findings)
filtered_findings = findings

The cross-file pass keys on (rule_id, matched_text[:100]) and ignores file, so the same matched pattern across N files collapses to a single finding with a single location. Consequences for a security scanner:

  • The emitted SARIF/JSON shows one affected file; the other N-1 files are silently unrepresented. A consumer (e.g., SARIF code scanning) only annotates one location, which can leave genuinely affected files un-remediated.
  • The PR title scopes this to "before scoring," but the implementation also mutates the reported findings — broader than advertised.
  • "cross-analyzer" in the title implies merging duplicate reports from different analyzers, but the key is a single rule_id, so this actually collapses one rule's matches across files.

Recommendations (either is fine)

  1. Apply dedup to the score computation only and keep the full findings list in the report; or
  2. Aggregate collapsed instances into the surviving finding (retain all file locations / an occurrence count) instead of discarding them, so no affected-file information is lost.

Minor

  • The sort uses severity_order.get(f.severity, 4) without normalizing case. A non-uppercase severity sorts last; use f.severity.upper() to match the rest of the codebase (e.g. _compute_risk_score).
  • matched_text[:100] means two distinct payloads sharing a 100-char prefix collapse cross-file — acceptable as noise reduction but worth a comment given the security context.

Inter-PR note

PR #139 (per-rule diminishing returns) targets the same repeated-pattern saturation via a different mechanism and edits report.py from the same base. Once both land, dedup runs first, so #139's diminishing weights rarely trigger for identical patterns (partial redundancy). Please confirm the combined behavior is intended.

@mimran-khan

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review! I've addressed the blocking concern:

Cross-file dedup now only affects scoring, not the report.

  • deduplicate() is applied only to the findings list passed to _compute_risk_score()
  • The full (non-deduped) findings list is used for SARIF, JSON, markdown, and terminal output — all affected file locations are preserved
  • Added a test (test_report_dedup_affects_score_only_not_report_output) that verifies 4 files with the same pattern all appear in the report while the score remains lower than 4x base points

Also fixed the case sensitivity issue on the severity sort (f.severity.upper()).

The matched_text[:100] prefix collision is kept as-is since it's an acceptable noise reduction tradeoff — added a note in the code comment.

When multiple analyzers detect the same pattern (e.g. subprocess.run with
shell=True) in multiple files of a skill, each occurrence was treated as
a completely independent finding. A skill calling subprocess in 5 modules
got 5 findings, each contributing full points — even though it represents
the same design choice observed in different places.

This commit adds a two-pass deduplication step in the report node:
1. Same-file dedup: identical (rule_id, file, matched_text) collapsed to
   the highest-confidence instance
2. Cross-file dedup: identical (rule_id, matched_text) across different
   files collapsed to one representative finding

Findings without matched_text are never cross-file deduplicated since
they lack a reliable identity signal. Output is sorted by severity then
file for stable, predictable reports.

Fixes NVIDIA#137

Signed-off-by: Mohammed Imran Khan <mohammed_imran.khan@outlook.com>
…ll findings in report

Cross-file deduplication now only affects risk score calculation.
The full findings list (with all affected file locations) is preserved
in SARIF, JSON, markdown, and terminal output so that consumers see
every affected file rather than a single representative.

Also normalizes severity case in the sort key to handle non-uppercase
severity strings consistently.
@mimran-khan mimran-khan force-pushed the fix/cross-analyzer-deduplication-137 branch from 4163b16 to e6620ea Compare June 22, 2026 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Duplicate findings from parallel analyzers are never deduplicated

2 participants