Fix intermittent dali failure #1687

Open
ktangsali wants to merge 2 commits into NVIDIA:main from ktangsali:fix-intermittent-issues

Conversation

@ktangsali (Collaborator) commented May 29, 2026

PhysicsNeMo Pull Request

Description

check_cuda_graphs called next(iter(datapipe)) inside its warmup and record/replay loops. Each call re-entered Datapipe.__iter__, so the underlying DALI pipeline was reset on every step, 8 times per test. The repeated resets exposed a race in the _observer_thread of DALI's multiprocessing pool that intermittently aborted the interpreter.

Fix: build the iterator once outside the loops (data_iter = iter(datapipe)) and advance it with next(data_iter), collapsing 8 resets into 1 and closing the race window.
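
To illustrate the difference between the two patterns, here is a minimal sketch that uses a hypothetical `CountingDatapipe` in place of the real DALI-backed datapipe. The class and the two helper functions are assumptions made for illustration and are not part of PhysicsNeMo; the only thing modeled is that `__iter__` is where the reset would happen.

```python
class CountingDatapipe:
    """Hypothetical stand-in for a DALI-backed datapipe; counts __iter__ calls."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.resets = 0

    def __iter__(self):
        # In the real datapipe, this is where the DALI pipeline would be reset.
        self.resets += 1
        return iter(range(self.num_batches))


def consume_rebuilding(datapipe, steps):
    """Old pattern: build a fresh iterator on every step."""
    return [next(iter(datapipe)) for _ in range(steps)]


def consume_once(datapipe, steps):
    """New pattern: build one iterator and advance it on every step."""
    data_iter = iter(datapipe)
    return [next(data_iter) for _ in range(steps)]


old_pipe = CountingDatapipe(num_batches=16)
old_batches = consume_rebuilding(old_pipe, steps=8)

new_pipe = CountingDatapipe(num_batches=16)
new_batches = consume_once(new_pipe, steps=8)

print(old_pipe.resets, old_batches)  # 8 resets, batch 0 every time
print(new_pipe.resets, new_batches)  # 1 reset, batches 0 through 7
```

The sketch only counts `__iter__` calls; it does not reproduce the DALI race itself.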

Test: before the fix, pytest test/datapipes failed in roughly 5 of 16 local runs. After the fix, all 16 of 16 runs passed.
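
A repeated-run measurement like this could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used for the numbers above: the helper names `count_failures` and `run_datapipe_tests` are hypothetical, and the pytest invocation assumes it is run from the repository root.

```python
import subprocess
import sys


def count_failures(run_once, attempts):
    """Call run_once() `attempts` times and count non-zero exit codes."""
    return sum(1 for _ in range(attempts) if run_once() != 0)


def run_datapipe_tests():
    """Run the datapipe tests in a fresh interpreter and return the exit code.

    A separate process per attempt means an interpreter abort in one run
    shows up as a non-zero return code instead of killing this script.
    """
    result = subprocess.run([sys.executable, "-m", "pytest", "test/datapipes", "-q"])
    return result.returncode
```

Calling `count_failures(run_datapipe_tests, attempts=16)` would then return the number of failing runs out of 16.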

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ktangsali ktangsali requested a review from peterdsharpe May 29, 2026 18:18
@ktangsali ktangsali marked this pull request as ready for review May 29, 2026 18:19
@ktangsali ktangsali requested a review from coreyjadams as a code owner May 29, 2026 18:19
greptile-apps Bot (Contributor) commented May 29, 2026

Greptile Summary

This PR fixes an intermittent SIGABRT in DALI-backed datapipe tests by building the datapipe iterator once (data_iter = iter(datapipe)) instead of calling next(iter(datapipe)) inside each loop body, which was re-entering Datapipe.__iter__ — and for DALI pipelines resetting self.pipe and reconstructing a DALIGenericIterator — up to 8 times per test, widening a race in DALI's _observer_thread.

  • Core fix: data_iter = iter(datapipe) is created once before the warmup and record/replay loops; both loops now advance the same iterator with next(data_iter).
  • Behavioral note: Previously, each next(iter(datapipe)) always returned the first batch of a freshly-reset iterator; now sequential batches are consumed. This is intentional and correct for the test's purpose — CUDA graph capture requires stable memory layout, not repeated identical input.
  • Scope: Change is confined to the test utility check_cuda_graphs; no production code is modified.
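
One general property of Python iterators is worth keeping in mind with the shared-iterator pattern: a single iterator can run out, whereas a freshly built one never does on its first `next`. The sketch below shows this with a plain list standing in for the datapipe; the `take` helper is hypothetical, and whether the datapipes used in these tests can ever yield fewer batches than the loops consume is not stated in this PR.

```python
def take(datapipe, steps):
    """Advance one shared iterator up to `steps` times, stopping early if it is exhausted."""
    data_iter = iter(datapipe)
    batches = []
    for _ in range(steps):
        try:
            batches.append(next(data_iter))
        except StopIteration:
            # The shared iterator has no more batches to give.
            break
    return batches


short_pipe = [10, 20, 30]  # stand-in for a datapipe with only 3 batches

shared = take(short_pipe, steps=8)
rebuilt = [next(iter(short_pipe)) for _ in range(8)]

print(shared)   # stops after the 3 available batches
print(rebuilt)  # the first batch, 8 times
```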

Important Files Changed

| Filename | Overview |
|---|---|
| test/datapipes/common/cuda_graphs.py | Builds the datapipe iterator once outside the warmup and record/replay loops to prevent repeated DALI pipeline resets that were triggering a race condition in DALI's multiprocessing observer thread. |

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix-intermitten..."

@ktangsali (Collaborator, Author) commented

/blossom-ci

@peterdsharpe changed the title from "Fx intermittent dali failure" to "Fix intermittent dali failure" May 29, 2026

@peterdsharpe (Collaborator) left a comment

@ktangsali are you able to consistently reproduce this, or is it intermittent? If it's consistently reproducible, can we add a non-regression test as part of the test suite here?

@ktangsali (Collaborator, Author) commented

> @ktangsali are you able to consistently reproduce this, or is it intermittent? If it's consistently reproducible, can we add a non-regression test as part of the test suite here?

Yes, I did test it for intermittency.

Test: before the fix, pytest test/datapipes failed in roughly 5 of 16 local runs. After the fix, all 16 of 16 runs passed.

What do you mean by non-regression test here?
