
Fix GPU assignment for Slurm-launched Ray clusters #1592

Open
agolajko wants to merge 3 commits into NovaSky-AI:main from agolajko:cuda_devices

Conversation


agolajko (Contributor) commented Apr 29, 2026

Fixes #1577: SkyRL silently binds every FSDP policy worker to GPU 0 when Ray doesn't isolate CUDA_VISIBLE_DEVICES.

Problem

DistributedTorchRayActor.__init__ assumed Ray always narrows CUDA_VISIBLE_DEVICES to a single GPU per actor, making LOCAL_RANK="0" safe. On Slurm clusters, Ray inherits CUDA_VISIBLE_DEVICES="0,1,...,7" from Slurm and never narrows it, so every worker calls set_device(0) and piles onto physical GPU 0, causing silent OOMs.
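
To make the failure concrete, here is a minimal sketch (a hypothetical standalone script, not SkyRL code; assumes an 8-GPU node) of what each actor observes under the two regimes:

```python
import os

import ray

ray.init()


@ray.remote(num_gpus=1)
class Worker:
    def report(self):
        # Ray's logical GPU assignment vs. what CUDA will actually see.
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


workers = [Worker.remote() for _ in range(8)]
print(ray.get([w.report.remote() for w in workers]))
# Normally Ray narrows per actor, e.g. ([3], "3"). Under Slurm every actor
# instead sees ([3], "0,1,2,3,4,5,6,7"), so LOCAL_RANK="0" resolves to
# physical GPU 0 for all of them.
```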

Fix

os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0])
os.environ["LOCAL_RANK"] = "0"

We explicitly narrow CUDA_VISIBLE_DEVICES to the Ray-assigned physical GPU, making LOCAL_RANK="0" unconditionally correct. This replaces the ray_noset_visible_devices() ternary, which only checked whether the user had disabled Ray's narrowing, not whether Ray had actually performed it.
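
For contrast, the replaced logic looked roughly like this (a paraphrase for illustration, not the exact deleted code):

```python
# Old pattern (paraphrased): trust Ray's narrowing unless the user opted out
# via RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES.
os.environ["LOCAL_RANK"] = (
    str(ray.get_gpu_ids()[0]) if ray_noset_visible_devices() else "0"
)
# Slurm falls through the else branch: the user never opted out, yet Ray
# never narrowed CUDA_VISIBLE_DEVICES either, so LOCAL_RANK="0" points at
# physical GPU 0 on every worker.
```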

Scenarios

| Scenario | CVD before init | CVD after fix | LOCAL_RANK |
| --- | --- | --- | --- |
| Ray default | "3" (Ray narrowed) | "3" (no-op) | "0" |
| Ray noset | "0,1,...,7" | "3" | "0" |
| Slurm | "0,1,...,7" | "3" | "0" |
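
A quick way to sanity-check the last two rows on a real allocation (a hedged sketch, not SkyRL code; assumes torch touches CUDA only after the env vars are set):

```python
import os

import ray
import torch


@ray.remote(num_gpus=1)
class FixedWorker:
    def __init__(self):
        # The two assignments from this PR.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0])
        os.environ["LOCAL_RANK"] = "0"

    def check(self):
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        # Each actor should see exactly one device, backed by its own GPU.
        return os.environ["CUDA_VISIBLE_DEVICES"], torch.cuda.device_count()


ray.init()
workers = [FixedWorker.remote() for _ in range(8)]
print(ray.get([w.check.remote() for w in workers]))
# Expected: eight distinct CUDA_VISIBLE_DEVICES values, each actor
# reporting device_count() == 1.
```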


gemini-code-assist[bot]: This comment was marked as resolved.


devin-ai-integration[bot] left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


