
Fix GPU assignment for Slurm-launched Ray clusters #1592

Open
agolajko wants to merge 3 commits into NovaSky-AI:main from agolajko:cuda_devices

Conversation


agolajko (Contributor) commented Apr 29, 2026

Fixes #1577: SkyRL silently binds every FSDP policy worker to GPU 0 when Ray doesn't isolate CUDA_VISIBLE_DEVICES.

Problem

DistributedTorchRayActor.__init__ assumed Ray always narrows CUDA_VISIBLE_DEVICES to a single GPU per actor, making LOCAL_RANK="0" safe. On Slurm clusters, Ray inherits CUDA_VISIBLE_DEVICES="0,1,...,7" from Slurm and never narrows it, so every worker calls set_device(0) and piles onto physical GPU 0, causing silent OOMs.
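
To make the failure concrete, here is a minimal sketch (a hypothetical standalone script, not SkyRL code; assumes an 8-GPU node) of what each actor observes under the two regimes:

```python
import os

import ray

ray.init()


@ray.remote(num_gpus=1)
class Worker:
    def report(self):
        # Ray's logical GPU assignment vs. what CUDA will actually see.
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


workers = [Worker.remote() for _ in range(8)]
print(ray.get([w.report.remote() for w in workers]))
# Normally Ray narrows per actor, e.g. ([3], "3"). Under Slurm every actor
# instead sees ([3], "0,1,2,3,4,5,6,7"), so LOCAL_RANK="0" resolves to
# physical GPU 0 for all of them.
```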

Fix

os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0])
os.environ["LOCAL_RANK"] = "0"

We explicitly narrow CUDA_VISIBLE_DEVICES to the Ray-assigned physical GPU, making LOCAL_RANK="0" unconditionally correct. This replaces the ray_noset_visible_devices() ternary, which only checked whether the user had disabled Ray's narrowing, not whether Ray had actually performed it.
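
For contrast, the replaced logic looked roughly like this (a paraphrase for illustration, not the exact deleted code):

```python
# Old pattern (paraphrased): trust Ray's narrowing unless the user opted out
# via RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES.
os.environ["LOCAL_RANK"] = (
    str(ray.get_gpu_ids()[0]) if ray_noset_visible_devices() else "0"
)
# Slurm falls through the else branch: the user never opted out, yet Ray
# never narrowed CUDA_VISIBLE_DEVICES either, so LOCAL_RANK="0" points at
# physical GPU 0 on every worker.
```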

Scenarios

| Scenario | CVD before init | CVD after fix | LOCAL_RANK |
| --- | --- | --- | --- |
| Ray default | "3" (Ray narrowed) | "3" (no-op) | "0" |
| Ray noset | "0,1,...,7" | "3" | "0" |
| Slurm | "0,1,...,7" | "3" | "0" |
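
A quick way to sanity-check the last two rows on a real allocation (a hedged sketch, not SkyRL code; assumes torch touches CUDA only after the env vars are set):

```python
import os

import ray
import torch


@ray.remote(num_gpus=1)
class FixedWorker:
    def __init__(self):
        # The two assignments from this PR.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0])
        os.environ["LOCAL_RANK"] = "0"

    def check(self):
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        # Each actor should see exactly one device, backed by its own GPU.
        return os.environ["CUDA_VISIBLE_DEVICES"], torch.cuda.device_count()


ray.init()
workers = [FixedWorker.remote() for _ in range(8)]
print(ray.get([w.check.remote() for w in workers]))
# Expected: eight distinct CUDA_VISIBLE_DEVICES values, each actor
# reporting device_count() == 1.
```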


gemini-code-assist[bot]: This comment was marked as resolved.


devin-ai-integration[bot] left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


