fix(onboard): skip CDI GPU mode on Docker Desktop WSL (#5512) #5537
abhi-0906 wants to merge 2 commits
On Docker Desktop + WSL2, onboard's [6/8] Docker GPU patch recreates the sandbox with `--device nvidia.com/gpu=all` (CDI) and fails with "CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all", even though preflight already commits to the `--gpus` compatibility path.

Docker Desktop advertises CDI spec directories, so `dockerReportsNvidiaCdiDevices()` returns true and `buildDockerGpuModeCandidates` offers CDI first; the create-only probe passes but the real recreate fails because the WSL distro exposes no usable `nvidia.com/gpu` spec.

Thread the existing Docker Desktop WSL detection (`isDockerDesktopWslRuntime`, already used to gate the patch) through `selectDockerGpuPatchMode` into `buildDockerGpuModeCandidates`, and skip the CDI candidate when on Docker Desktop WSL so the patch uses `--gpus all`. Native Docker-CDI hosts are unaffected and still prefer CDI (preserving the NVIDIA#4948 gateway supervisor-wiring contract).

Reached only after the [2/8] gateway-bind issue (NVIDIA#5513 / NVIDIA#5534). A follow-up is still needed for the orphaned `*-nemoclaw-gpu-backup-*` container left behind on an early patch failure.

Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>
@abhi-0906 can you add a DCO 'Signed-off-by' to the PR description, please?
Summary
On Docker Desktop + WSL2 with an NVIDIA GPU, onboard's `[6/8]` Docker GPU patch recreates the sandbox container with `--device nvidia.com/gpu=all` (CDI syntax) and fails with "CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all", even though preflight already logs that it will use the `--gpus` compatibility path. The only workaround today is `--no-gpu` / `NEMOCLAW_SANDBOX_GPU=0`, which disables GPU entirely.

Root cause
Docker Desktop advertises CDI spec directories, so `dockerReportsNvidiaCdiDevices()` returns true and `buildDockerGpuModeCandidates()` offers CDI as the first candidate. The create-only probe (`docker create … true`) passes, but the real recreate fails because the WSL distro exposes no usable `nvidia.com/gpu` spec. The Docker Desktop WSL status was detected at preflight but never reached the mode selector: `selectDockerGpuPatchMode` only received `{image, device, backend}`.

PR #5198 (which closed #5180) added the CDI-injection failure classification, the `--no-gpu` recovery hint, and the warning that `NEMOCLAW_DOCKER_GPU_PATCH=0` is ignored on this runtime, but it did not change mode selection. This is the unaddressed root cause.

Fix
Thread the existing Docker Desktop WSL detection (`isDockerDesktopWslRuntime()`, already used to gate the patch) through `selectDockerGpuPatchMode` into `buildDockerGpuModeCandidates`, and skip the CDI candidate when on Docker Desktop WSL so the patch uses `--gpus all`, the path preflight already commits to.

- … `docker-gpu-sandbox-create.ts`, so no change to `onboard.ts` and no extra `docker info` calls.

Testing
- `docker-gpu-patch-wsl.test.ts`: CDI is skipped (first candidate is `--gpus all`) when `dockerDesktopWsl` is true even with CDI advertised, and CDI is still preferred otherwise.
- `tsc -p tsconfig.src.json` clean; GPU-patch suites pass (remaining failures are pre-existing Windows-only `/etc/cdi` path tests, identical on `main`).

Notes / follow-up
- Reached only after the `[2/8]` gateway-bind issue ([WSL2][Policy&Network] OpenShell gateway unreachable from sandbox containers on Docker Desktop WSL (binds to 127.0.0.1:8080) #5513, fix in fix(onboard): bind gateway to 0.0.0.0 on Docker Desktop WSL (#5513) #5534).
- … `*-nemoclaw-gpu-backup-<timestamp>` before container creation, and only the new container is removed, leaving an orphan backup. Happy to follow up with a focused PR for that cleanup.

Fixes #5512.
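
For reviewers, the mode selection described under Fix can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: the `GpuModeCandidate` and `CandidateOptions` types and the function name `buildGpuModeCandidatesSketch` are hypothetical, and the two boolean inputs merely stand in for the results of `dockerReportsNvidiaCdiDevices()` and `isDockerDesktopWslRuntime()`.

```typescript
// Hypothetical shapes for illustration only; the real types in
// docker-gpu-sandbox-create.ts may differ.
type GpuModeCandidate = {
  name: "cdi" | "gpus";
  args: string[];
};

interface CandidateOptions {
  // Assumed to mirror the result of dockerReportsNvidiaCdiDevices().
  cdiAdvertised: boolean;
  // Assumed to mirror the result of isDockerDesktopWslRuntime().
  dockerDesktopWsl: boolean;
}

function buildGpuModeCandidatesSketch(opts: CandidateOptions): GpuModeCandidate[] {
  const candidates: GpuModeCandidate[] = [];

  // Offer CDI first only when it is advertised AND we are not on
  // Docker Desktop WSL, where the advertised spec is not usable.
  if (opts.cdiAdvertised && !opts.dockerDesktopWsl) {
    candidates.push({ name: "cdi", args: ["--device", "nvidia.com/gpu=all"] });
  }

  // The --gpus compatibility path is always available as a candidate.
  candidates.push({ name: "gpus", args: ["--gpus", "all"] });
  return candidates;
}

const wsl = buildGpuModeCandidatesSketch({ cdiAdvertised: true, dockerDesktopWsl: true });
const native = buildGpuModeCandidatesSketch({ cdiAdvertised: true, dockerDesktopWsl: false });

console.log(wsl[0].args.join(" "));    // prints "--gpus all"
console.log(native[0].args.join(" ")); // prints "--device nvidia.com/gpu=all"
```

The point of the sketch is that the skip lives in candidate construction rather than in failure handling, so a create-only probe never gets the chance to accept a CDI mode that the real recreate would then reject.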
Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>