Skip to content

fix(onboard): restore pre-patch sandbox when GPU recreate fails early (#5512)#5541

Open
abhi-0906 wants to merge 1 commit into
NVIDIA:mainfrom
abhi-0906:fix/gpu-patch-backup-cleanup-on-early-failure
Open

fix(onboard): restore pre-patch sandbox when GPU recreate fails early (#5512)#5541
abhi-0906 wants to merge 1 commit into
NVIDIA:mainfrom
abhi-0906:fix/gpu-patch-backup-cleanup-on-early-failure

Conversation

@abhi-0906

@abhi-0906 abhi-0906 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up to #5537 addressing the orphan-backup symptom in #5512. When the Docker GPU patch's recreate docker run fails after the original sandbox was already renamed to *-nemoclaw-gpu-backup-<timestamp>, the early-failure path removed only the failed new container and left the backup orphaned — stranding the sandbox with no live original, and colliding with *-nemoclaw-gpu-backup-* on the next retry (as reported in #5512).

The supervisor-reconnect failure path already rolls back to the backup; this path didn't.

Fix

Reuse the existing rollback primitive (rollbackToBackupContainer) on the early-failure path: remove the failed new container, rename the backup back to the original name, and start it — restoring the pre-patch sandbox instead of leaking a backup container.

  • Adds rollbackDockerGpuPatchOnRecreateFailure(refs, deps) to docker-gpu-patch-finalize.ts, which resolves the real docker start / docker rename defaults (the recreate call path only carries a deps subset, so dockerStart would otherwise be unset).
  • Records context.rolledBack for failure diagnostics, matching the reconnect-failure path.
  • No onboard.ts change; all edits are under src/lib/onboard/.

Testing

  • New composed test in docker-gpu-patch-rollback.test.ts: when dockerRunDetached fails, the backup is renamed back to the original and started, and is never left as an orphaned container.
  • tsc -p tsconfig.src.json clean; rollback / finalize / sandbox-create suites pass (18/18).

Relationship to the WSL Docker Desktop chain

Refs #5512.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error recovery for Docker GPU patch operations. When the patch recreation step fails after the original container is renamed, the system now automatically rolls back to restore the pre-patch sandbox state instead of leaving it in a corrupted state.
  • Tests

    • Added test coverage for rollback scenarios during GPU patch recreation failures.

Signed-off-by: Abhimanyu Kumar abhimanyukumar7290@gmail.com

…NVIDIA#5512)

When the Docker GPU patch recreate `docker run` fails after the original
sandbox was already renamed to `*-nemoclaw-gpu-backup-<timestamp>`, the
early-failure path removed only the failed new container and left the
backup orphaned — stranding the sandbox with no live original and
colliding on the next retry.

Reuse the existing rollback primitive on this path: remove the failed new
container, rename the backup back to the original name, and start it, so
onboarding restores the pre-patch sandbox instead of leaking a backup
container. Adds rollbackDockerGpuPatchOnRecreateFailure to the finalize
module (resolving the real docker start/rename defaults) and records
context.rolledBack for diagnostics.

Follow-up to NVIDIA#5537 (which makes the patch succeed on Docker Desktop WSL,
so this path is no longer hit there) addressing the orphan-backup symptom
noted in NVIDIA#5512.

Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds rollbackDockerGpuPatchOnRecreateFailure to docker-gpu-patch-finalize.ts, wires it into the recreate-failure path of recreateOpenShellDockerSandboxWithGpu in docker-gpu-patch.ts (replacing a plain container cleanup), and adds a Vitest test asserting the full pre-patch sandbox restoration sequence.

GPU-patch recreate rollback

Layer / File(s) Summary
rollbackDockerGpuPatchOnRecreateFailure helper
src/lib/onboard/docker-gpu-patch-finalize.ts
Exports a new function that accepts raw DockerGpuPatchDeps, resolves full rollback deps via resolveRollbackDeps, and delegates stop/remove/rename/start to the existing rollbackToBackupContainer.
Callsite integration and test
src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-patch-rollback.test.ts
Imports and calls rollbackDockerGpuPatchOnRecreateFailure in the recreate failure branch of recreateOpenShellDockerSandboxWithGpu, storing the boolean result in context.rolledBack; test asserts backup-to-original rename, ignoreError: true start, and that dockerRm is not called on the backup container.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

area: sandbox, bug-fix

Suggested reviewers

  • cv

Poem

🐇 When docker run goes awry mid-flight,
And the container rename leaves things a fright,
I hop back to the backup with care,
Rename, start, and leave no orphan there!
The sandbox restored, my warren secure — 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: restoring the pre-patch sandbox when GPU recreation fails early, which is the core issue addressed by this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant