fix(onboard): restore pre-patch sandbox when GPU recreate fails early (#5512) #5541
Open
abhi-0906 wants to merge 1 commit into
…NVIDIA#5512)

When the Docker GPU patch recreate `docker run` fails after the original sandbox was already renamed to `*-nemoclaw-gpu-backup-<timestamp>`, the early-failure path removed only the failed new container and left the backup orphaned, stranding the sandbox with no live original and colliding on the next retry.

Reuse the existing rollback primitive on this path: remove the failed new container, rename the backup back to the original name, and start it, so onboarding restores the pre-patch sandbox instead of leaking a backup container.

Adds `rollbackDockerGpuPatchOnRecreateFailure` to the finalize module (resolving the real `docker start`/`docker rename` defaults) and records `context.rolledBack` for diagnostics.

Follow-up to NVIDIA#5537 (which makes the patch succeed on Docker Desktop WSL, so this path is no longer hit there) addressing the orphan-backup symptom noted in NVIDIA#5512.

Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>
Walkthrough: Adds GPU-patch recreate rollback.
Estimated code review effort: 2 (Simple), ~10 minutes.
Pre-merge checks: 4 passed, 1 failed (1 warning).
This was referenced Jun 17, 2026
Summary
Follow-up to #5537 addressing the orphan-backup symptom in #5512. When the Docker GPU patch's recreate `docker run` fails after the original sandbox was already renamed to `*-nemoclaw-gpu-backup-<timestamp>`, the early-failure path removed only the failed new container and left the backup orphaned, stranding the sandbox with no live original and colliding with `*-nemoclaw-gpu-backup-*` on the next retry (as reported in #5512).

The supervisor-reconnect failure path already rolls back to the backup; this path didn't.
Fix
Reuse the existing rollback primitive (`rollbackToBackupContainer`) on the early-failure path: remove the failed new container, rename the backup back to the original name, and start it. This restores the pre-patch sandbox instead of leaking a backup container.

- Adds `rollbackDockerGpuPatchOnRecreateFailure(refs, deps)` to `docker-gpu-patch-finalize.ts`, which resolves the real `docker start` / `docker rename` defaults (the recreate call path only carries a deps subset, so `dockerStart` would otherwise be unset).
- Records `context.rolledBack` for failure diagnostics, matching the reconnect-failure path.
- No `onboard.ts` change; all edits are under `src/lib/onboard/`.

Testing

- `docker-gpu-patch-rollback.test.ts`: when `dockerRunDetached` fails, the backup is renamed back to the original and started, and is never left as an orphaned container.
- `tsc -p tsconfig.src.json` clean; rollback / finalize / sandbox-create suites pass (18/18).

Relationship to the WSL Docker Desktop chain

- [2/8]
- [6/8] (makes the patch succeed on Docker Desktop WSL, so this early-failure path is no longer hit there)

Refs #5512.
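For reviewers, the rollback order described under Fix can be sketched roughly as below. This is an illustrative sketch only, not the code in this PR: the function name `rollbackSketch`, the `RollbackRefs` / `RollbackDeps` / `RollbackResult` shapes, and the in-memory fake are all hypothetical, and it assumes the failed new container (if it exists at all) holds the original name, so it has to be removed before the backup can be renamed back.

```typescript
// Hypothetical types for illustration; the PR's real refs/deps shapes may differ.
interface RollbackRefs {
  originalName: string; // name the sandbox had before the GPU patch
  backupName: string;   // e.g. "<name>-nemoclaw-gpu-backup-<timestamp>"
}

interface RollbackDeps {
  dockerRemove: (name: string) => void;
  dockerRename: (from: string, to: string) => void;
  dockerStart: (name: string) => void;
}

interface RollbackResult {
  rolledBack: boolean;
  error?: string;
}

// Sketch of the early-failure rollback: remove, rename back, start.
function rollbackSketch(refs: RollbackRefs, deps: RollbackDeps): RollbackResult {
  try {
    // The failed new container may never have been created, so a failed
    // removal is tolerated here (an assumption of this sketch).
    deps.dockerRemove(refs.originalName);
  } catch {
    // nothing to remove
  }
  try {
    deps.dockerRename(refs.backupName, refs.originalName);
    deps.dockerStart(refs.originalName);
    return { rolledBack: true };
  } catch (err) {
    // Surface the reason so a caller could record it for diagnostics.
    return { rolledBack: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// In-memory stand-in for Docker: only the backup exists after the failed recreate.
const containers = new Map<string, "running" | "stopped">([
  ["sandbox-nemoclaw-gpu-backup-1", "stopped"],
]);

const fakeDeps: RollbackDeps = {
  dockerRemove: (name) => {
    if (!containers.delete(name)) throw new Error(`no such container: ${name}`);
  },
  dockerRename: (from, to) => {
    const state = containers.get(from);
    if (state === undefined) throw new Error(`no such container: ${from}`);
    containers.delete(from);
    containers.set(to, state);
  },
  dockerStart: (name) => {
    if (!containers.has(name)) throw new Error(`no such container: ${name}`);
    containers.set(name, "running");
  },
};

const result = rollbackSketch(
  { originalName: "sandbox", backupName: "sandbox-nemoclaw-gpu-backup-1" },
  fakeDeps,
);
```

Injecting the Docker operations as `deps` mirrors the approach the description mentions for the real helper and lets the sequence be exercised without a Docker daemon; in the fake above, the backup ends up renamed to `sandbox` and running, with no backup entry left behind.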
Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>