Skip to content

fix(onboard): stop the gateway when its sandbox-bridge probe fails (#5513)#5536

Open
abhi-0906 wants to merge 1 commit into
NVIDIA:mainfrom
abhi-0906:fix/orphaned-gateway-cleanup-on-probe-failure
Open

fix(onboard): stop the gateway when its sandbox-bridge probe fails (#5513)#5536
abhi-0906 wants to merge 1 commit into
NVIDIA:mainfrom
abhi-0906:fix/orphaned-gateway-cleanup-on-probe-failure

Conversation

@abhi-0906

@abhi-0906 abhi-0906 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up to #5534 (the Docker Desktop WSL gateway bind fix). When onboard's [2/8] sandbox-bridge reachability probe fails, NemoClaw aborts via process.exit(1) without stopping the OpenShell gateway it started (or reused/adopted) earlier in the same run. The gateway is left running, bound to the loopback address — so the accompanying "restart Docker and re-run" hint is misleading: the stale listener survives a Docker restart and collides with the next attempt (the orphaned-gateway symptom called out in #5513).

Fix

  • Add an onUnreachable hook to verifySandboxBridgeGatewayReachableOrExit that fires only on a genuine unreachable result — not the soft probe_unavailable skip, and not a successful probe.
  • Wire it at the three host-mode gateway paths in startDockerDriverGateway (fresh start, reuse, adopt) to tear the gateway down via the existing stopDockerDriverGatewayProcess(). That helper reads the pid file written in every path and only terminates a verified gateway process, so it is a safe no-op otherwise.

Relationship to #5534: #5534 stops the probe from failing on Docker Desktop WSL in the first place (bind 0.0.0.0); this PR ensures that any genuine probe failure — on any host — no longer orphans the gateway, and makes the "re-run onboard" guidance accurate.

Testing

  • New unit tests in gateway-sandbox-reachability.test.ts: onUnreachable fires on a genuine unreachable probe, and not on a successful probe or a soft probe_unavailable result.
  • tsc -p tsconfig.src.json clean; reachability + onboard-runtime suites pass (remaining failures are pre-existing Windows-only path / CLI-not-found tests, identical on main).

Refs #5513.

Summary by CodeRabbit

Bug Fixes

  • Improved onboarding behavior when sandbox-bridge reachability fails: the system now triggers a cleanup step before aborting, preventing leftover/reused gateway processes from remaining active.

Tests

  • Added coverage for gateway reachability verification, including ensuring the new “unreachable” handler runs for genuine failures but not for successful or soft “probe unavailable” outcomes.

Signed-off-by: Abhimanyu Kumar abhimanyukumar7290@gmail.com

@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds an optional onUnreachable callback to SandboxBridgeVerifierOptions that fires immediately before verifySandboxBridgeGatewayReachableOrExit aborts on genuine probe failure. A shared sandboxBridgeProbeOptions object is defined in startDockerDriverGateway() with a callback that stops the Docker-driver gateway process, then wired into all three gateway verification paths (reuse, adopt, healthy). Tests validate the callback's invocation semantics.

Changes

Gateway orphan cleanup on sandbox-bridge unreachability

Layer / File(s) Summary
onUnreachable callback contract
src/lib/onboard/gateway-sandbox-reachability.ts
SandboxBridgeVerifierOptions gains onUnreachable?: () => void; the function invokes it before aborting on genuine reachability failures, skipping it for probe_unavailable soft results.
Shared probe options and wiring into three gateway verification paths
src/lib/onboard.ts
Defines sandboxBridgeProbeOptions object with skip and onUnreachable callback (that stops the Docker-driver gateway process), then passes it to verifySandboxBridgeGatewayReachableOrExit in the reuse, adopt, and healthy-probe verification call sites.
Unit tests for callback semantics
src/lib/onboard/gateway-sandbox-reachability.test.ts
Three new tests assert onUnreachable fires before abort for TCP failure, is not called on success, and is not called for probe_unavailable.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Suggested labels

bug-fix, area: onboarding

Suggested reviewers

  • cv
  • ericksoa

🐇 A gateway left open — no more!
With a callback that knocks on the door,
If the bridge can't be found,
We shut the thing down,
And leave no lost process to snore. 🌉✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: adding logic to stop the gateway process when the sandbox-bridge reachability probe fails during onboarding.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2193-2206: The src/lib/onboard.ts file has exceeded the growth
budget with a net addition of 17 lines, blocking the CI guardrail. To fix this,
reduce the net additions by either condensing the lengthy inline comment block
(the "// `#5513`:" explanation) that precedes the stopGatewayStartedDuringOnboard
function definition, or move the helper function and its supporting
documentation to a separate utility file. Focus on compressing the explanatory
text while preserving the essential logic and functionality of the
stopGatewayStartedDuringOnboard implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 841e023b-2b23-41d5-9288-3a50f900392a

📥 Commits

Reviewing files that changed from the base of the PR and between 01e5525 and 8aa9e98.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts

Comment thread src/lib/onboard.ts Outdated
…VIDIA#5513)

When onboard's [2/8] sandbox-bridge reachability probe fails, NemoClaw
aborts via process.exit(1) without stopping the OpenShell gateway it
started (or reused/adopted) earlier in the same run. The gateway is left
running, bound to the loopback address, so the accompanying "restart
Docker and re-run" hint is misleading: the stale listener survives a
Docker restart and collides with the next attempt.

Add an onUnreachable hook to verifySandboxBridgeGatewayReachableOrExit
that fires only on a genuine unreachable result (not the soft
probe_unavailable skip or a successful probe), and wire it at the three
host-mode gateway paths in startDockerDriverGateway (fresh start, reuse,
adopt) to tear the gateway down via the existing
stopDockerDriverGatewayProcess(). That helper reads the pid file written
in every path and only terminates a verified gateway process, so it is a
safe no-op otherwise.

Follow-up to the bind-address fix in NVIDIA#5534: that change keeps the probe
from failing on Docker Desktop WSL in the first place, while this ensures
any genuine probe failure no longer orphans the gateway.

Signed-off-by: Abhimanyu Kumar <abhimanyukumar7290@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant