fix: retry connection when exporter is temporarily unavailable#829
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a configurable retry timeout for exporter connection attempts, introduces ChangesExporter retry timeout and unreachable handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
0a61201 to
7142c52
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (1)
158-188: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick winTiming window is too tight and can make this test flaky.
Line 164 sets
future_endto only 50ms ahead, so scheduler jitter can make the lease appear expired on first poll and intermittently fail thecall_count >= 2assertion at Line 185.Suggested stabilization
- future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50) + future_end = datetime.now(tz=timezone.utc) + timedelta(seconds=1) @@ - async with lease.monitor_async(): - await asyncio.sleep(0.2) + async with lease.monitor_async(): + await asyncio.sleep(1.2)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py` around lines 158 - 188, The timing window in the test_uses_cached_end_time_when_poll_fails method is too tight, with future_end set only 50 milliseconds ahead which can cause scheduler jitter to make the lease appear expired prematurely. Increase the timedelta(milliseconds=50) value to a much larger duration (such as several seconds) to provide a stable timing window that ensures the lease does not expire during the asyncio.sleep(0.2) call and allows the call_count >= 2 assertion to reliably pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 298-303: The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 399-405: The try-except block in the connect_router_stream context
manager is catching all exceptions and wrapping them into
ExporterUnavailableError, which masks unrelated failures and cancellation
errors. Replace the broad Exception catch with specific exception types that
represent connection and transport failures, and re-raise any other exceptions
that are not related to router connectivity issues. This allows
lifecycle/cancellation errors and other root causes to propagate correctly
without being incorrectly classified as retryable exporter unavailability.
---
Outside diff comments:
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 158-188: The timing window in the
test_uses_cached_end_time_when_poll_fails method is too tight, with future_end
set only 50 milliseconds ahead which can cause scheduler jitter to make the
lease appear expired prematurely. Increase the timedelta(milliseconds=50) value
to a much larger duration (such as several seconds) to provide a stable timing
window that ensures the lease does not expire during the asyncio.sleep(0.2) call
and allows the call_count >= 2 assertion to reliably pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0be35727-d399-4af5-98ca-7fe5a02c5b38
📒 Files selected for processing (8)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.pypython/packages/jumpstarter-cli/jumpstarter_cli/shell_test.pypython/packages/jumpstarter/jumpstarter/client/client.pypython/packages/jumpstarter/jumpstarter/client/client_test.pypython/packages/jumpstarter/jumpstarter/client/lease.pypython/packages/jumpstarter/jumpstarter/client/lease_test.pypython/packages/jumpstarter/jumpstarter/common/exceptions.pypython/packages/jumpstarter/jumpstarter/config/client.py
| try: | ||
| async with connect_router_stream( | ||
| response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options | ||
| ): | ||
| pass | ||
| except Exception as e: | ||
| raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Do not classify every router-stream exception as retryable exporter unavailability.
This catch-all wraps unrelated failures (and lifecycle/cancellation errors) into ExporterUnavailableError, which can trigger incorrect outer retries and hide root causes. Restrict wrapping to connection/transport failures and re-raise everything else.
Suggested fix
- except Exception as e:
- raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
+ except (AioRpcError, OSError) as e:
+ raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/packages/jumpstarter/jumpstarter/client/lease.py` around lines 399 -
405, The try-except block in the connect_router_stream context manager is
catching all exceptions and wrapping them into ExporterUnavailableError, which
masks unrelated failures and cancellation errors. Replace the broad Exception
catch with specific exception types that represent connection and transport
failures, and re-raise any other exceptions that are not related to router
connectivity issues. This allows lifecycle/cancellation errors and other root
causes to propagate correctly without being incorrectly classified as retryable
exporter unavailability.
|
wait. this have a regression with broken retry mid-session. |
f3051a8 to
dc2ef30
Compare
tested manually:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
482-495: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winNo backoff between unreachable retries.
On
ExporterUnreachableErrorthe loop immediately re-acquires and re-runs. For pre-created leases (release=False, fixedlease_name) re-acquisition resolves quickly via the existing lease, so a persistently dead/suspended exporter is retried in a near-tight loop that hammers the controller (acquire + release RPCs) and the exporter probe untilconnect_deadline. Adding a short bounded sleep/backoff beforecontinuewould reduce load while still respecting the deadline.♻️ Suggested backoff before retry
logger.debug("Unreachable cause: %s", unreachable) + await anyio.sleep(min(5.0, max(0.0, connect_deadline - time.monotonic()))) continue # lease released by __aexit__, loop re-acquires🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py` around lines 482 - 495, Add a short bounded backoff before retrying the unreachable-exporter path in shell.py so the acquire/release loop does not spin tightly; in the `unreachable is not None` branch inside the retry loop, insert a small sleep before `continue` while still honoring `connect_deadline`. Keep the change localized to the retry handling around `ExporterUnreachableError`, `connect_deadline`, and the lease reacquisition logic so pre-created leases with fixed `lease_name` also slow down between attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py`:
- Line 107: The mock lease setup in shell_test is using the wrong timeout
attribute, which can drift from what _shell_with_signal_handling actually reads.
Update the test fixture to set lease.retry_timeout instead of
connect_retry_timeout so the mock matches the production Lease attribute and
exercises the same path used by _shell_with_signal_handling.
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 461-495: The token-expiry monitor is being started on every retry
in the shell retry loop, which leaves stale monitor tasks alive after a lease is
released. Update the flow in the shell retry logic around _monitor_token_expiry
and _run_shell_with_lease_async so the monitor is started only once for the
active session, or explicitly cancel/restart it when lease acquisition retries
occur. Keep the task tied to the current lease lifecycle rather than the outer
while True retry loop.
In `@python/packages/jumpstarter/jumpstarter/config/client.py`:
- Around line 337-341: The effective retry timeout is being read from
JMP_RETRY_TIMEOUT outside Pydantic, so invalid or negative values can slip past
ClientConfigV1Alpha1Lease.retry_timeout validation. Update the client-side
config path around ClientConfigV1Alpha1Lease and the retry_timeout_seconds
calculation in client.py to validate the resolved timeout before building the
lease, and reject non-numeric or negative values with the same constraints as
the model field.
---
Nitpick comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 482-495: Add a short bounded backoff before retrying the
unreachable-exporter path in shell.py so the acquire/release loop does not spin
tightly; in the `unreachable is not None` branch inside the retry loop, insert a
small sleep before `continue` while still honoring `connect_deadline`. Keep the
change localized to the retry handling around `ExporterUnreachableError`,
`connect_deadline`, and the lease reacquisition logic so pre-created leases with
fixed `lease_name` also slow down between attempts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4386b61e-31e4-4c6d-8ee9-2aca9330091a
📒 Files selected for processing (11)
python/packages/jumpstarter-cli/jumpstarter_cli/common.pypython/packages/jumpstarter-cli/jumpstarter_cli/shell.pypython/packages/jumpstarter-cli/jumpstarter_cli/shell_test.pypython/packages/jumpstarter/jumpstarter/client/client.pypython/packages/jumpstarter/jumpstarter/client/client_test.pypython/packages/jumpstarter/jumpstarter/client/lease.pypython/packages/jumpstarter/jumpstarter/client/lease_test.pypython/packages/jumpstarter/jumpstarter/common/exceptions.pypython/packages/jumpstarter/jumpstarter/config/client.pypython/packages/jumpstarter/jumpstarter/config/client_config_test.pypython/packages/jumpstarter/jumpstarter/config/env.py
f05e107 to
66c82b8
Compare
| f"{lease.retry_timeout:.0f}s of retrying" | ||
| ) from unreachable | ||
| logger.warning( | ||
| "Exporter %s is unreachable, releasing lease and retrying...", |
There was a problem hiding this comment.
On a non-named lease it does not release, simply retries.
There was a problem hiding this comment.
yeah, cosmetic but definitely worth fixing. i'll probably do a follow up pr for this one and for better test cases
mangelajo
left a comment
There was a problem hiding this comment.
Very clean patch TBH, good work
|
I am disabling the retries on some more E2E tests that don't need/want retries, because otherwise it bumps the E2E test run by 8-10m |
When `jmp shell` connects to an exporter that was killed or suspended, GetReport fails with UNAVAILABLE. Convert this to ExporterUnreachableError in client_from_path so the retry loop in _shell_with_signal_handling can release the lease and re-acquire, eventually reaching a live exporter. Handle ExporterUnreachableError both bare and when wrapped in BaseExceptionGroup by nested task groups (TemporaryUnixListener). Narrow the get_status_async error wrapping to specific gRPC status codes instead of broad Exception. Add connect_timeout (default 5m) to bound the retry window for both auto-created and pre-created leases. Also fix 2**attempt overflow in the Dial retry loop by capping the exponent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- client_test: verify client_from_path converts UNAVAILABLE and DEADLINE_EXCEEDED to ExporterUnreachableError, other codes propagate - shell_test: verify retry loop times out with connect_timeout, succeeds when exporter recovers, and retries when ExporterUnreachableError is wrapped in BaseExceptionGroup - lease_test: fix MockAioRpcError attrs for pytest display, cap 2**attempt in retry tests to prevent overflow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename connect_timeout to retry_timeout and expose it at all three levels of the client configuration hierarchy: - CLI flag: --retry-timeout (e.g., '5m', '30s', '0' to disable) - Environment variable: JMP_RETRY_TIMEOUT (seconds) - Config file: leases.retry_timeout (seconds, default 300) Setting retry_timeout to 0 disables the retry loop entirely, which is useful for e2e tests where the exporter intentionally exits (e.g., beforeLease onFailure=exit). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests B2, B3, B4, and G2 all invoke jmp shell against exporters that terminate or release the lease after a hook failure. With the new default retry-timeout of 5m, the client retries for up to 5 minutes before giving up, adding ~8 minutes to the E2E suite. Pass --retry-timeout 0 to disable retry so these tests fail fast as intended.
When beforeLease hook fails with onFailure: exit, the handler reported BEFORE_LEASE_HOOK_FAILED then immediately OFFLINE, clobbering the failure status before the client could poll it. This caused a race where wait_for_any_of missed the target status and blocked for the full 300s timeout, producing the e2e B5/C2 test timeouts. Remove the early OFFLINE report — the shutdown/unregistration path reports it reliably. The controller's filterOutNotReadyExporters already prevents lease assignment to exporters with non-Available status, so no lease-assignment regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
a41e743 to
ba142cf
Compare
When
jmp shellconnects to an exporter that was killed, suspended, ornetwork-partitioned, the initial
GetReportRPC (insideclient_from_channel)fails with
AioRpcError(UNAVAILABLE). Previously this surfaced as anunhandled gRPC error. Now the client releases the lease and retries the full
connection cycle.
Changes
gRPC error conversion (
client.py):client_from_pathwrapsclient_from_channelso thatAioRpcError(UNAVAILABLE | DEADLINE_EXCEEDED)from the initial
GetReportcall is converted toExporterUnreachableError.Other gRPC codes propagate unchanged.
Retry loop (
shell.py):_shell_with_signal_handlingcatchesExporterUnreachableErrorboth bare and when wrapped inBaseExceptionGroup(by nested task groups likeTemporaryUnixListener).On catch, the lease context manager exits (releasing the lease), a warning
is logged, and the loop re-acquires. A
connect_timeout(default 5 minutes)bounds the total retry window. For auto-created leases (
--selector), thecontroller may assign a different exporter on each cycle. For pre-created
leases (
--lease), the same exporter is retried until it recovers or thetimeout expires.
Status check narrowing (
shell.py): Theget_status_asyncerrorwrapping now catches only
AioRpcErrorwithUNAVAILABLE | DEADLINE_EXCEEDEDinstead of broad
Exception, so programming errors propagate normally.connect_timeoutconfig (config/client.py,lease.py): New field onClientConfigV1Alpha1Lease(default 300s), passed through to theLeasedataclass.
Overflow fix (
lease.py): Cap2 ** attemptto2 ** min(attempt, 10)in the Dial retry backoff to prevent
OverflowErrorwhensleepis mockedin tests.
Behaviour
ExporterUnreachableErrorwith elapsed timeTesting
client_test.py:GetReporterror conversion for UNAVAILABLE, DEADLINE_EXCEEDED, and passthrough of other codesshell_test.py: retry loop timeout exhaustion, successful retry when exporter recovers, and retry whenExporterUnreachableErroris wrapped inBaseExceptionGrouplease_test.py: fixMockAioRpcErrorinternal attrs for pytest display