Skip to content

fix: retry connection when exporter is temporarily unavailable#829

Merged
mangelajo merged 6 commits into
jumpstarter-dev:mainfrom
evakhoni:fix/handle-dead-exporter
Jun 26, 2026
Merged

fix: retry connection when exporter is temporarily unavailable#829
mangelajo merged 6 commits into
jumpstarter-dev:mainfrom
evakhoni:fix/handle-dead-exporter

Conversation

@evakhoni

@evakhoni evakhoni commented Jun 23, 2026

Copy link
Copy Markdown
Member

When jmp shell connects to an exporter that was killed, suspended, or
network-partitioned, the initial GetReport RPC (inside client_from_channel)
fails with AioRpcError(UNAVAILABLE). Previously this surfaced as an
unhandled gRPC error. Now the client releases the lease and retries the full
connection cycle.

Changes

gRPC error conversion (client.py): client_from_path wraps
client_from_channel so that AioRpcError(UNAVAILABLE | DEADLINE_EXCEEDED)
from the initial GetReport call is converted to ExporterUnreachableError.
Other gRPC codes propagate unchanged.

Retry loop (shell.py): _shell_with_signal_handling catches
ExporterUnreachableError both bare and when wrapped in
BaseExceptionGroup (by nested task groups like TemporaryUnixListener).
On catch, the lease context manager exits (releasing the lease), a warning
is logged, and the loop re-acquires. A connect_timeout (default 5 minutes)
bounds the total retry window. For auto-created leases (--selector), the
controller may assign a different exporter on each cycle. For pre-created
leases (--lease), the same exporter is retried until it recovers or the
timeout expires.

Status check narrowing (shell.py): The get_status_async error
wrapping now catches only AioRpcError with UNAVAILABLE | DEADLINE_EXCEEDED
instead of broad Exception, so programming errors propagate normally.

connect_timeout config (config/client.py, lease.py): New field on
ClientConfigV1Alpha1Lease (default 300s), passed through to the Lease
dataclass.

Overflow fix (lease.py): Cap 2 ** attempt to 2 ** min(attempt, 10)
in the Dial retry backoff to prevent OverflowError when sleep is mocked
in tests.

Behaviour

Scenario Outcome
Exporter killed, controller not yet aware Retries with lease release/re-acquire, up to 5m
Exporter recovers during retry window Succeeds on next attempt
Pre-created lease, exporter recovers Same exporter retried, succeeds when it comes back
Retry budget exhausted ExporterUnreachableError with elapsed time

Testing

  • client_test.py: GetReport error conversion for UNAVAILABLE, DEADLINE_EXCEEDED, and passthrough of other codes
  • shell_test.py: retry loop timeout exhaustion, successful retry when exporter recovers, and retry when ExporterUnreachableError is wrapped in BaseExceptionGroup
  • lease_test.py: fix MockAioRpcError internal attrs for pytest display

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a configurable retry timeout for exporter connection attempts, introduces ExporterUnreachableError, converts initial gRPC probe failures into that exception, and retries shell execution until the retry budget expires.

Changes

Exporter retry timeout and unreachable handling

Layer / File(s) Summary
Retry timeout config
python/packages/jumpstarter/jumpstarter/config/env.py, python/packages/jumpstarter/jumpstarter/config/client.py, python/packages/jumpstarter/jumpstarter/client/lease.py, python/packages/jumpstarter/jumpstarter/config/client_config_test.py
retry_timeout is added to the env/config surface, stored on Lease, used to cap dial retry delays, and included in saved client config output.
Unreachable exporter mapping
python/packages/jumpstarter/jumpstarter/common/exceptions.py, python/packages/jumpstarter/jumpstarter/client/client.py, python/packages/jumpstarter/jumpstarter/client/client_test.py, python/packages/jumpstarter-cli/jumpstarter_cli/shell.py, python/packages/jumpstarter/jumpstarter/client/lease_test.py
ExporterUnreachableError is added, initial gRPC failures in client_from_path and shell status probing are converted into it, and tests use deterministic gRPC mocks for conversion and pass-through cases.
Shell retry loop
python/packages/jumpstarter-cli/jumpstarter_cli/common.py, python/packages/jumpstarter-cli/jumpstarter_cli/shell.py, python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py, e2e/test/hooks_test.go
The shell command adds --retry-timeout, threads it into shell execution, retries exporter-unreachable failures until the deadline, and updates the shell tests and e2e shell invocations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

python

Suggested reviewers

  • mangelajo
  • raballew

Poem

I hopped through gRPC fog at dawn,
With retry carrots brightly drawn.
If an exporter blinked away,
I’d hop again till it said “stay.” 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: retrying shell connections when the exporter is temporarily unavailable.
Description check ✅ Passed The description accurately matches the implemented retry, error conversion, timeout, and config changes in the pull request.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@evakhoni evakhoni requested a review from mangelajo June 23, 2026 16:37
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch from 0a61201 to 7142c52 Compare June 23, 2026 16:40

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (1)

158-188: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Timing window is too tight and can make this test flaky.

Line 164 sets future_end to only 50ms ahead, so scheduler jitter can make the lease appear expired on first poll and intermittently fail the call_count >= 2 assertion at Line 185.

Suggested stabilization
-        future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50)
+        future_end = datetime.now(tz=timezone.utc) + timedelta(seconds=1)
@@
-            async with lease.monitor_async():
-                await asyncio.sleep(0.2)
+            async with lease.monitor_async():
+                await asyncio.sleep(1.2)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py` around lines
158 - 188, The timing window in the test_uses_cached_end_time_when_poll_fails
method is too tight, with future_end set only 50 milliseconds ahead which can
cause scheduler jitter to make the lease appear expired prematurely. Increase
the timedelta(milliseconds=50) value to a much larger duration (such as several
seconds) to provide a stable timing window that ensures the lease does not
expire during the asyncio.sleep(0.2) call and allows the call_count >= 2
assertion to reliably pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 298-303: The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.

In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 399-405: The try-except block in the connect_router_stream context
manager is catching all exceptions and wrapping them into
ExporterUnavailableError, which masks unrelated failures and cancellation
errors. Replace the broad Exception catch with specific exception types that
represent connection and transport failures, and re-raise any other exceptions
that are not related to router connectivity issues. This allows
lifecycle/cancellation errors and other root causes to propagate correctly
without being incorrectly classified as retryable exporter unavailability.

---

Outside diff comments:
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 158-188: The timing window in the
test_uses_cached_end_time_when_poll_fails method is too tight, with future_end
set only 50 milliseconds ahead which can cause scheduler jitter to make the
lease appear expired prematurely. Increase the timedelta(milliseconds=50) value
to a much larger duration (such as several seconds) to provide a stable timing
window that ensures the lease does not expire during the asyncio.sleep(0.2) call
and allows the call_count >= 2 assertion to reliably pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0be35727-d399-4af5-98ca-7fe5a02c5b38

📥 Commits

Reviewing files that changed from the base of the PR and between 4a5d98e and 7142c52.

📒 Files selected for processing (8)
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell.py
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py
  • python/packages/jumpstarter/jumpstarter/client/client.py
  • python/packages/jumpstarter/jumpstarter/client/client_test.py
  • python/packages/jumpstarter/jumpstarter/client/lease.py
  • python/packages/jumpstarter/jumpstarter/client/lease_test.py
  • python/packages/jumpstarter/jumpstarter/common/exceptions.py
  • python/packages/jumpstarter/jumpstarter/config/client.py

Comment thread python/packages/jumpstarter-cli/jumpstarter_cli/shell.py Outdated
Comment on lines +399 to +405
try:
async with connect_router_stream(
response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options
):
pass
except Exception as e:
raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not classify every router-stream exception as retryable exporter unavailability.

This catch-all wraps unrelated failures (and lifecycle/cancellation errors) into ExporterUnavailableError, which can trigger incorrect outer retries and hide root causes. Restrict wrapping to connection/transport failures and re-raise everything else.

Suggested fix
-        except Exception as e:
-            raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
+        except (AioRpcError, OSError) as e:
+            raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/client/lease.py` around lines 399 -
405, The try-except block in the connect_router_stream context manager is
catching all exceptions and wrapping them into ExporterUnavailableError, which
masks unrelated failures and cancellation errors. Replace the broad Exception
catch with specific exception types that represent connection and transport
failures, and re-raise any other exceptions that are not related to router
connectivity issues. This allows lifecycle/cancellation errors and other root
causes to propagate correctly without being incorrectly classified as retryable
exporter unavailability.

@evakhoni

Copy link
Copy Markdown
Member Author

wait. this have a regression with broken retry mid-session.

@mangelajo mangelajo added this to the 0.9.0 milestone Jun 24, 2026
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch 5 times, most recently from f3051a8 to dc2ef30 Compare June 25, 2026 08:06
@evakhoni

Copy link
Copy Markdown
Member Author

tested manually:

  • jmp shell to killed exporter retries until marked offline, then returns into waiting for another exporter. if exporter recovered it continues as expected.

  • jmp shell to suspended exporter (never marked offline) quits in 5 minutes, unsuspended exporter continues normally. suspended - lease destroyed and not stuck - as expected (not tested with hooks yet)

  • an exporter restarted in the middle of an open connection - recovers fully with next j command working as usual (haven't tested what would happen to hooks in this case though)

  • lease created in advance to suspended exporter is in the same 5m retry loop once used, recovers if unsuspended, otherwise times out and exits. after it's exit, if then unsuspended and the client restarted to attempt again - recovers.

  • lease created in advance on a killed exporter does nothing when it becomes offline, client attempt to use the lease is bound to 300s timeout. if the exporter was marked offline once the client already connected to the lease results in a similar outcome.

@evakhoni evakhoni marked this pull request as ready for review June 25, 2026 14:43

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

482-495: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

No backoff between unreachable retries.

On ExporterUnreachableError the loop immediately re-acquires and re-runs. For pre-created leases (release=False, fixed lease_name) re-acquisition resolves quickly via the existing lease, so a persistently dead/suspended exporter is retried in a near-tight loop that hammers the controller (acquire + release RPCs) and the exporter probe until connect_deadline. Adding a short bounded sleep/backoff before continue would reduce load while still respecting the deadline.

♻️ Suggested backoff before retry
                                 logger.debug("Unreachable cause: %s", unreachable)
+                                await anyio.sleep(min(5.0, max(0.0, connect_deadline - time.monotonic())))
                                 continue  # lease released by __aexit__, loop re-acquires
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py` around lines 482 -
495, Add a short bounded backoff before retrying the unreachable-exporter path
in shell.py so the acquire/release loop does not spin tightly; in the
`unreachable is not None` branch inside the retry loop, insert a small sleep
before `continue` while still honoring `connect_deadline`. Keep the change
localized to the retry handling around `ExporterUnreachableError`,
`connect_deadline`, and the lease reacquisition logic so pre-created leases with
fixed `lease_name` also slow down between attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py`:
- Line 107: The mock lease setup in shell_test is using the wrong timeout
attribute, which can drift from what _shell_with_signal_handling actually reads.
Update the test fixture to set lease.retry_timeout instead of
connect_retry_timeout so the mock matches the production Lease attribute and
exercises the same path used by _shell_with_signal_handling.

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 461-495: The token-expiry monitor is being started on every retry
in the shell retry loop, which leaves stale monitor tasks alive after a lease is
released. Update the flow in the shell retry logic around _monitor_token_expiry
and _run_shell_with_lease_async so the monitor is started only once for the
active session, or explicitly cancel/restart it when lease acquisition retries
occur. Keep the task tied to the current lease lifecycle rather than the outer
while True retry loop.

In `@python/packages/jumpstarter/jumpstarter/config/client.py`:
- Around line 337-341: The effective retry timeout is being read from
JMP_RETRY_TIMEOUT outside Pydantic, so invalid or negative values can slip past
ClientConfigV1Alpha1Lease.retry_timeout validation. Update the client-side
config path around ClientConfigV1Alpha1Lease and the retry_timeout_seconds
calculation in client.py to validate the resolved timeout before building the
lease, and reject non-numeric or negative values with the same constraints as
the model field.

---

Nitpick comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 482-495: Add a short bounded backoff before retrying the
unreachable-exporter path in shell.py so the acquire/release loop does not spin
tightly; in the `unreachable is not None` branch inside the retry loop, insert a
small sleep before `continue` while still honoring `connect_deadline`. Keep the
change localized to the retry handling around `ExporterUnreachableError`,
`connect_deadline`, and the lease reacquisition logic so pre-created leases with
fixed `lease_name` also slow down between attempts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4386b61e-31e4-4c6d-8ee9-2aca9330091a

📥 Commits

Reviewing files that changed from the base of the PR and between 7142c52 and 764826c.

📒 Files selected for processing (11)
  • python/packages/jumpstarter-cli/jumpstarter_cli/common.py
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell.py
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py
  • python/packages/jumpstarter/jumpstarter/client/client.py
  • python/packages/jumpstarter/jumpstarter/client/client_test.py
  • python/packages/jumpstarter/jumpstarter/client/lease.py
  • python/packages/jumpstarter/jumpstarter/client/lease_test.py
  • python/packages/jumpstarter/jumpstarter/common/exceptions.py
  • python/packages/jumpstarter/jumpstarter/config/client.py
  • python/packages/jumpstarter/jumpstarter/config/client_config_test.py
  • python/packages/jumpstarter/jumpstarter/config/env.py

Comment thread python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py Outdated
Comment thread python/packages/jumpstarter-cli/jumpstarter_cli/shell.py
Comment thread python/packages/jumpstarter/jumpstarter/config/client.py
f"{lease.retry_timeout:.0f}s of retrying"
) from unreachable
logger.warning(
"Exporter %s is unreachable, releasing lease and retrying...",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a non-named lease it does not release, simply retries.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, cosmetic but definitely worth fixing. i'll probably do a follow up pr for this one and for better test cases

@mangelajo mangelajo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean patch TBH, good work

@mangelajo

Copy link
Copy Markdown
Member

I am disabling the retries on some more E2E tests that don't need/want retries, because otherwise it bumps the E2E test run by 8-10m

@mangelajo mangelajo enabled auto-merge June 26, 2026 09:45
@mangelajo mangelajo added this pull request to the merge queue Jun 26, 2026
evakhoni and others added 6 commits June 26, 2026 13:14
When `jmp shell` connects to an exporter that was killed or suspended,
GetReport fails with UNAVAILABLE. Convert this to ExporterUnreachableError
in client_from_path so the retry loop in _shell_with_signal_handling can
release the lease and re-acquire, eventually reaching a live exporter.

Handle ExporterUnreachableError both bare and when wrapped in
BaseExceptionGroup by nested task groups (TemporaryUnixListener).
Narrow the get_status_async error wrapping to specific gRPC status
codes instead of broad Exception.

Add connect_timeout (default 5m) to bound the retry window for both
auto-created and pre-created leases. Also fix 2**attempt overflow in
the Dial retry loop by capping the exponent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- client_test: verify client_from_path converts UNAVAILABLE and
  DEADLINE_EXCEEDED to ExporterUnreachableError, other codes propagate
- shell_test: verify retry loop times out with connect_timeout,
  succeeds when exporter recovers, and retries when
  ExporterUnreachableError is wrapped in BaseExceptionGroup
- lease_test: fix MockAioRpcError attrs for pytest display, cap
  2**attempt in retry tests to prevent overflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename connect_timeout to retry_timeout and expose it at all three
levels of the client configuration hierarchy:

- CLI flag: --retry-timeout (e.g., '5m', '30s', '0' to disable)
- Environment variable: JMP_RETRY_TIMEOUT (seconds)
- Config file: leases.retry_timeout (seconds, default 300)

Setting retry_timeout to 0 disables the retry loop entirely, which is
useful for e2e tests where the exporter intentionally exits (e.g.,
beforeLease onFailure=exit).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests B2, B3, B4, and G2 all invoke jmp shell against exporters that
terminate or release the lease after a hook failure. With the new
default retry-timeout of 5m, the client retries for up to 5 minutes
before giving up, adding ~8 minutes to the E2E suite.

Pass --retry-timeout 0 to disable retry so these tests fail fast
as intended.
When beforeLease hook fails with onFailure: exit, the handler reported
BEFORE_LEASE_HOOK_FAILED then immediately OFFLINE, clobbering the
failure status before the client could poll it. This caused a race
where wait_for_any_of missed the target status and blocked for the
full 300s timeout, producing the e2e B5/C2 test timeouts.

Remove the early OFFLINE report — the shutdown/unregistration path
reports it reliably. The controller's filterOutNotReadyExporters
already prevents lease assignment to exporters with non-Available
status, so no lease-assignment regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch from a41e743 to ba142cf Compare June 26, 2026 10:22
Merged via the queue into jumpstarter-dev:main with commit 26aee22 Jun 26, 2026
1 check passed
@evakhoni evakhoni deleted the fix/handle-dead-exporter branch June 26, 2026 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants