Recover from opencode stalls and control-plane timeouts#66
Conversation
📝 WalkthroughWalkthroughImplements end-to-end resilience for ChangesResume Server-Death Resilience
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Coverage Report
Generated by pytest-cov on |
There was a problem hiding this comment.
Pull request overview
This PR improves CodeCome’s resilience to opencode server crashes, control-plane timeouts, and hung busy turns by introducing structured run statuses, better SSE liveness handling, and automated server restart/retry flows (including resuming Phase 1 from the failed subphase only).
Changes:
- Adds
RunStatuscodes and propagatesserver_unreachable/session_stalledas first-class recoverable conditions throughout the runner/harness/Phase 1 orchestration. - Hardens SSE consumption so long busy turns aren’t abandoned (reconnect past budget while busy+alive) while still bounding genuine hangs (stall watchdog + read-tick inactivity).
- Improves operational diagnostics and workspace hygiene (server exit-code logging; Erlang crash dump redirected into ignored
tmp/; documents new env knobs).
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/phases/completion.py | Adds resume opener messaging for server_unreachable and session_stalled. |
| tools/opencode/serve.py | Adds process-handle liveness checks, restart(), and a daemon exit-code log monitor. |
| tools/events/sse_client.py | Adds read-tick inactivity handling and should_continue to avoid abandoning busy turns. |
| tools/events/phase_loop.py | Adds busy/idle tracking, stall watchdog, status-probe idle recovery, and stall propagation in RunResult. |
| tools/events/base.py | Adds best-effort /session/status probe helper (absence ⇒ idle). |
| tools/codecome/status.py | Introduces shared RunStatus IntEnum and normalization helper. |
| tools/codecome/session.py | Adds structured OpenCodeRequestError, retries for session creation, and health probing. |
| tools/codecome/runner.py | Threads liveness checks through resume wait + event consumption; maps retriable HTTP failures to recoverable outcomes. |
| tools/codecome/phase_1.py | Returns structured Phase 1 outcome and enables retry-from-failed-subphase after server restart. |
| tools/codecome/harness.py | Centralizes restart+retry logic for recoverable conditions; adds restart budgets for Phase 1 and phases 2–6. |
| tests/test_sse_busy_resilience.py | Adds coverage for reconnect budgeting, read-tick behavior, stall watchdog, and idle recovery sync. |
| tests/test_session.py | Adds coverage for create-session retry behavior and retriable request errors. |
| tests/test_sandbox_bootstrap.py | Verifies Erlang crash dump is redirected to ignored tmp/. |
| tests/test_runner_resume_health.py | Tests resume readiness logic with health/liveness probes and stall propagation mapping. |
| tests/test_harness_recovery_restart.py | Tests harness restart/retry behavior and shared restart budget exhaustion. |
| tests/test_codecome_runner.py | Updates runner tests for recoverable prompt/session failures using OpenCodeRequestError. |
| templates/sandboxes/erlang-otp/docker-compose.yml | Sets ERL_CRASH_DUMP to /workspace/tmp/erl_crash.dump. |
| README.md | Documents new resilience/recovery environment variables and behavior. |
| .project/resume-server-death-resilience-plan.md | Adds implementation plan/design notes for the resilience work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with pytest.raises(RuntimeError, match="empty session ID"): | ||
| module.create_session("http://localhost:8080", "1", "recon", None, None, None) | ||
|
|
There was a problem hiding this comment.
Addressed in the follow-up patch: the empty-session test now expects OpenCodeRequestError and asserts retriable=False plus operation="create_session". Validated with targeted tests and make tests.
| self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) | ||
| # Optional diagnostic signal: opencode may pause server.heartbeat during | ||
| # valid long turns, so heartbeat-gap stalls are disabled by default. When | ||
| # explicitly enabled, they can flag runtime blockage sooner than the | ||
| # primary no-progress timeout. | ||
| self._heartbeat_stall_timeout_s = float( | ||
| os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") | ||
| ) |
There was a problem hiding this comment.
Addressed in the follow-up patch: CODECOME_BUSY_STALL_TIMEOUT and CODECOME_HEARTBEAT_STALL_TIMEOUT now use safe float parsing with defaults, matching the defensive behavior used for the probe timeout. Added regression coverage for malformed values.
Greptile SummaryThis PR hardens the phase-mode execution loop against two real-world opencode failure modes — server crashes and hung model turns — by adding structured recovery paths throughout the harness, phase orchestration, and SSE client layers.
Confidence Score: 4/5Safe to merge; all new recovery paths are protected by retry budgets and the design correctly delegates server lifecycle exclusively to the harness. The changes are well-reasoned, thoroughly tested, and handle real production failures documented with PIDs and timestamps. The one notable imperfection is in _wait_for_resume_idle: when the server process is alive but its status endpoint is unavailable, the code records both a specific status_unavailable_process_alive event and a redundant blocked_unknown event in the same poll cycle, which could mislead transcript analysis tooling. The silent-fallback in _set_stream_timeout (can't find the socket → read-tick watchdog silently doesn't work) is another minor observability gap. Neither affects correctness of the primary recovery flows. tools/codecome/runner.py — the _wait_for_resume_idle process-alive branch falls through to record a redundant blocked_unknown transcript event. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant H as Harness
participant P1 as run_phase_1
participant SP as _run_subphase
participant RA as _run_single_attempt
participant PEL as PhaseEventLoop
participant SSE as SseClient
participant OC as opencode serve
H->>P1: "run_phase_1(start_at=1a)"
P1->>SP: "_run_subphase(phase_id=1a)"
SP->>RA: _run_single_attempt(liveness_check)
RA->>OC: create_session / send_prompt
RA->>PEL: _consume_events(liveness_check)
PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
loop SSE stream
SSE->>OC: GET /event readline with socket timeout 10s
alt event arrives
OC-->>SSE: SSE event
SSE-->>PEL: yield event
PEL->>PEL: _note_progress / _track_session_busy
else silent tick socket.timeout
SSE-->>PEL: yield None
PEL->>PEL: _should_keep_consuming stall watchdog
end
alt session idle
PEL-->>RA: "RunResult session_stalled=False"
else stall detected 180s no progress
PEL->>PEL: _status_probe_reports_idle
alt REST says idle
PEL-->>RA: "RunResult idle_via_status=True"
else still busy
PEL-->>RA: "RunResult session_stalled=True"
end
else server process died
PEL->>PEL: liveness_check False stop
PEL-->>RA: "RunResult finish=server_unreachable"
end
end
RA-->>SP: RunStatus run_result
SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
P1-->>H: Phase1Outcome
alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
H->>H: runner.restart
H->>P1: "run_phase_1 start_at=failed_subphase"
else
H-->>H: return exit code
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant H as Harness
participant P1 as run_phase_1
participant SP as _run_subphase
participant RA as _run_single_attempt
participant PEL as PhaseEventLoop
participant SSE as SseClient
participant OC as opencode serve
H->>P1: "run_phase_1(start_at=1a)"
P1->>SP: "_run_subphase(phase_id=1a)"
SP->>RA: _run_single_attempt(liveness_check)
RA->>OC: create_session / send_prompt
RA->>PEL: _consume_events(liveness_check)
PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
loop SSE stream
SSE->>OC: GET /event readline with socket timeout 10s
alt event arrives
OC-->>SSE: SSE event
SSE-->>PEL: yield event
PEL->>PEL: _note_progress / _track_session_busy
else silent tick socket.timeout
SSE-->>PEL: yield None
PEL->>PEL: _should_keep_consuming stall watchdog
end
alt session idle
PEL-->>RA: "RunResult session_stalled=False"
else stall detected 180s no progress
PEL->>PEL: _status_probe_reports_idle
alt REST says idle
PEL-->>RA: "RunResult idle_via_status=True"
else still busy
PEL-->>RA: "RunResult session_stalled=True"
end
else server process died
PEL->>PEL: liveness_check False stop
PEL-->>RA: "RunResult finish=server_unreachable"
end
end
RA-->>SP: RunStatus run_result
SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
P1-->>H: Phase1Outcome
alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
H->>H: runner.restart
H->>P1: "run_phase_1 start_at=failed_subphase"
else
H-->>H: return exit code
end
Reviews (1): Last reviewed commit: "fix: redirect erlang crash dumps to igno..." | Re-trigger Greptile |
| if liveness_check is not None: | ||
| if liveness_check(): | ||
| consecutive_unavailable = 0 | ||
| _record_codecome_event( | ||
| transcript, | ||
| "codecome.resume.status_unavailable_process_alive", | ||
| sessionID=session_id, | ||
| ) | ||
| else: |
There was a problem hiding this comment.
Redundant
blocked_unknown event when process is alive. When liveness_check() returns True and status is None, the code records status_unavailable_process_alive but then falls through unconditionally to also record blocked_unknown (since status != "busy"). This writes two semantically conflicting events for the same poll cycle — the second one is misleading because we do know the reason. Adding continue after the alive-branch avoids the duplication.
| if liveness_check is not None: | |
| if liveness_check(): | |
| consecutive_unavailable = 0 | |
| _record_codecome_event( | |
| transcript, | |
| "codecome.resume.status_unavailable_process_alive", | |
| sessionID=session_id, | |
| ) | |
| else: | |
| if liveness_check is not None: | |
| if liveness_check(): | |
| consecutive_unavailable = 0 | |
| _record_codecome_event( | |
| transcript, | |
| "codecome.resume.status_unavailable_process_alive", | |
| sessionID=session_id, | |
| ) | |
| if time.monotonic() >= deadline: | |
| _record_codecome_event( | |
| transcript, | |
| "codecome.resume.timeout", | |
| sessionID=session_id, | |
| status=None, | |
| timeoutSeconds=timeout_s, | |
| ) | |
| raise ResumeSessionNotReady( | |
| f"session {session_id} status endpoint unavailable (process alive) " | |
| f"after {timeout_s:g}s; refusing to send resume prompt" | |
| ) | |
| time.sleep(max(poll_s, 0.1)) | |
| continue | |
| else: |
There was a problem hiding this comment.
Addressed in the follow-up patch: the live-process/status-unavailable branch now records the specific event, handles timeout/sleep, and continues without falling through to blocked_unknown. Added a regression assertion for this event stream.
| def _set_stream_timeout(resp: Any, timeout: float) -> None: | ||
| """Best-effort: set a per-read socket timeout on the response stream.""" | ||
| fp = getattr(resp, "fp", None) | ||
| sock = None | ||
| if fp is not None: | ||
| raw = getattr(fp, "raw", None) | ||
| sock = getattr(raw, "_sock", None) or getattr(fp, "_sock", None) | ||
| if sock is None: | ||
| sock = getattr(resp, "_sock", None) | ||
| if sock is not None and hasattr(sock, "settimeout"): | ||
| try: | ||
| sock.settimeout(timeout) | ||
| except Exception: # noqa: BLE001 | ||
| pass |
There was a problem hiding this comment.
_set_stream_timeout silently no-ops on failure with no observable signal
If the socket attribute path (resp.fp.raw._sock, resp.fp._sock, or resp._sock) can't be resolved — which can happen across Python versions, proxied connections, or alternate urllib implementations — the timeout is never set and resp.readline() blocks forever. When that happens, Fix 6 (the read-tick stall watchdog) silently stops working: no None ticks are ever yielded, _should_keep_consuming() is never re-evaluated on a silent-but-open stream, and a stalled session hangs indefinitely.
Consider logging a one-shot debug warning (e.g. to sys.stderr or a logger) when sock is None so this failure mode is detectable in production logs rather than appearing as a plain hang.
There was a problem hiding this comment.
Addressed in the follow-up patch as an observability improvement: _set_stream_timeout now emits a one-shot stderr warning if the socket cannot be found or settimeout fails, with regression coverage.
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/codecome/harness.py`:
- Around line 328-362: The stalled-recovery path in the harness can break out
with a success status when the restart budget is exhausted, because
`_recovery_reason` from `session_stalled` does not force a non-success result.
Update the recovery handling in `harness.py` so that the `session_stalled`
branch sets `returncode`/final status to a failure state before the
budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.
In `@tools/codecome/runner.py`:
- Around line 297-308: Returning RunStatus.INCOMPLETE from the
OpenCodeRequestError retriable path can happen while the SSE consumer thread is
still alive, which lets the abandoned session keep running as the harness
retries. Update the retry/error handling in runner.py around the
consumer.join(timeout=5.0) and the OpenCodeRequestError branch so that a
timed-out or still-active consumer is first stopped/drained (or treated as a
terminal failure for this attempt) before any recoverable result is returned.
Use the existing _record_codecome_event and RunStatus/RunResult flow, but ensure
this path does not hand back a fresh retryable state until the consumer has
fully exited.
In `@tools/codecome/session.py`:
- Around line 225-228: Close the HTTP response in create_session() when calling
urllib.request.urlopen so retries don’t leak sockets. Update the request
handling in session.py to use the same context-manager pattern already used by
the status/health helpers, ensuring the response object is always closed after
reading and parsing the JSON.
In `@tools/events/phase_loop.py`:
- Around line 206-207: The dead-process path in phase_loop should be surfaced as
a recoverable outcome instead of falling through as a generic incomplete run.
Update the liveness_check() failure handling in phase_loop.py so a false result
sets a distinct finish reason such as server_unreachable and/or session_stalled
using the existing _last_finish_reason flow, then ensure the runner maps that
finish reason into the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE
recovery branch before harness branching. Use the phase_loop loop logic and the
runner’s finish-reason/status conversion code to keep recovery behavior
consistent.
- Around line 68-75: The stall timeout environment parsing in PhaseEventLoop
should be made resilient so invalid values do not raise during construction.
Update the initialization logic around the _stall_timeout_s and
_heartbeat_stall_timeout_s assignments in PhaseEventLoop to parse with a safe
fallback, matching the behavior used for CODECOME_RESUME_PROBE_TIMEOUT. If
either CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT is
missing or malformed, keep the default timeout value instead of letting float()
fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e79ed5d1-cadf-4f73-989e-1fc313b505ca
📒 Files selected for processing (19)
.project/resume-server-death-resilience-plan.mdREADME.mdtemplates/sandboxes/erlang-otp/docker-compose.ymltests/test_codecome_runner.pytests/test_harness_recovery_restart.pytests/test_runner_resume_health.pytests/test_sandbox_bootstrap.pytests/test_session.pytests/test_sse_busy_resilience.pytools/codecome/harness.pytools/codecome/phase_1.pytools/codecome/runner.pytools/codecome/session.pytools/codecome/status.pytools/events/base.pytools/events/phase_loop.pytools/events/sse_client.pytools/opencode/serve.pytools/phases/completion.py
| _recovery_reason = None | ||
| if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable": | ||
| _recovery_reason = "server_unreachable" | ||
| elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled": | ||
| _recovery_reason = "session_stalled" | ||
|
|
||
| if _recovery_reason is not None: | ||
| if server_restart_count < max_server_restarts: | ||
| server_restart_count += 1 | ||
| out.warn( | ||
| f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} " | ||
| f"CodeCome will restart the opencode server and retry with a fresh session " | ||
| f"(server restart {server_restart_count}/{max_server_restarts})." | ||
| ) | ||
| try: | ||
| server_info = _restart_server(runner, log_level=args.log_level) | ||
| except ServerRunnerError as exc: | ||
| _emit_fatal_error(console, "Server Restart Error", str(exc)) | ||
| return int(RunStatus.ERROR) | ||
| base_url = server_info.base_url | ||
| last_session_id = "" | ||
| prompt = build_phase_resume_prompt( | ||
| str(args.phase), args.finding, | ||
| _recovery_reason, step_finish_count, | ||
| failure_details=phase_failures if phase_failures else None, | ||
| ) | ||
| continue | ||
| last_session_id = session_id or last_session_id | ||
| last_finish_reason = _recovery_reason | ||
| finish_warning = ( | ||
| f"CodeCome attempted {server_restart_count} server restart(s) after a " | ||
| f"'{_recovery_reason}' condition, but the restart budget is exhausted. " | ||
| "The phase cannot continue." | ||
| ) | ||
| break |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Mark stalled recoveries as non-success before the budget-exhaustion break.
session_stalled is detected from run_result, so this branch can run with returncode == RunStatus.OK. If the restart budget is already exhausted, you break here without downgrading the status, and the footer at Line 549 reports the phase as successful and returns 0 after repeated stalls.
🐛 Proposed fix
if _recovery_reason is not None:
+ if returncode == RunStatus.OK:
+ returncode = RunStatus.INCOMPLETE
if server_restart_count < max_server_restarts:
server_restart_count += 1
out.warn(
f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
f"CodeCome will restart the opencode server and retry with a fresh session "📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _recovery_reason = None | |
| if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable": | |
| _recovery_reason = "server_unreachable" | |
| elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled": | |
| _recovery_reason = "session_stalled" | |
| if _recovery_reason is not None: | |
| if server_restart_count < max_server_restarts: | |
| server_restart_count += 1 | |
| out.warn( | |
| f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} " | |
| f"CodeCome will restart the opencode server and retry with a fresh session " | |
| f"(server restart {server_restart_count}/{max_server_restarts})." | |
| ) | |
| try: | |
| server_info = _restart_server(runner, log_level=args.log_level) | |
| except ServerRunnerError as exc: | |
| _emit_fatal_error(console, "Server Restart Error", str(exc)) | |
| return int(RunStatus.ERROR) | |
| base_url = server_info.base_url | |
| last_session_id = "" | |
| prompt = build_phase_resume_prompt( | |
| str(args.phase), args.finding, | |
| _recovery_reason, step_finish_count, | |
| failure_details=phase_failures if phase_failures else None, | |
| ) | |
| continue | |
| last_session_id = session_id or last_session_id | |
| last_finish_reason = _recovery_reason | |
| finish_warning = ( | |
| f"CodeCome attempted {server_restart_count} server restart(s) after a " | |
| f"'{_recovery_reason}' condition, but the restart budget is exhausted. " | |
| "The phase cannot continue." | |
| ) | |
| break | |
| _recovery_reason = None | |
| if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable": | |
| _recovery_reason = "server_unreachable" | |
| elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled": | |
| _recovery_reason = "session_stalled" | |
| if _recovery_reason is not None: | |
| if returncode == RunStatus.OK: | |
| returncode = RunStatus.INCOMPLETE | |
| if server_restart_count < max_server_restarts: | |
| server_restart_count += 1 | |
| out.warn( | |
| f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} " | |
| f"CodeCome will restart the opencode server and retry with a fresh session " | |
| f"(server restart {server_restart_count}/{max_server_restarts})." | |
| ) | |
| try: | |
| server_info = _restart_server(runner, log_level=args.log_level) | |
| except ServerRunnerError as exc: | |
| _emit_fatal_error(console, "Server Restart Error", str(exc)) | |
| return int(RunStatus.ERROR) | |
| base_url = server_info.base_url | |
| last_session_id = "" | |
| prompt = build_phase_resume_prompt( | |
| str(args.phase), args.finding, | |
| _recovery_reason, step_finish_count, | |
| failure_details=phase_failures if phase_failures else None, | |
| ) | |
| continue | |
| last_session_id = session_id or last_session_id | |
| last_finish_reason = _recovery_reason | |
| finish_warning = ( | |
| f"CodeCome attempted {server_restart_count} server restart(s) after a " | |
| f"'{_recovery_reason}' condition, but the restart budget is exhausted. " | |
| "The phase cannot continue." | |
| ) | |
| break |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/codecome/harness.py` around lines 328 - 362, The stalled-recovery path
in the harness can break out with a success status when the restart budget is
exhausted, because `_recovery_reason` from `session_stalled` does not force a
non-success result. Update the recovery handling in `harness.py` so that the
`session_stalled` branch sets `returncode`/final status to a failure state
before the budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.
| if isinstance(exc, OpenCodeRequestError) and exc.retriable: | ||
| _record_codecome_event( | ||
| transcript, | ||
| "codecome.attempt.incomplete", | ||
| errorType=type(exc).__name__, | ||
| message=str(exc), | ||
| existingSession=bool(existing_session_id), | ||
| ) | ||
| return RunStatus.INCOMPLETE, session_id, RunResult( | ||
| last_finish_reason="server_unreachable", | ||
| last_session_id=session_id, | ||
| ), transcript.path |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don't return a recoverable result while the SSE consumer is still alive.
If consumer.join(timeout=5.0) times out, this branch still returns RunStatus.INCOMPLETE. That leaves the daemon consumer free to keep consuming/rendering the abandoned session while the harness starts a fresh attempt, especially if the prompt request timed out after the server had already accepted it.
🐛 Proposed fix
if consumer.is_alive():
_record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id)
+ raise RuntimeError(
+ "event loop thread did not stop after prompt send failure"
+ )
if isinstance(exc, OpenCodeRequestError) and exc.retriable:
_record_codecome_event(
transcript,
"codecome.attempt.incomplete",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if isinstance(exc, OpenCodeRequestError) and exc.retriable: | |
| _record_codecome_event( | |
| transcript, | |
| "codecome.attempt.incomplete", | |
| errorType=type(exc).__name__, | |
| message=str(exc), | |
| existingSession=bool(existing_session_id), | |
| ) | |
| return RunStatus.INCOMPLETE, session_id, RunResult( | |
| last_finish_reason="server_unreachable", | |
| last_session_id=session_id, | |
| ), transcript.path | |
| if consumer.is_alive(): | |
| _record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id) | |
| raise RuntimeError( | |
| "event loop thread did not stop after prompt send failure" | |
| ) | |
| if isinstance(exc, OpenCodeRequestError) and exc.retriable: | |
| _record_codecome_event( | |
| transcript, | |
| "codecome.attempt.incomplete", | |
| errorType=type(exc).__name__, | |
| message=str(exc), | |
| existingSession=bool(existing_session_id), | |
| ) | |
| return RunStatus.INCOMPLETE, session_id, RunResult( | |
| last_finish_reason="server_unreachable", | |
| last_session_id=session_id, | |
| ), transcript.path |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/codecome/runner.py` around lines 297 - 308, Returning
RunStatus.INCOMPLETE from the OpenCodeRequestError retriable path can happen
while the SSE consumer thread is still alive, which lets the abandoned session
keep running as the harness retries. Update the retry/error handling in
runner.py around the consumer.join(timeout=5.0) and the OpenCodeRequestError
branch so that a timed-out or still-active consumer is first stopped/drained (or
treated as a terminal failure for this attempt) before any recoverable result is
returned. Use the existing _record_codecome_event and RunStatus/RunResult flow,
but ensure this path does not hand back a fresh retryable state until the
consumer has fully exited.
| try: | ||
| resp = urllib.request.urlopen(req, timeout=10.0) | ||
| data = json.loads(resp.read().decode("utf-8")) | ||
| break |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Close the session-create HTTP response.
create_session() reads from urlopen() without closing the response; repeated retries/runs can leak sockets. Use the same context-manager pattern used by the status/health calls.
Proposed fix
- resp = urllib.request.urlopen(req, timeout=10.0)
- data = json.loads(resp.read().decode("utf-8"))
+ with urllib.request.urlopen(req, timeout=10.0) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| resp = urllib.request.urlopen(req, timeout=10.0) | |
| data = json.loads(resp.read().decode("utf-8")) | |
| break | |
| try: | |
| with urllib.request.urlopen(req, timeout=10.0) as resp: | |
| data = json.loads(resp.read().decode("utf-8")) | |
| break |
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 225-225: Request-controlled URL passed to urlopen; validate against an allowlist to prevent SSRF.
Context: urllib.request.urlopen(req, timeout=10.0)
Note: [CWE-918] Server-Side Request Forgery (SSRF).
(urlopen-unsanitized-data)
🪛 Ruff (0.15.20)
[error] 226-226: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/codecome/session.py` around lines 225 - 228, Close the HTTP response in
create_session() when calling urllib.request.urlopen so retries don’t leak
sockets. Update the request handling in session.py to use the same
context-manager pattern already used by the status/health helpers, ensuring the
response object is always closed after reading and parsing the JSON.
| self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) | ||
| # Optional diagnostic signal: opencode may pause server.heartbeat during | ||
| # valid long turns, so heartbeat-gap stalls are disabled by default. When | ||
| # explicitly enabled, they can flag runtime blockage sooner than the | ||
| # primary no-progress timeout. | ||
| self._heartbeat_stall_timeout_s = float( | ||
| os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Guard stall timeout env parsing.
Invalid CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT values currently raise during PhaseEventLoop construction, while CODECOME_RESUME_PROBE_TIMEOUT already falls back safely.
Proposed fix
- self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+ try:
+ self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+ except (TypeError, ValueError):
+ self._stall_timeout_s = 180.0
@@
- self._heartbeat_stall_timeout_s = float(
- os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
- )
+ try:
+ self._heartbeat_stall_timeout_s = float(
+ os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
+ )
+ except (TypeError, ValueError):
+ self._heartbeat_stall_timeout_s = 0.0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) | |
| # Optional diagnostic signal: opencode may pause server.heartbeat during | |
| # valid long turns, so heartbeat-gap stalls are disabled by default. When | |
| # explicitly enabled, they can flag runtime blockage sooner than the | |
| # primary no-progress timeout. | |
| self._heartbeat_stall_timeout_s = float( | |
| os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") | |
| ) | |
| try: | |
| self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) | |
| except (TypeError, ValueError): | |
| self._stall_timeout_s = 180.0 | |
| # Optional diagnostic signal: opencode may pause server.heartbeat during | |
| # valid long turns, so heartbeat-gap stalls are disabled by default. When | |
| # explicitly enabled, they can flag runtime blockage sooner than the | |
| # primary no-progress timeout. | |
| try: | |
| self._heartbeat_stall_timeout_s = float( | |
| os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") | |
| ) | |
| except (TypeError, ValueError): | |
| self._heartbeat_stall_timeout_s = 0.0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/events/phase_loop.py` around lines 68 - 75, The stall timeout
environment parsing in PhaseEventLoop should be made resilient so invalid values
do not raise during construction. Update the initialization logic around the
_stall_timeout_s and _heartbeat_stall_timeout_s assignments in PhaseEventLoop to
parse with a safe fallback, matching the behavior used for
CODECOME_RESUME_PROBE_TIMEOUT. If either CODECOME_BUSY_STALL_TIMEOUT or
CODECOME_HEARTBEAT_STALL_TIMEOUT is missing or malformed, keep the default
timeout value instead of letting float() fail.
| if self._session_stalled and _last_finish_reason not in ("stop", "idle"): | ||
| _last_finish_reason = "session_stalled" |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Propagate dead-process detection as a recoverable result.
When liveness_check() returns False, the loop stops without setting session_stalled or a server_unreachable finish reason. The supplied runner/harness recovery paths only restart on those recoverable signals, so a mid-run server death can fall through as a generic incomplete run instead of triggering auto-recovery.
Possible direction
self._session_stalled = False
self._session_idle_via_status = False
+ self._server_unreachable = False
@@
if self._session_stalled and _last_finish_reason not in ("stop", "idle"):
_last_finish_reason = "session_stalled"
+ if self._server_unreachable and _last_finish_reason not in ("stop", "idle"):
+ _last_finish_reason = "server_unreachable"
@@
try:
- return bool(self._liveness_check())
+ alive = bool(self._liveness_check())
+ if not alive:
+ self._server_unreachable = True
+ return alive
except Exception: # noqa: BLE001
+ self._server_unreachable = True
return FalseAlso ensure the runner maps this finish reason to the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery path before harness branching.
Also applies to: 287-290
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/events/phase_loop.py` around lines 206 - 207, The dead-process path in
phase_loop should be surfaced as a recoverable outcome instead of falling
through as a generic incomplete run. Update the liveness_check() failure
handling in phase_loop.py so a false result sets a distinct finish reason such
as server_unreachable and/or session_stalled using the existing
_last_finish_reason flow, then ensure the runner maps that finish reason into
the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery branch before
harness branching. Use the phase_loop loop logic and the runner’s
finish-reason/status conversion code to keep recovery behavior consistent.
Summary
Validation
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests