Recover from opencode stalls and control-plane timeouts by pruiz · Pull Request #66 · pruiz/CodeCome

pruiz · 2026-06-30T13:45:50Z

Summary

add structured run statuses and recoverable opencode server/session stall handling
make Phase 1 retry only the failed subphase after server restarts
harden SSE busy/idle handling and resume/session health checks
redirect Erlang crash dumps into ignored tmp/ so generated dumps do not become opencode patch candidates

Validation

rtk make tests
env BOOTSTRAP_ARGS='--keep-going' rtk make sandbox-validate
rtk git diff --check

Summary by CodeRabbit

New Features
- Added resilience and recovery controls for handling server unavailability, session stalls, and automatic retries.
- Improved resume behavior with clearer recovery paths and restart budgeting.
- Added a new crash-dump location for Erlang sandboxes.
Bug Fixes
- Better handling of timeouts during session creation and prompt sending.
- Improved detection of stalled or unreachable sessions, with safer recovery and reporting.
Documentation
- Expanded the README with environment variables for restart, retry, timeout, and health-check behavior.
Tests
- Added coverage for resume health, reconnect behavior, recovery restarts, and sandbox crash-dump handling.

coderabbitai · 2026-06-30T13:46:05Z

📝 Walkthrough

Walkthrough

Implements end-to-end resilience for opencode serve server deaths and session stalls. Adds RunStatus enum, OpenCodeRequestError with retriable metadata, session HTTP retry/health probing, SSE per-read-tick watchdog, PhaseEventLoop busy/stall tracking, ServerRunner liveness check and restart, runner server-unreachable outcome mapping, Phase1Outcome with start_at reentry, and a harness auto-recovery restart loop bounded by CODECOME_MAX_SERVER_RESTARTS.

Changes

Resume Server-Death Resilience

Layer / File(s)	Summary
RunStatus enum and OpenCodeRequestError contract `tools/codecome/status.py`, `tools/codecome/session.py`	`RunStatus` IntEnum with `normalize_status` and `OpenCodeRequestError` with `retriable`/`operation` fields establish foundational types for all subsequent layers.
Session HTTP retry, health probe, and OpenCodeRequestError propagation `tools/codecome/session.py`, `tests/test_session.py`	`create_session` and `send_prompt_to_session` retry transient failures and raise `OpenCodeRequestError`; `get_session_status` gains a `timeout` param; `is_server_healthy` added to probe `/global/health`; retry and exhaustion behavior tested.
SSE client read-tick, reconnect budget, and heartbeat bookkeeping `tools/events/sse_client.py`, `tests/test_sse_busy_resilience.py`	Replaces fixed heartbeat-timeout error with configurable per-read socket timeout yielding `None` ticks; adds `should_continue` predicate and `_reconnect_budget_exhausted` for budget extension; adds `seconds_since_heartbeat()`; read-tick, reconnect, and heartbeat tests added.
PhaseEventLoop busy tracking, stall watchdog, and idle-via-status sync `tools/events/base.py`, `tools/events/phase_loop.py`, `tests/test_sse_busy_resilience.py`	Adds `session_stalled` to `RunResult`; introduces `_track_session_busy`, `_note_progress`, `_stalled`, `_should_keep_consuming`, and `_status_probe_reports_idle`; post-loop branches sync idle confirmation or force `session_stalled`; `_fetch_session_status` added to base; stall watchdog and integration tests added.
ServerRunner liveness check, restart, and exit monitor `tools/opencode/serve.py`, `tests/test_runner_resume_health.py`	`make_liveness_check` returns a `proc.poll()`-based callable; `ServerRunner.restart()` wraps stop/start; `_start_exit_monitor` daemon thread appends exit-code line to server log; liveness check correctness tested.
Runner liveness threading and server-unreachable outcomes `tools/codecome/runner.py`, `tests/test_runner_resume_health.py`, `tests/test_codecome_runner.py`	Adds `ResumeSessionServerUnreachable`; threads `liveness_check` through `_consume_events` and `PhaseEventLoop`; reworks `_wait_for_resume_idle` to branch on liveness vs health; handles `OpenCodeRequestError` in `create_session`/`send_prompt` paths returning `RunStatus.INCOMPLETE`; converts final returns to `RunStatus` values; resume health and session-timeout tests added.
Phase 1 RunStatus migration and Phase1Outcome `tools/codecome/phase_1.py`	Adds `Phase1Outcome` dataclass with `failed_subphase` and `start_at` parameter to `run_phase_1`; migrates all numeric sentinel returns to `RunStatus`; retry classification translates `INCOMPLETE+last_finish_reason` into `SERVER_UNREACHABLE`/`SESSION_STALLED`; liveness check wired into attempts.
Harness auto-recovery loop and resume prompt openers `tools/codecome/harness.py`, `tools/phases/completion.py`, `tests/test_harness_recovery_restart.py`	Adds `_restart_server` helper and `_RECOVERABLE_RESTART_REASONS`; wraps Phase 1 in restart loop with `start_at` reentry; adds same bounded loop for phases 2–6 with session reset and resume prompt rebuild; `_resume_opener_for_reason` extended for `server_unreachable`/`session_stalled`; restart count and budget exhaustion tested.
Design plan, README env vars, and Erlang crash dump config `.project/resume-server-death-resilience-plan.md`, `README.md`, `templates/sandboxes/erlang-otp/docker-compose.yml`, `tests/test_sandbox_bootstrap.py`	Full resilience design plan document added; README documents all new environment variables; `ERL_CRASH_DUMP` set in Erlang sandbox docker-compose and verified by test.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Refactor CodeCome phase execution into stateful phase classes #56: The changes directly implement the server-resilience and phase-reentry behavior proposed in that issue, including structured outcomes and restarting only the failed Phase 1 subphase.

Possibly related PRs

pruiz/CodeCome#43: This PR extends the _wait_for_resume_idle / ResumeSessionNotReady mechanics introduced in PR #43, adding server-unreachable detection on top of the existing resume-not-ready guard.
pruiz/CodeCome#64: Both PRs modify _resume_opener_for_reason() to add new reason branches for different incomplete finish causes.

Poem

🐇 Hop, hop — the server fell down,
But I sniffed its pulse before it could drown!
A liveness tick, a restart or two,
The stall watchdog barked and the session came through.
Phase 1 restarts right where it went astray,
No lost transcript — a resilient day! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.78% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding recovery for opencode stalls and control-plane timeouts.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/sse-stall-server-resilience

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-06-30T13:48:06Z

Coverage Report

Metric	Value
Line Coverage	76.9%
Lines Covered	0 / 0

Download detailed HTML coverage reports per OS/Python from the workflow artifacts.

Generated by pytest-cov on 2026-06-30T13:48:06.241Z

Copilot

Pull request overview

This PR improves CodeCome’s resilience to opencode server crashes, control-plane timeouts, and hung busy turns by introducing structured run statuses, better SSE liveness handling, and automated server restart/retry flows (including resuming Phase 1 from the failed subphase only).

Changes:

Adds RunStatus codes and propagates server_unreachable / session_stalled as first-class recoverable conditions throughout the runner/harness/Phase 1 orchestration.
Hardens SSE consumption so long busy turns aren’t abandoned (reconnect past budget while busy+alive) while still bounding genuine hangs (stall watchdog + read-tick inactivity).
Improves operational diagnostics and workspace hygiene (server exit-code logging; Erlang crash dump redirected into ignored tmp/; documents new env knobs).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tools/phases/completion.py	Adds resume opener messaging for `server_unreachable` and `session_stalled`.
tools/opencode/serve.py	Adds process-handle liveness checks, `restart()`, and a daemon exit-code log monitor.
tools/events/sse_client.py	Adds read-tick inactivity handling and `should_continue` to avoid abandoning busy turns.
tools/events/phase_loop.py	Adds busy/idle tracking, stall watchdog, status-probe idle recovery, and stall propagation in `RunResult`.
tools/events/base.py	Adds best-effort `/session/status` probe helper (absence ⇒ idle).
tools/codecome/status.py	Introduces shared `RunStatus` IntEnum and normalization helper.
tools/codecome/session.py	Adds structured `OpenCodeRequestError`, retries for session creation, and health probing.
tools/codecome/runner.py	Threads liveness checks through resume wait + event consumption; maps retriable HTTP failures to recoverable outcomes.
tools/codecome/phase_1.py	Returns structured Phase 1 outcome and enables retry-from-failed-subphase after server restart.
tools/codecome/harness.py	Centralizes restart+retry logic for recoverable conditions; adds restart budgets for Phase 1 and phases 2–6.
tests/test_sse_busy_resilience.py	Adds coverage for reconnect budgeting, read-tick behavior, stall watchdog, and idle recovery sync.
tests/test_session.py	Adds coverage for create-session retry behavior and retriable request errors.
tests/test_sandbox_bootstrap.py	Verifies Erlang crash dump is redirected to ignored `tmp/`.
tests/test_runner_resume_health.py	Tests resume readiness logic with health/liveness probes and stall propagation mapping.
tests/test_harness_recovery_restart.py	Tests harness restart/retry behavior and shared restart budget exhaustion.
tests/test_codecome_runner.py	Updates runner tests for recoverable prompt/session failures using `OpenCodeRequestError`.
templates/sandboxes/erlang-otp/docker-compose.yml	Sets `ERL_CRASH_DUMP` to `/workspace/tmp/erl_crash.dump`.
README.md	Documents new resilience/recovery environment variables and behavior.
.project/resume-server-death-resilience-plan.md	Adds implementation plan/design notes for the resilience work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pruiz · 2026-06-30T14:31:39Z

        with pytest.raises(RuntimeError, match="empty session ID"):
            module.create_session("http://localhost:8080", "1", "recon", None, None, None)



Addressed in the follow-up patch: the empty-session test now expects OpenCodeRequestError and asserts retriable=False plus operation="create_session". Validated with targeted tests and make tests.

pruiz · 2026-06-30T14:31:40Z

+        self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+        # Optional diagnostic signal: opencode may pause server.heartbeat during
+        # valid long turns, so heartbeat-gap stalls are disabled by default. When
+        # explicitly enabled, they can flag runtime blockage sooner than the
+        # primary no-progress timeout.
+        self._heartbeat_stall_timeout_s = float(
+            os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
+        )


Addressed in the follow-up patch: CODECOME_BUSY_STALL_TIMEOUT and CODECOME_HEARTBEAT_STALL_TIMEOUT now use safe float parsing with defaults, matching the defensive behavior used for the probe timeout. Added regression coverage for malformed values.

greptile-apps · 2026-06-30T13:51:53Z

Greptile Summary

This PR hardens the phase-mode execution loop against two real-world opencode failure modes — server crashes and hung model turns — by adding structured recovery paths throughout the harness, phase orchestration, and SSE client layers.

Server-death recovery: _wait_for_resume_idle now uses a process-handle liveness check (make_liveness_check) to distinguish a dead server from a temporarily unresponsive control plane; ServerRunner gains restart() and a daemon exit-monitor thread; harness.py and phase_1.py restart the server on SERVER_UNREACHABLE and re-enter Phase 1 at the failed subphase.
Stall watchdog: PhaseEventLoop tracks progress events and declares a session stalled after CODECOME_BUSY_STALL_TIMEOUT (default 180 s) of silence; a read-tick mechanism (Fix 6) makes the watchdog reachable even on a silent-but-open SSE connection where the opencode runtime has blocked its scheduler.
RunStatus enum replaces magic integer codes 3+ with named values SERVER_UNREACHABLE, SESSION_STALLED; shared restart budget (CODECOME_MAX_SERVER_RESTARTS) covers both conditions uniformly.

Confidence Score: 4/5

Safe to merge; all new recovery paths are protected by retry budgets and the design correctly delegates server lifecycle exclusively to the harness.

The changes are well-reasoned, thoroughly tested, and handle real production failures documented with PIDs and timestamps. The one notable imperfection is in _wait_for_resume_idle: when the server process is alive but its status endpoint is unavailable, the code records both a specific status_unavailable_process_alive event and a redundant blocked_unknown event in the same poll cycle, which could mislead transcript analysis tooling. The silent-fallback in _set_stream_timeout (can't find the socket → read-tick watchdog silently doesn't work) is another minor observability gap. Neither affects correctness of the primary recovery flows.

tools/codecome/runner.py — the _wait_for_resume_idle process-alive branch falls through to record a redundant blocked_unknown transcript event.

Important Files Changed

Filename	Overview
tools/codecome/runner.py	Adds ResumeSessionServerUnreachable, liveness-aware _wait_for_resume_idle with consecutive-None fast-fail, and session_stalled tracking; minor redundant event recording when process is alive but status endpoint returns None.
tools/events/sse_client.py	Adds should_continue callback, read-tick mechanism via best-effort socket timeout, and reconnect-budget override; all new paths are well-guarded with exception handling.
tools/events/phase_loop.py	Adds stall watchdog (_note_progress, _stalled, _should_keep_consuming), session-idle-via-status recovery path, and session_stalled propagation through RunResult; logic is correct and well-tested.
tools/opencode/serve.py	Adds _start_exit_monitor daemon thread for exit-code logging, restart() convenience method, and make_liveness_check process-handle check; clean and correct.
tools/codecome/harness.py	Adds _RECOVERABLE_RESTART_REASONS map, _restart_server helper, and unified recovery loops for server_unreachable and session_stalled in both Phase 1 and Phases 2-6; de-duplicated restart logic is clean.
tools/codecome/phase_1.py	Introduces Phase1Outcome structured return, SERVER_UNREACHABLE/SESSION_STALLED propagation from _run_subphase, and start_at parameter for mid-phase restart re-entry; correctly delegates server lifecycle to harness.
tools/codecome/status.py	Adds RunStatus IntEnum with OK/ERROR/INCOMPLETE/SERVER_UNREACHABLE/SESSION_STALLED/INTERRUPTED and normalize_status helper; replaces magic return code 3.
tests/test_sse_busy_resilience.py	Comprehensive test coverage: reconnect budget, should_continue, stall watchdog truth table, read-tick interaction, heartbeat secondary signal, and idle-via-status recovery path.
tools/codecome/session.py	Adds is_server_healthy helper used as fallback in _wait_for_resume_idle when no liveness_check is provided; get_session_status gains an explicit timeout parameter.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant H as Harness
    participant P1 as run_phase_1
    participant SP as _run_subphase
    participant RA as _run_single_attempt
    participant PEL as PhaseEventLoop
    participant SSE as SseClient
    participant OC as opencode serve

    H->>P1: "run_phase_1(start_at=1a)"
    P1->>SP: "_run_subphase(phase_id=1a)"
    SP->>RA: _run_single_attempt(liveness_check)
    RA->>OC: create_session / send_prompt
    RA->>PEL: _consume_events(liveness_check)
    PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
    loop SSE stream
        SSE->>OC: GET /event readline with socket timeout 10s
        alt event arrives
            OC-->>SSE: SSE event
            SSE-->>PEL: yield event
            PEL->>PEL: _note_progress / _track_session_busy
        else silent tick socket.timeout
            SSE-->>PEL: yield None
            PEL->>PEL: _should_keep_consuming stall watchdog
        end
        alt session idle
            PEL-->>RA: "RunResult session_stalled=False"
        else stall detected 180s no progress
            PEL->>PEL: _status_probe_reports_idle
            alt REST says idle
                PEL-->>RA: "RunResult idle_via_status=True"
            else still busy
                PEL-->>RA: "RunResult session_stalled=True"
            end
        else server process died
            PEL->>PEL: liveness_check False stop
            PEL-->>RA: "RunResult finish=server_unreachable"
        end
    end
    RA-->>SP: RunStatus run_result
    SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
    P1-->>H: Phase1Outcome
    alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
        H->>H: runner.restart
        H->>P1: "run_phase_1 start_at=failed_subphase"
    else
        H-->>H: return exit code
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant H as Harness
    participant P1 as run_phase_1
    participant SP as _run_subphase
    participant RA as _run_single_attempt
    participant PEL as PhaseEventLoop
    participant SSE as SseClient
    participant OC as opencode serve

    H->>P1: "run_phase_1(start_at=1a)"
    P1->>SP: "_run_subphase(phase_id=1a)"
    SP->>RA: _run_single_attempt(liveness_check)
    RA->>OC: create_session / send_prompt
    RA->>PEL: _consume_events(liveness_check)
    PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
    loop SSE stream
        SSE->>OC: GET /event readline with socket timeout 10s
        alt event arrives
            OC-->>SSE: SSE event
            SSE-->>PEL: yield event
            PEL->>PEL: _note_progress / _track_session_busy
        else silent tick socket.timeout
            SSE-->>PEL: yield None
            PEL->>PEL: _should_keep_consuming stall watchdog
        end
        alt session idle
            PEL-->>RA: "RunResult session_stalled=False"
        else stall detected 180s no progress
            PEL->>PEL: _status_probe_reports_idle
            alt REST says idle
                PEL-->>RA: "RunResult idle_via_status=True"
            else still busy
                PEL-->>RA: "RunResult session_stalled=True"
            end
        else server process died
            PEL->>PEL: liveness_check False stop
            PEL-->>RA: "RunResult finish=server_unreachable"
        end
    end
    RA-->>SP: RunStatus run_result
    SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
    P1-->>H: Phase1Outcome
    alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
        H->>H: runner.restart
        H->>P1: "run_phase_1 start_at=failed_subphase"
    else
        H-->>H: return exit code
    end

_{Reviews (1): Last reviewed commit: "fix: redirect erlang crash dumps to igno..." | Re-trigger Greptile}

greptile-apps · 2026-06-30T13:51:56Z

+            if liveness_check is not None:
+                if liveness_check():
+                    consecutive_unavailable = 0
+                    _record_codecome_event(
+                        transcript,
+                        "codecome.resume.status_unavailable_process_alive",
+                        sessionID=session_id,
+                    )
+                else:


Redundant blocked_unknown event when process is alive. When liveness_check() returns True and status is None, the code records status_unavailable_process_alive but then falls through unconditionally to also record blocked_unknown (since status != "busy"). This writes two semantically conflicting events for the same poll cycle — the second one is misleading because we do know the reason. Adding continue after the alive-branch avoids the duplication.

Suggested change

if liveness_check is not None:

if liveness_check():

consecutive_unavailable = 0

_record_codecome_event(

transcript,

"codecome.resume.status_unavailable_process_alive",

sessionID=session_id,

)

else:

if liveness_check is not None:

if liveness_check():

consecutive_unavailable = 0

_record_codecome_event(

transcript,

"codecome.resume.status_unavailable_process_alive",

sessionID=session_id,

)

if time.monotonic() >= deadline:

_record_codecome_event(

transcript,

"codecome.resume.timeout",

sessionID=session_id,

status=None,

timeoutSeconds=timeout_s,

)

raise ResumeSessionNotReady(

f"session {session_id} status endpoint unavailable (process alive) "

f"after {timeout_s:g}s; refusing to send resume prompt"

)

time.sleep(max(poll_s, 0.1))

continue

else:

Addressed in the follow-up patch: the live-process/status-unavailable branch now records the specific event, handles timeout/sleep, and continues without falling through to blocked_unknown. Added a regression assertion for this event stream.

greptile-apps · 2026-06-30T13:51:57Z

+    def _set_stream_timeout(resp: Any, timeout: float) -> None:
+        """Best-effort: set a per-read socket timeout on the response stream."""
+        fp = getattr(resp, "fp", None)
+        sock = None
+        if fp is not None:
+            raw = getattr(fp, "raw", None)
+            sock = getattr(raw, "_sock", None) or getattr(fp, "_sock", None)
+        if sock is None:
+            sock = getattr(resp, "_sock", None)
+        if sock is not None and hasattr(sock, "settimeout"):
+            try:
+                sock.settimeout(timeout)
+            except Exception:  # noqa: BLE001
+                pass


_set_stream_timeout silently no-ops on failure with no observable signal

If the socket attribute path (resp.fp.raw._sock, resp.fp._sock, or resp._sock) can't be resolved — which can happen across Python versions, proxied connections, or alternate urllib implementations — the timeout is never set and resp.readline() blocks forever. When that happens, Fix 6 (the read-tick stall watchdog) silently stops working: no None ticks are ever yielded, _should_keep_consuming() is never re-evaluated on a silent-but-open stream, and a stalled session hangs indefinitely.

Consider logging a one-shot debug warning (e.g. to sys.stderr or a logger) when sock is None so this failure mode is detectable in production logs rather than appearing as a plain hang.

Addressed in the follow-up patch as an observability improvement: _set_stream_timeout now emits a one-shot stderr warning if the socket cannot be found or settimeout fails, with regression coverage.

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/codecome/harness.py`:
- Around line 328-362: The stalled-recovery path in the harness can break out
with a success status when the restart budget is exhausted, because
`_recovery_reason` from `session_stalled` does not force a non-success result.
Update the recovery handling in `harness.py` so that the `session_stalled`
branch sets `returncode`/final status to a failure state before the
budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.

In `@tools/codecome/runner.py`:
- Around line 297-308: Returning RunStatus.INCOMPLETE from the
OpenCodeRequestError retriable path can happen while the SSE consumer thread is
still alive, which lets the abandoned session keep running as the harness
retries. Update the retry/error handling in runner.py around the
consumer.join(timeout=5.0) and the OpenCodeRequestError branch so that a
timed-out or still-active consumer is first stopped/drained (or treated as a
terminal failure for this attempt) before any recoverable result is returned.
Use the existing _record_codecome_event and RunStatus/RunResult flow, but ensure
this path does not hand back a fresh retryable state until the consumer has
fully exited.

In `@tools/codecome/session.py`:
- Around line 225-228: Close the HTTP response in create_session() when calling
urllib.request.urlopen so retries don’t leak sockets. Update the request
handling in session.py to use the same context-manager pattern already used by
the status/health helpers, ensuring the response object is always closed after
reading and parsing the JSON.

In `@tools/events/phase_loop.py`:
- Around line 206-207: The dead-process path in phase_loop should be surfaced as
a recoverable outcome instead of falling through as a generic incomplete run.
Update the liveness_check() failure handling in phase_loop.py so a false result
sets a distinct finish reason such as server_unreachable and/or session_stalled
using the existing _last_finish_reason flow, then ensure the runner maps that
finish reason into the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE
recovery branch before harness branching. Use the phase_loop loop logic and the
runner’s finish-reason/status conversion code to keep recovery behavior
consistent.
- Around line 68-75: The stall timeout environment parsing in PhaseEventLoop
should be made resilient so invalid values do not raise during construction.
Update the initialization logic around the _stall_timeout_s and
_heartbeat_stall_timeout_s assignments in PhaseEventLoop to parse with a safe
fallback, matching the behavior used for CODECOME_RESUME_PROBE_TIMEOUT. If
either CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT is
missing or malformed, keep the default timeout value instead of letting float()
fail.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e79ed5d1-cadf-4f73-989e-1fc313b505ca

📥 Commits

Reviewing files that changed from the base of the PR and between 3855c46 and 0b13245.

📒 Files selected for processing (19)

.project/resume-server-death-resilience-plan.md
README.md
templates/sandboxes/erlang-otp/docker-compose.yml
tests/test_codecome_runner.py
tests/test_harness_recovery_restart.py
tests/test_runner_resume_health.py
tests/test_sandbox_bootstrap.py
tests/test_session.py
tests/test_sse_busy_resilience.py
tools/codecome/harness.py
tools/codecome/phase_1.py
tools/codecome/runner.py
tools/codecome/session.py
tools/codecome/status.py
tools/events/base.py
tools/events/phase_loop.py
tools/events/sse_client.py
tools/opencode/serve.py
tools/phases/completion.py

coderabbitai · 2026-06-30T13:57:13Z

+            _recovery_reason = None
+            if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":
+                _recovery_reason = "server_unreachable"
+            elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":
+                _recovery_reason = "session_stalled"
+
+            if _recovery_reason is not None:
+                if server_restart_count < max_server_restarts:
+                    server_restart_count += 1
+                    out.warn(
+                        f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
+                        f"CodeCome will restart the opencode server and retry with a fresh session "
+                        f"(server restart {server_restart_count}/{max_server_restarts})."
+                    )
+                    try:
+                        server_info = _restart_server(runner, log_level=args.log_level)
+                    except ServerRunnerError as exc:
+                        _emit_fatal_error(console, "Server Restart Error", str(exc))
+                        return int(RunStatus.ERROR)
+                    base_url = server_info.base_url
+                    last_session_id = ""
+                    prompt = build_phase_resume_prompt(
+                        str(args.phase), args.finding,
+                        _recovery_reason, step_finish_count,
+                        failure_details=phase_failures if phase_failures else None,
+                    )
+                    continue
+                last_session_id = session_id or last_session_id
+                last_finish_reason = _recovery_reason
+                finish_warning = (
+                    f"CodeCome attempted {server_restart_count} server restart(s) after a "
+                    f"'{_recovery_reason}' condition, but the restart budget is exhausted. "
+                    "The phase cannot continue."
+                )
+                break


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Mark stalled recoveries as non-success before the budget-exhaustion break.

session_stalled is detected from run_result, so this branch can run with returncode == RunStatus.OK. If the restart budget is already exhausted, you break here without downgrading the status, and the footer at Line 549 reports the phase as successful and returns 0 after repeated stalls.

🐛 Proposed fix

if _recovery_reason is not None: + if returncode == RunStatus.OK: + returncode = RunStatus.INCOMPLETE if server_restart_count < max_server_restarts: server_restart_count += 1 out.warn( f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} " f"CodeCome will restart the opencode server and retry with a fresh session "

Please add a regression that exhausts the restart budget using only `session_stalled` results.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

_recovery_reason = None

if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":

_recovery_reason = "server_unreachable"

elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":

_recovery_reason = "session_stalled"

if _recovery_reason is not None:

if server_restart_count < max_server_restarts:

server_restart_count += 1

out.warn(

f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "

f"CodeCome will restart the opencode server and retry with a fresh session "

f"(server restart {server_restart_count}/{max_server_restarts})."

)

try:

server_info = _restart_server(runner, log_level=args.log_level)

except ServerRunnerError as exc:

_emit_fatal_error(console, "Server Restart Error", str(exc))

return int(RunStatus.ERROR)

base_url = server_info.base_url

last_session_id = ""

prompt = build_phase_resume_prompt(

str(args.phase), args.finding,

_recovery_reason, step_finish_count,

failure_details=phase_failures if phase_failures else None,

)

continue

last_session_id = session_id or last_session_id

last_finish_reason = _recovery_reason

finish_warning = (

f"CodeCome attempted {server_restart_count} server restart(s) after a "

f"'{_recovery_reason}' condition, but the restart budget is exhausted. "

"The phase cannot continue."

)

break

_recovery_reason = None

if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":

_recovery_reason = "server_unreachable"

elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":

_recovery_reason = "session_stalled"

if _recovery_reason is not None:

if returncode == RunStatus.OK:

returncode = RunStatus.INCOMPLETE

if server_restart_count < max_server_restarts:

server_restart_count += 1

out.warn(

f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "

f"CodeCome will restart the opencode server and retry with a fresh session "

f"(server restart {server_restart_count}/{max_server_restarts})."

)

try:

server_info = _restart_server(runner, log_level=args.log_level)

except ServerRunnerError as exc:

_emit_fatal_error(console, "Server Restart Error", str(exc))

return int(RunStatus.ERROR)

base_url = server_info.base_url

last_session_id = ""

prompt = build_phase_resume_prompt(

str(args.phase), args.finding,

_recovery_reason, step_finish_count,

failure_details=phase_failures if phase_failures else None,

)

continue

last_session_id = session_id or last_session_id

last_finish_reason = _recovery_reason

finish_warning = (

f"CodeCome attempted {server_restart_count} server restart(s) after a "

f"'{_recovery_reason}' condition, but the restart budget is exhausted. "

"The phase cannot continue."

)

break

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/codecome/harness.py` around lines 328 - 362, The stalled-recovery path in the harness can break out with a success status when the restart budget is exhausted, because `_recovery_reason` from `session_stalled` does not force a non-success result. Update the recovery handling in `harness.py` so that the `session_stalled` branch sets `returncode`/final status to a failure state before the budget-exhaustion `break`, and ensure the footer/exit logic driven by `finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in this case. Add a regression around the restart loop in the main run path that exhausts the budget using only `session_stalled` outcomes and verifies a non-zero exit.

coderabbitai · 2026-06-30T13:57:14Z

+            if isinstance(exc, OpenCodeRequestError) and exc.retriable:
+                _record_codecome_event(
+                    transcript,
+                    "codecome.attempt.incomplete",
+                    errorType=type(exc).__name__,
+                    message=str(exc),
+                    existingSession=bool(existing_session_id),
+                )
+                return RunStatus.INCOMPLETE, session_id, RunResult(
+                    last_finish_reason="server_unreachable",
+                    last_session_id=session_id,
+                ), transcript.path


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't return a recoverable result while the SSE consumer is still alive.

If consumer.join(timeout=5.0) times out, this branch still returns RunStatus.INCOMPLETE. That leaves the daemon consumer free to keep consuming/rendering the abandoned session while the harness starts a fresh attempt, especially if the prompt request timed out after the server had already accepted it.

🐛 Proposed fix

if consumer.is_alive(): _record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id) + raise RuntimeError( + "event loop thread did not stop after prompt send failure" + ) if isinstance(exc, OpenCodeRequestError) and exc.retriable: _record_codecome_event( transcript, "codecome.attempt.incomplete",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if isinstance(exc, OpenCodeRequestError) and exc.retriable:

_record_codecome_event(

transcript,

"codecome.attempt.incomplete",

errorType=type(exc).__name__,

message=str(exc),

existingSession=bool(existing_session_id),

)

return RunStatus.INCOMPLETE, session_id, RunResult(

last_finish_reason="server_unreachable",

last_session_id=session_id,

), transcript.path

if consumer.is_alive():

_record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id)

raise RuntimeError(

"event loop thread did not stop after prompt send failure"

)

if isinstance(exc, OpenCodeRequestError) and exc.retriable:

_record_codecome_event(

transcript,

"codecome.attempt.incomplete",

errorType=type(exc).__name__,

message=str(exc),

existingSession=bool(existing_session_id),

)

return RunStatus.INCOMPLETE, session_id, RunResult(

last_finish_reason="server_unreachable",

last_session_id=session_id,

), transcript.path

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/codecome/runner.py` around lines 297 - 308, Returning RunStatus.INCOMPLETE from the OpenCodeRequestError retriable path can happen while the SSE consumer thread is still alive, which lets the abandoned session keep running as the harness retries. Update the retry/error handling in runner.py around the consumer.join(timeout=5.0) and the OpenCodeRequestError branch so that a timed-out or still-active consumer is first stopped/drained (or treated as a terminal failure for this attempt) before any recoverable result is returned. Use the existing _record_codecome_event and RunStatus/RunResult flow, but ensure this path does not hand back a fresh retryable state until the consumer has fully exited.

coderabbitai · 2026-06-30T13:57:14Z

+        try:
+            resp = urllib.request.urlopen(req, timeout=10.0)
+            data = json.loads(resp.read().decode("utf-8"))
+            break


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the session-create HTTP response.

create_session() reads from urlopen() without closing the response; repeated retries/runs can leak sockets. Use the same context-manager pattern used by the status/health calls.

Proposed fix

- resp = urllib.request.urlopen(req, timeout=10.0) - data = json.loads(resp.read().decode("utf-8")) + with urllib.request.urlopen(req, timeout=10.0) as resp: + data = json.loads(resp.read().decode("utf-8")) break

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

resp = urllib.request.urlopen(req, timeout=10.0)

data = json.loads(resp.read().decode("utf-8"))

break

try:

with urllib.request.urlopen(req, timeout=10.0) as resp:

data = json.loads(resp.read().decode("utf-8"))

break

🧰 Tools

🪛 ast-grep (0.44.0)

[warning] 225-225: Request-controlled URL passed to urlopen; validate against an allowlist to prevent SSRF.
Context: urllib.request.urlopen(req, timeout=10.0)
Note: [CWE-918] Server-Side Request Forgery (SSRF).

(urlopen-unsanitized-data)

🪛 Ruff (0.15.20)

[error] 226-226: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/codecome/session.py` around lines 225 - 228, Close the HTTP response in create_session() when calling urllib.request.urlopen so retries don’t leak sockets. Update the request handling in session.py to use the same context-manager pattern already used by the status/health helpers, ensuring the response object is always closed after reading and parsing the JSON.

coderabbitai · 2026-06-30T13:57:14Z

+        self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+        # Optional diagnostic signal: opencode may pause server.heartbeat during
+        # valid long turns, so heartbeat-gap stalls are disabled by default. When
+        # explicitly enabled, they can flag runtime blockage sooner than the
+        # primary no-progress timeout.
+        self._heartbeat_stall_timeout_s = float(
+            os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
+        )


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Guard stall timeout env parsing.

Invalid CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT values currently raise during PhaseEventLoop construction, while CODECOME_RESUME_PROBE_TIMEOUT already falls back safely.

Proposed fix

- self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) + try: + self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180")) + except (TypeError, ValueError): + self._stall_timeout_s = 180.0 @@ - self._heartbeat_stall_timeout_s = float( - os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") - ) + try: + self._heartbeat_stall_timeout_s = float( + os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0") + ) + except (TypeError, ValueError): + self._heartbeat_stall_timeout_s = 0.0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))

# Optional diagnostic signal: opencode may pause server.heartbeat during

# valid long turns, so heartbeat-gap stalls are disabled by default. When

# explicitly enabled, they can flag runtime blockage sooner than the

# primary no-progress timeout.

self._heartbeat_stall_timeout_s = float(

os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")

)

try:

self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))

except (TypeError, ValueError):

self._stall_timeout_s = 180.0

# Optional diagnostic signal: opencode may pause server.heartbeat during

# valid long turns, so heartbeat-gap stalls are disabled by default. When

# explicitly enabled, they can flag runtime blockage sooner than the

# primary no-progress timeout.

try:

self._heartbeat_stall_timeout_s = float(

os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")

)

except (TypeError, ValueError):

self._heartbeat_stall_timeout_s = 0.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/events/phase_loop.py` around lines 68 - 75, The stall timeout environment parsing in PhaseEventLoop should be made resilient so invalid values do not raise during construction. Update the initialization logic around the _stall_timeout_s and _heartbeat_stall_timeout_s assignments in PhaseEventLoop to parse with a safe fallback, matching the behavior used for CODECOME_RESUME_PROBE_TIMEOUT. If either CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT is missing or malformed, keep the default timeout value instead of letting float() fail.

coderabbitai · 2026-06-30T13:57:14Z

+        if self._session_stalled and _last_finish_reason not in ("stop", "idle"):
+            _last_finish_reason = "session_stalled"


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Propagate dead-process detection as a recoverable result.

When liveness_check() returns False, the loop stops without setting session_stalled or a server_unreachable finish reason. The supplied runner/harness recovery paths only restart on those recoverable signals, so a mid-run server death can fall through as a generic incomplete run instead of triggering auto-recovery.

Possible direction

self._session_stalled = False self._session_idle_via_status = False + self._server_unreachable = False @@ if self._session_stalled and _last_finish_reason not in ("stop", "idle"): _last_finish_reason = "session_stalled" + if self._server_unreachable and _last_finish_reason not in ("stop", "idle"): + _last_finish_reason = "server_unreachable" @@ try: - return bool(self._liveness_check()) + alive = bool(self._liveness_check()) + if not alive: + self._server_unreachable = True + return alive except Exception: # noqa: BLE001 + self._server_unreachable = True return False

Also ensure the runner maps this finish reason to the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery path before harness branching.

Also applies to: 287-290

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/events/phase_loop.py` around lines 206 - 207, The dead-process path in phase_loop should be surfaced as a recoverable outcome instead of falling through as a generic incomplete run. Update the liveness_check() failure handling in phase_loop.py so a false result sets a distinct finish reason such as server_unreachable and/or session_stalled using the existing _last_finish_reason flow, then ensure the runner maps that finish reason into the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery branch before harness branching. Use the phase_loop loop logic and the runner’s finish-reason/status conversion code to keep recovery behavior consistent.

pruiz added 2 commits June 24, 2026 13:12

fix: recover from opencode stalls and control-plane timeouts

71f864e

fix: redirect erlang crash dumps to ignored tmp

0b13245

pruiz requested a review from Copilot June 30, 2026 13:47

Copilot started reviewing on behalf of pruiz June 30, 2026 13:48 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 30, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

		with pytest.raises(RuntimeError, match="empty session ID"):
		module.create_session("http://localhost:8080", "1", "recon", None, None, None)

		if self._session_stalled and _last_finish_reason not in ("stop", "idle"):
		_last_finish_reason = "session_stalled"

Conversation

pruiz commented Jun 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 30, 2026

Coverage Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

pruiz Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

pruiz Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 30, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

pruiz Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

pruiz Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pruiz commented Jun 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading