Skip to content

Recover from opencode stalls and control-plane timeouts#66

Open
pruiz wants to merge 2 commits into
masterfrom
fix/sse-stall-server-resilience
Open

Recover from opencode stalls and control-plane timeouts#66
pruiz wants to merge 2 commits into
masterfrom
fix/sse-stall-server-resilience

Conversation

@pruiz

@pruiz pruiz commented Jun 30, 2026

Copy link
Copy Markdown
Owner

Summary

  • add structured run statuses and recoverable opencode server/session stall handling
  • make Phase 1 retry only the failed subphase after server restarts
  • harden SSE busy/idle handling and resume/session health checks
  • redirect Erlang crash dumps into ignored tmp/ so generated dumps do not become opencode patch candidates

Validation

  • rtk make tests
  • env BOOTSTRAP_ARGS='--keep-going' rtk make sandbox-validate
  • rtk git diff --check

Summary by CodeRabbit

  • New Features

    • Added resilience and recovery controls for handling server unavailability, session stalls, and automatic retries.
    • Improved resume behavior with clearer recovery paths and restart budgeting.
    • Added a new crash-dump location for Erlang sandboxes.
  • Bug Fixes

    • Better handling of timeouts during session creation and prompt sending.
    • Improved detection of stalled or unreachable sessions, with safer recovery and reporting.
  • Documentation

    • Expanded the README with environment variables for restart, retry, timeout, and health-check behavior.
  • Tests

    • Added coverage for resume health, reconnect behavior, recovery restarts, and sandbox crash-dump handling.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Implements end-to-end resilience for opencode serve server deaths and session stalls. Adds RunStatus enum, OpenCodeRequestError with retriable metadata, session HTTP retry/health probing, SSE per-read-tick watchdog, PhaseEventLoop busy/stall tracking, ServerRunner liveness check and restart, runner server-unreachable outcome mapping, Phase1Outcome with start_at reentry, and a harness auto-recovery restart loop bounded by CODECOME_MAX_SERVER_RESTARTS.

Changes

Resume Server-Death Resilience

Layer / File(s) Summary
RunStatus enum and OpenCodeRequestError contract
tools/codecome/status.py, tools/codecome/session.py
RunStatus IntEnum with normalize_status and OpenCodeRequestError with retriable/operation fields establish foundational types for all subsequent layers.
Session HTTP retry, health probe, and OpenCodeRequestError propagation
tools/codecome/session.py, tests/test_session.py
create_session and send_prompt_to_session retry transient failures and raise OpenCodeRequestError; get_session_status gains a timeout param; is_server_healthy added to probe /global/health; retry and exhaustion behavior tested.
SSE client read-tick, reconnect budget, and heartbeat bookkeeping
tools/events/sse_client.py, tests/test_sse_busy_resilience.py
Replaces fixed heartbeat-timeout error with configurable per-read socket timeout yielding None ticks; adds should_continue predicate and _reconnect_budget_exhausted for budget extension; adds seconds_since_heartbeat(); read-tick, reconnect, and heartbeat tests added.
PhaseEventLoop busy tracking, stall watchdog, and idle-via-status sync
tools/events/base.py, tools/events/phase_loop.py, tests/test_sse_busy_resilience.py
Adds session_stalled to RunResult; introduces _track_session_busy, _note_progress, _stalled, _should_keep_consuming, and _status_probe_reports_idle; post-loop branches sync idle confirmation or force session_stalled; _fetch_session_status added to base; stall watchdog and integration tests added.
ServerRunner liveness check, restart, and exit monitor
tools/opencode/serve.py, tests/test_runner_resume_health.py
make_liveness_check returns a proc.poll()-based callable; ServerRunner.restart() wraps stop/start; _start_exit_monitor daemon thread appends exit-code line to server log; liveness check correctness tested.
Runner liveness threading and server-unreachable outcomes
tools/codecome/runner.py, tests/test_runner_resume_health.py, tests/test_codecome_runner.py
Adds ResumeSessionServerUnreachable; threads liveness_check through _consume_events and PhaseEventLoop; reworks _wait_for_resume_idle to branch on liveness vs health; handles OpenCodeRequestError in create_session/send_prompt paths returning RunStatus.INCOMPLETE; converts final returns to RunStatus values; resume health and session-timeout tests added.
Phase 1 RunStatus migration and Phase1Outcome
tools/codecome/phase_1.py
Adds Phase1Outcome dataclass with failed_subphase and start_at parameter to run_phase_1; migrates all numeric sentinel returns to RunStatus; retry classification translates INCOMPLETE+last_finish_reason into SERVER_UNREACHABLE/SESSION_STALLED; liveness check wired into attempts.
Harness auto-recovery loop and resume prompt openers
tools/codecome/harness.py, tools/phases/completion.py, tests/test_harness_recovery_restart.py
Adds _restart_server helper and _RECOVERABLE_RESTART_REASONS; wraps Phase 1 in restart loop with start_at reentry; adds same bounded loop for phases 2–6 with session reset and resume prompt rebuild; _resume_opener_for_reason extended for server_unreachable/session_stalled; restart count and budget exhaustion tested.
Design plan, README env vars, and Erlang crash dump config
.project/resume-server-death-resilience-plan.md, README.md, templates/sandboxes/erlang-otp/docker-compose.yml, tests/test_sandbox_bootstrap.py
Full resilience design plan document added; README documents all new environment variables; ERL_CRASH_DUMP set in Erlang sandbox docker-compose and verified by test.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

  • pruiz/CodeCome#43: This PR extends the _wait_for_resume_idle / ResumeSessionNotReady mechanics introduced in PR #43, adding server-unreachable detection on top of the existing resume-not-ready guard.
  • pruiz/CodeCome#64: Both PRs modify _resume_opener_for_reason() to add new reason branches for different incomplete finish causes.

Poem

🐇 Hop, hop — the server fell down,
But I sniffed its pulse before it could drown!
A liveness tick, a restart or two,
The stall watchdog barked and the session came through.
Phase 1 restarts right where it went astray,
No lost transcript — a resilient day! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding recovery for opencode stalls and control-plane timeouts.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sse-stall-server-resilience

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@pruiz pruiz requested a review from Copilot June 30, 2026 13:47
@github-actions

Copy link
Copy Markdown

Coverage Report

Metric Value
Line Coverage 76.9%
Lines Covered 0 / 0

Download detailed HTML coverage reports per OS/Python from the workflow artifacts.

Generated by pytest-cov on 2026-06-30T13:48:06.241Z

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves CodeCome’s resilience to opencode server crashes, control-plane timeouts, and hung busy turns by introducing structured run statuses, better SSE liveness handling, and automated server restart/retry flows (including resuming Phase 1 from the failed subphase only).

Changes:

  • Adds RunStatus codes and propagates server_unreachable / session_stalled as first-class recoverable conditions throughout the runner/harness/Phase 1 orchestration.
  • Hardens SSE consumption so long busy turns aren’t abandoned (reconnect past budget while busy+alive) while still bounding genuine hangs (stall watchdog + read-tick inactivity).
  • Improves operational diagnostics and workspace hygiene (server exit-code logging; Erlang crash dump redirected into ignored tmp/; documents new env knobs).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/phases/completion.py Adds resume opener messaging for server_unreachable and session_stalled.
tools/opencode/serve.py Adds process-handle liveness checks, restart(), and a daemon exit-code log monitor.
tools/events/sse_client.py Adds read-tick inactivity handling and should_continue to avoid abandoning busy turns.
tools/events/phase_loop.py Adds busy/idle tracking, stall watchdog, status-probe idle recovery, and stall propagation in RunResult.
tools/events/base.py Adds best-effort /session/status probe helper (absence ⇒ idle).
tools/codecome/status.py Introduces shared RunStatus IntEnum and normalization helper.
tools/codecome/session.py Adds structured OpenCodeRequestError, retries for session creation, and health probing.
tools/codecome/runner.py Threads liveness checks through resume wait + event consumption; maps retriable HTTP failures to recoverable outcomes.
tools/codecome/phase_1.py Returns structured Phase 1 outcome and enables retry-from-failed-subphase after server restart.
tools/codecome/harness.py Centralizes restart+retry logic for recoverable conditions; adds restart budgets for Phase 1 and phases 2–6.
tests/test_sse_busy_resilience.py Adds coverage for reconnect budgeting, read-tick behavior, stall watchdog, and idle recovery sync.
tests/test_session.py Adds coverage for create-session retry behavior and retriable request errors.
tests/test_sandbox_bootstrap.py Verifies Erlang crash dump is redirected to ignored tmp/.
tests/test_runner_resume_health.py Tests resume readiness logic with health/liveness probes and stall propagation mapping.
tests/test_harness_recovery_restart.py Tests harness restart/retry behavior and shared restart budget exhaustion.
tests/test_codecome_runner.py Updates runner tests for recoverable prompt/session failures using OpenCodeRequestError.
templates/sandboxes/erlang-otp/docker-compose.yml Sets ERL_CRASH_DUMP to /workspace/tmp/erl_crash.dump.
README.md Documents new resilience/recovery environment variables and behavior.
.project/resume-server-death-resilience-plan.md Adds implementation plan/design notes for the resilience work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_session.py
Comment on lines 101 to 103
with pytest.raises(RuntimeError, match="empty session ID"):
module.create_session("http://localhost:8080", "1", "recon", None, None, None)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the follow-up patch: the empty-session test now expects OpenCodeRequestError and asserts retriable=False plus operation="create_session". Validated with targeted tests and make tests.

Comment on lines +68 to +75
self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
# Optional diagnostic signal: opencode may pause server.heartbeat during
# valid long turns, so heartbeat-gap stalls are disabled by default. When
# explicitly enabled, they can flag runtime blockage sooner than the
# primary no-progress timeout.
self._heartbeat_stall_timeout_s = float(
os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the follow-up patch: CODECOME_BUSY_STALL_TIMEOUT and CODECOME_HEARTBEAT_STALL_TIMEOUT now use safe float parsing with defaults, matching the defensive behavior used for the probe timeout. Added regression coverage for malformed values.

@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR hardens the phase-mode execution loop against two real-world opencode failure modes — server crashes and hung model turns — by adding structured recovery paths throughout the harness, phase orchestration, and SSE client layers.

  • Server-death recovery: _wait_for_resume_idle now uses a process-handle liveness check (make_liveness_check) to distinguish a dead server from a temporarily unresponsive control plane; ServerRunner gains restart() and a daemon exit-monitor thread; harness.py and phase_1.py restart the server on SERVER_UNREACHABLE and re-enter Phase 1 at the failed subphase.
  • Stall watchdog: PhaseEventLoop tracks progress events and declares a session stalled after CODECOME_BUSY_STALL_TIMEOUT (default 180 s) of silence; a read-tick mechanism (Fix 6) makes the watchdog reachable even on a silent-but-open SSE connection where the opencode runtime has blocked its scheduler.
  • RunStatus enum replaces magic integer codes 3+ with named values SERVER_UNREACHABLE, SESSION_STALLED; shared restart budget (CODECOME_MAX_SERVER_RESTARTS) covers both conditions uniformly.

Confidence Score: 4/5

Safe to merge; all new recovery paths are protected by retry budgets and the design correctly delegates server lifecycle exclusively to the harness.

The changes are well-reasoned, thoroughly tested, and handle real production failures documented with PIDs and timestamps. The one notable imperfection is in _wait_for_resume_idle: when the server process is alive but its status endpoint is unavailable, the code records both a specific status_unavailable_process_alive event and a redundant blocked_unknown event in the same poll cycle, which could mislead transcript analysis tooling. The silent-fallback in _set_stream_timeout (can't find the socket → read-tick watchdog silently doesn't work) is another minor observability gap. Neither affects correctness of the primary recovery flows.

tools/codecome/runner.py — the _wait_for_resume_idle process-alive branch falls through to record a redundant blocked_unknown transcript event.

Important Files Changed

Filename Overview
tools/codecome/runner.py Adds ResumeSessionServerUnreachable, liveness-aware _wait_for_resume_idle with consecutive-None fast-fail, and session_stalled tracking; minor redundant event recording when process is alive but status endpoint returns None.
tools/events/sse_client.py Adds should_continue callback, read-tick mechanism via best-effort socket timeout, and reconnect-budget override; all new paths are well-guarded with exception handling.
tools/events/phase_loop.py Adds stall watchdog (_note_progress, _stalled, _should_keep_consuming), session-idle-via-status recovery path, and session_stalled propagation through RunResult; logic is correct and well-tested.
tools/opencode/serve.py Adds _start_exit_monitor daemon thread for exit-code logging, restart() convenience method, and make_liveness_check process-handle check; clean and correct.
tools/codecome/harness.py Adds _RECOVERABLE_RESTART_REASONS map, _restart_server helper, and unified recovery loops for server_unreachable and session_stalled in both Phase 1 and Phases 2-6; de-duplicated restart logic is clean.
tools/codecome/phase_1.py Introduces Phase1Outcome structured return, SERVER_UNREACHABLE/SESSION_STALLED propagation from _run_subphase, and start_at parameter for mid-phase restart re-entry; correctly delegates server lifecycle to harness.
tools/codecome/status.py Adds RunStatus IntEnum with OK/ERROR/INCOMPLETE/SERVER_UNREACHABLE/SESSION_STALLED/INTERRUPTED and normalize_status helper; replaces magic return code 3.
tests/test_sse_busy_resilience.py Comprehensive test coverage: reconnect budget, should_continue, stall watchdog truth table, read-tick interaction, heartbeat secondary signal, and idle-via-status recovery path.
tools/codecome/session.py Adds is_server_healthy helper used as fallback in _wait_for_resume_idle when no liveness_check is provided; get_session_status gains an explicit timeout parameter.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant H as Harness
    participant P1 as run_phase_1
    participant SP as _run_subphase
    participant RA as _run_single_attempt
    participant PEL as PhaseEventLoop
    participant SSE as SseClient
    participant OC as opencode serve

    H->>P1: "run_phase_1(start_at=1a)"
    P1->>SP: "_run_subphase(phase_id=1a)"
    SP->>RA: _run_single_attempt(liveness_check)
    RA->>OC: create_session / send_prompt
    RA->>PEL: _consume_events(liveness_check)
    PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
    loop SSE stream
        SSE->>OC: GET /event readline with socket timeout 10s
        alt event arrives
            OC-->>SSE: SSE event
            SSE-->>PEL: yield event
            PEL->>PEL: _note_progress / _track_session_busy
        else silent tick socket.timeout
            SSE-->>PEL: yield None
            PEL->>PEL: _should_keep_consuming stall watchdog
        end
        alt session idle
            PEL-->>RA: "RunResult session_stalled=False"
        else stall detected 180s no progress
            PEL->>PEL: _status_probe_reports_idle
            alt REST says idle
                PEL-->>RA: "RunResult idle_via_status=True"
            else still busy
                PEL-->>RA: "RunResult session_stalled=True"
            end
        else server process died
            PEL->>PEL: liveness_check False stop
            PEL-->>RA: "RunResult finish=server_unreachable"
        end
    end
    RA-->>SP: RunStatus run_result
    SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
    P1-->>H: Phase1Outcome
    alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
        H->>H: runner.restart
        H->>P1: "run_phase_1 start_at=failed_subphase"
    else
        H-->>H: return exit code
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant H as Harness
    participant P1 as run_phase_1
    participant SP as _run_subphase
    participant RA as _run_single_attempt
    participant PEL as PhaseEventLoop
    participant SSE as SseClient
    participant OC as opencode serve

    H->>P1: "run_phase_1(start_at=1a)"
    P1->>SP: "_run_subphase(phase_id=1a)"
    SP->>RA: _run_single_attempt(liveness_check)
    RA->>OC: create_session / send_prompt
    RA->>PEL: _consume_events(liveness_check)
    PEL->>SSE: "SseClient(should_continue=_should_keep_consuming)"
    loop SSE stream
        SSE->>OC: GET /event readline with socket timeout 10s
        alt event arrives
            OC-->>SSE: SSE event
            SSE-->>PEL: yield event
            PEL->>PEL: _note_progress / _track_session_busy
        else silent tick socket.timeout
            SSE-->>PEL: yield None
            PEL->>PEL: _should_keep_consuming stall watchdog
        end
        alt session idle
            PEL-->>RA: "RunResult session_stalled=False"
        else stall detected 180s no progress
            PEL->>PEL: _status_probe_reports_idle
            alt REST says idle
                PEL-->>RA: "RunResult idle_via_status=True"
            else still busy
                PEL-->>RA: "RunResult session_stalled=True"
            end
        else server process died
            PEL->>PEL: liveness_check False stop
            PEL-->>RA: "RunResult finish=server_unreachable"
        end
    end
    RA-->>SP: RunStatus run_result
    SP-->>P1: Phase1Outcome SERVER_UNREACHABLE or SESSION_STALLED
    P1-->>H: Phase1Outcome
    alt outcome is SERVER_UNREACHABLE or SESSION_STALLED
        H->>H: runner.restart
        H->>P1: "run_phase_1 start_at=failed_subphase"
    else
        H-->>H: return exit code
    end
Loading

Reviews (1): Last reviewed commit: "fix: redirect erlang crash dumps to igno..." | Re-trigger Greptile

Comment thread tools/codecome/runner.py
Comment on lines +116 to +124
if liveness_check is not None:
if liveness_check():
consecutive_unavailable = 0
_record_codecome_event(
transcript,
"codecome.resume.status_unavailable_process_alive",
sessionID=session_id,
)
else:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant blocked_unknown event when process is alive. When liveness_check() returns True and status is None, the code records status_unavailable_process_alive but then falls through unconditionally to also record blocked_unknown (since status != "busy"). This writes two semantically conflicting events for the same poll cycle — the second one is misleading because we do know the reason. Adding continue after the alive-branch avoids the duplication.

Suggested change
if liveness_check is not None:
if liveness_check():
consecutive_unavailable = 0
_record_codecome_event(
transcript,
"codecome.resume.status_unavailable_process_alive",
sessionID=session_id,
)
else:
if liveness_check is not None:
if liveness_check():
consecutive_unavailable = 0
_record_codecome_event(
transcript,
"codecome.resume.status_unavailable_process_alive",
sessionID=session_id,
)
if time.monotonic() >= deadline:
_record_codecome_event(
transcript,
"codecome.resume.timeout",
sessionID=session_id,
status=None,
timeoutSeconds=timeout_s,
)
raise ResumeSessionNotReady(
f"session {session_id} status endpoint unavailable (process alive) "
f"after {timeout_s:g}s; refusing to send resume prompt"
)
time.sleep(max(poll_s, 0.1))
continue
else:

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the follow-up patch: the live-process/status-unavailable branch now records the specific event, handles timeout/sleep, and continues without falling through to blocked_unknown. Added a regression assertion for this event stream.

Comment on lines +269 to +282
def _set_stream_timeout(resp: Any, timeout: float) -> None:
"""Best-effort: set a per-read socket timeout on the response stream."""
fp = getattr(resp, "fp", None)
sock = None
if fp is not None:
raw = getattr(fp, "raw", None)
sock = getattr(raw, "_sock", None) or getattr(fp, "_sock", None)
if sock is None:
sock = getattr(resp, "_sock", None)
if sock is not None and hasattr(sock, "settimeout"):
try:
sock.settimeout(timeout)
except Exception: # noqa: BLE001
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _set_stream_timeout silently no-ops on failure with no observable signal

If the socket attribute path (resp.fp.raw._sock, resp.fp._sock, or resp._sock) can't be resolved — which can happen across Python versions, proxied connections, or alternate urllib implementations — the timeout is never set and resp.readline() blocks forever. When that happens, Fix 6 (the read-tick stall watchdog) silently stops working: no None ticks are ever yielded, _should_keep_consuming() is never re-evaluated on a silent-but-open stream, and a stalled session hangs indefinitely.

Consider logging a one-shot debug warning (e.g. to sys.stderr or a logger) when sock is None so this failure mode is detectable in production logs rather than appearing as a plain hang.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the follow-up patch as an observability improvement: _set_stream_timeout now emits a one-shot stderr warning if the socket cannot be found or settimeout fails, with regression coverage.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/codecome/harness.py`:
- Around line 328-362: The stalled-recovery path in the harness can break out
with a success status when the restart budget is exhausted, because
`_recovery_reason` from `session_stalled` does not force a non-success result.
Update the recovery handling in `harness.py` so that the `session_stalled`
branch sets `returncode`/final status to a failure state before the
budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.

In `@tools/codecome/runner.py`:
- Around line 297-308: Returning RunStatus.INCOMPLETE from the
OpenCodeRequestError retriable path can happen while the SSE consumer thread is
still alive, which lets the abandoned session keep running as the harness
retries. Update the retry/error handling in runner.py around the
consumer.join(timeout=5.0) and the OpenCodeRequestError branch so that a
timed-out or still-active consumer is first stopped/drained (or treated as a
terminal failure for this attempt) before any recoverable result is returned.
Use the existing _record_codecome_event and RunStatus/RunResult flow, but ensure
this path does not hand back a fresh retryable state until the consumer has
fully exited.

In `@tools/codecome/session.py`:
- Around line 225-228: Close the HTTP response in create_session() when calling
urllib.request.urlopen so retries don’t leak sockets. Update the request
handling in session.py to use the same context-manager pattern already used by
the status/health helpers, ensuring the response object is always closed after
reading and parsing the JSON.

In `@tools/events/phase_loop.py`:
- Around line 206-207: The dead-process path in phase_loop should be surfaced as
a recoverable outcome instead of falling through as a generic incomplete run.
Update the liveness_check() failure handling in phase_loop.py so a false result
sets a distinct finish reason such as server_unreachable and/or session_stalled
using the existing _last_finish_reason flow, then ensure the runner maps that
finish reason into the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE
recovery branch before harness branching. Use the phase_loop loop logic and the
runner’s finish-reason/status conversion code to keep recovery behavior
consistent.
- Around line 68-75: The stall timeout environment parsing in PhaseEventLoop
should be made resilient so invalid values do not raise during construction.
Update the initialization logic around the _stall_timeout_s and
_heartbeat_stall_timeout_s assignments in PhaseEventLoop to parse with a safe
fallback, matching the behavior used for CODECOME_RESUME_PROBE_TIMEOUT. If
either CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT is
missing or malformed, keep the default timeout value instead of letting float()
fail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e79ed5d1-cadf-4f73-989e-1fc313b505ca

📥 Commits

Reviewing files that changed from the base of the PR and between 3855c46 and 0b13245.

📒 Files selected for processing (19)
  • .project/resume-server-death-resilience-plan.md
  • README.md
  • templates/sandboxes/erlang-otp/docker-compose.yml
  • tests/test_codecome_runner.py
  • tests/test_harness_recovery_restart.py
  • tests/test_runner_resume_health.py
  • tests/test_sandbox_bootstrap.py
  • tests/test_session.py
  • tests/test_sse_busy_resilience.py
  • tools/codecome/harness.py
  • tools/codecome/phase_1.py
  • tools/codecome/runner.py
  • tools/codecome/session.py
  • tools/codecome/status.py
  • tools/events/base.py
  • tools/events/phase_loop.py
  • tools/events/sse_client.py
  • tools/opencode/serve.py
  • tools/phases/completion.py

Comment thread tools/codecome/harness.py
Comment on lines +328 to +362
_recovery_reason = None
if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":
_recovery_reason = "server_unreachable"
elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":
_recovery_reason = "session_stalled"

if _recovery_reason is not None:
if server_restart_count < max_server_restarts:
server_restart_count += 1
out.warn(
f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
f"CodeCome will restart the opencode server and retry with a fresh session "
f"(server restart {server_restart_count}/{max_server_restarts})."
)
try:
server_info = _restart_server(runner, log_level=args.log_level)
except ServerRunnerError as exc:
_emit_fatal_error(console, "Server Restart Error", str(exc))
return int(RunStatus.ERROR)
base_url = server_info.base_url
last_session_id = ""
prompt = build_phase_resume_prompt(
str(args.phase), args.finding,
_recovery_reason, step_finish_count,
failure_details=phase_failures if phase_failures else None,
)
continue
last_session_id = session_id or last_session_id
last_finish_reason = _recovery_reason
finish_warning = (
f"CodeCome attempted {server_restart_count} server restart(s) after a "
f"'{_recovery_reason}' condition, but the restart budget is exhausted. "
"The phase cannot continue."
)
break

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Mark stalled recoveries as non-success before the budget-exhaustion break.

session_stalled is detected from run_result, so this branch can run with returncode == RunStatus.OK. If the restart budget is already exhausted, you break here without downgrading the status, and the footer at Line 549 reports the phase as successful and returns 0 after repeated stalls.

🐛 Proposed fix
             if _recovery_reason is not None:
+                if returncode == RunStatus.OK:
+                    returncode = RunStatus.INCOMPLETE
                 if server_restart_count < max_server_restarts:
                     server_restart_count += 1
                     out.warn(
                         f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
                         f"CodeCome will restart the opencode server and retry with a fresh session "
Please add a regression that exhausts the restart budget using only `session_stalled` results.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_recovery_reason = None
if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":
_recovery_reason = "server_unreachable"
elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":
_recovery_reason = "session_stalled"
if _recovery_reason is not None:
if server_restart_count < max_server_restarts:
server_restart_count += 1
out.warn(
f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
f"CodeCome will restart the opencode server and retry with a fresh session "
f"(server restart {server_restart_count}/{max_server_restarts})."
)
try:
server_info = _restart_server(runner, log_level=args.log_level)
except ServerRunnerError as exc:
_emit_fatal_error(console, "Server Restart Error", str(exc))
return int(RunStatus.ERROR)
base_url = server_info.base_url
last_session_id = ""
prompt = build_phase_resume_prompt(
str(args.phase), args.finding,
_recovery_reason, step_finish_count,
failure_details=phase_failures if phase_failures else None,
)
continue
last_session_id = session_id or last_session_id
last_finish_reason = _recovery_reason
finish_warning = (
f"CodeCome attempted {server_restart_count} server restart(s) after a "
f"'{_recovery_reason}' condition, but the restart budget is exhausted. "
"The phase cannot continue."
)
break
_recovery_reason = None
if returncode == RunStatus.INCOMPLETE and run_result.last_finish_reason == "server_unreachable":
_recovery_reason = "server_unreachable"
elif getattr(run_result, "session_stalled", False) or run_result.last_finish_reason == "session_stalled":
_recovery_reason = "session_stalled"
if _recovery_reason is not None:
if returncode == RunStatus.OK:
returncode = RunStatus.INCOMPLETE
if server_restart_count < max_server_restarts:
server_restart_count += 1
out.warn(
f"\n[Auto-Recovery] {_RECOVERABLE_RESTART_REASONS[_recovery_reason]} "
f"CodeCome will restart the opencode server and retry with a fresh session "
f"(server restart {server_restart_count}/{max_server_restarts})."
)
try:
server_info = _restart_server(runner, log_level=args.log_level)
except ServerRunnerError as exc:
_emit_fatal_error(console, "Server Restart Error", str(exc))
return int(RunStatus.ERROR)
base_url = server_info.base_url
last_session_id = ""
prompt = build_phase_resume_prompt(
str(args.phase), args.finding,
_recovery_reason, step_finish_count,
failure_details=phase_failures if phase_failures else None,
)
continue
last_session_id = session_id or last_session_id
last_finish_reason = _recovery_reason
finish_warning = (
f"CodeCome attempted {server_restart_count} server restart(s) after a "
f"'{_recovery_reason}' condition, but the restart budget is exhausted. "
"The phase cannot continue."
)
break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/codecome/harness.py` around lines 328 - 362, The stalled-recovery path
in the harness can break out with a success status when the restart budget is
exhausted, because `_recovery_reason` from `session_stalled` does not force a
non-success result. Update the recovery handling in `harness.py` so that the
`session_stalled` branch sets `returncode`/final status to a failure state
before the budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.

Comment thread tools/codecome/runner.py
Comment on lines +297 to +308
if isinstance(exc, OpenCodeRequestError) and exc.retriable:
_record_codecome_event(
transcript,
"codecome.attempt.incomplete",
errorType=type(exc).__name__,
message=str(exc),
existingSession=bool(existing_session_id),
)
return RunStatus.INCOMPLETE, session_id, RunResult(
last_finish_reason="server_unreachable",
last_session_id=session_id,
), transcript.path

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't return a recoverable result while the SSE consumer is still alive.

If consumer.join(timeout=5.0) times out, this branch still returns RunStatus.INCOMPLETE. That leaves the daemon consumer free to keep consuming/rendering the abandoned session while the harness starts a fresh attempt, especially if the prompt request timed out after the server had already accepted it.

🐛 Proposed fix
             if consumer.is_alive():
                 _record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id)
+                raise RuntimeError(
+                    "event loop thread did not stop after prompt send failure"
+                )
             if isinstance(exc, OpenCodeRequestError) and exc.retriable:
                 _record_codecome_event(
                     transcript,
                     "codecome.attempt.incomplete",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if isinstance(exc, OpenCodeRequestError) and exc.retriable:
_record_codecome_event(
transcript,
"codecome.attempt.incomplete",
errorType=type(exc).__name__,
message=str(exc),
existingSession=bool(existing_session_id),
)
return RunStatus.INCOMPLETE, session_id, RunResult(
last_finish_reason="server_unreachable",
last_session_id=session_id,
), transcript.path
if consumer.is_alive():
_record_codecome_event(transcript, "codecome.event_loop.stop_timeout", sessionID=session_id)
raise RuntimeError(
"event loop thread did not stop after prompt send failure"
)
if isinstance(exc, OpenCodeRequestError) and exc.retriable:
_record_codecome_event(
transcript,
"codecome.attempt.incomplete",
errorType=type(exc).__name__,
message=str(exc),
existingSession=bool(existing_session_id),
)
return RunStatus.INCOMPLETE, session_id, RunResult(
last_finish_reason="server_unreachable",
last_session_id=session_id,
), transcript.path
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/codecome/runner.py` around lines 297 - 308, Returning
RunStatus.INCOMPLETE from the OpenCodeRequestError retriable path can happen
while the SSE consumer thread is still alive, which lets the abandoned session
keep running as the harness retries. Update the retry/error handling in
runner.py around the consumer.join(timeout=5.0) and the OpenCodeRequestError
branch so that a timed-out or still-active consumer is first stopped/drained (or
treated as a terminal failure for this attempt) before any recoverable result is
returned. Use the existing _record_codecome_event and RunStatus/RunResult flow,
but ensure this path does not hand back a fresh retryable state until the
consumer has fully exited.

Comment thread tools/codecome/session.py
Comment on lines +225 to +228
try:
resp = urllib.request.urlopen(req, timeout=10.0)
data = json.loads(resp.read().decode("utf-8"))
break

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the session-create HTTP response.

create_session() reads from urlopen() without closing the response; repeated retries/runs can leak sockets. Use the same context-manager pattern used by the status/health calls.

Proposed fix
-            resp = urllib.request.urlopen(req, timeout=10.0)
-            data = json.loads(resp.read().decode("utf-8"))
+            with urllib.request.urlopen(req, timeout=10.0) as resp:
+                data = json.loads(resp.read().decode("utf-8"))
             break
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
resp = urllib.request.urlopen(req, timeout=10.0)
data = json.loads(resp.read().decode("utf-8"))
break
try:
with urllib.request.urlopen(req, timeout=10.0) as resp:
data = json.loads(resp.read().decode("utf-8"))
break
🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 225-225: Request-controlled URL passed to urlopen; validate against an allowlist to prevent SSRF.
Context: urllib.request.urlopen(req, timeout=10.0)
Note: [CWE-918] Server-Side Request Forgery (SSRF).

(urlopen-unsanitized-data)

🪛 Ruff (0.15.20)

[error] 226-226: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/codecome/session.py` around lines 225 - 228, Close the HTTP response in
create_session() when calling urllib.request.urlopen so retries don’t leak
sockets. Update the request handling in session.py to use the same
context-manager pattern already used by the status/health helpers, ensuring the
response object is always closed after reading and parsing the JSON.

Comment on lines +68 to +75
self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
# Optional diagnostic signal: opencode may pause server.heartbeat during
# valid long turns, so heartbeat-gap stalls are disabled by default. When
# explicitly enabled, they can flag runtime blockage sooner than the
# primary no-progress timeout.
self._heartbeat_stall_timeout_s = float(
os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Guard stall timeout env parsing.

Invalid CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT values currently raise during PhaseEventLoop construction, while CODECOME_RESUME_PROBE_TIMEOUT already falls back safely.

Proposed fix
-        self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+        try:
+            self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
+        except (TypeError, ValueError):
+            self._stall_timeout_s = 180.0
@@
-        self._heartbeat_stall_timeout_s = float(
-            os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
-        )
+        try:
+            self._heartbeat_stall_timeout_s = float(
+                os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
+            )
+        except (TypeError, ValueError):
+            self._heartbeat_stall_timeout_s = 0.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
# Optional diagnostic signal: opencode may pause server.heartbeat during
# valid long turns, so heartbeat-gap stalls are disabled by default. When
# explicitly enabled, they can flag runtime blockage sooner than the
# primary no-progress timeout.
self._heartbeat_stall_timeout_s = float(
os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
)
try:
self._stall_timeout_s = float(os.environ.get("CODECOME_BUSY_STALL_TIMEOUT", "180"))
except (TypeError, ValueError):
self._stall_timeout_s = 180.0
# Optional diagnostic signal: opencode may pause server.heartbeat during
# valid long turns, so heartbeat-gap stalls are disabled by default. When
# explicitly enabled, they can flag runtime blockage sooner than the
# primary no-progress timeout.
try:
self._heartbeat_stall_timeout_s = float(
os.environ.get("CODECOME_HEARTBEAT_STALL_TIMEOUT", "0")
)
except (TypeError, ValueError):
self._heartbeat_stall_timeout_s = 0.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/events/phase_loop.py` around lines 68 - 75, The stall timeout
environment parsing in PhaseEventLoop should be made resilient so invalid values
do not raise during construction. Update the initialization logic around the
_stall_timeout_s and _heartbeat_stall_timeout_s assignments in PhaseEventLoop to
parse with a safe fallback, matching the behavior used for
CODECOME_RESUME_PROBE_TIMEOUT. If either CODECOME_BUSY_STALL_TIMEOUT or
CODECOME_HEARTBEAT_STALL_TIMEOUT is missing or malformed, keep the default
timeout value instead of letting float() fail.

Comment on lines +206 to +207
if self._session_stalled and _last_finish_reason not in ("stop", "idle"):
_last_finish_reason = "session_stalled"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Propagate dead-process detection as a recoverable result.

When liveness_check() returns False, the loop stops without setting session_stalled or a server_unreachable finish reason. The supplied runner/harness recovery paths only restart on those recoverable signals, so a mid-run server death can fall through as a generic incomplete run instead of triggering auto-recovery.

Possible direction
         self._session_stalled = False
         self._session_idle_via_status = False
+        self._server_unreachable = False
@@
         if self._session_stalled and _last_finish_reason not in ("stop", "idle"):
             _last_finish_reason = "session_stalled"
+        if self._server_unreachable and _last_finish_reason not in ("stop", "idle"):
+            _last_finish_reason = "server_unreachable"
@@
         try:
-            return bool(self._liveness_check())
+            alive = bool(self._liveness_check())
+            if not alive:
+                self._server_unreachable = True
+            return alive
         except Exception:  # noqa: BLE001
+            self._server_unreachable = True
             return False

Also ensure the runner maps this finish reason to the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery path before harness branching.

Also applies to: 287-290

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/events/phase_loop.py` around lines 206 - 207, The dead-process path in
phase_loop should be surfaced as a recoverable outcome instead of falling
through as a generic incomplete run. Update the liveness_check() failure
handling in phase_loop.py so a false result sets a distinct finish reason such
as server_unreachable and/or session_stalled using the existing
_last_finish_reason flow, then ensure the runner maps that finish reason into
the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE recovery branch before
harness branching. Use the phase_loop loop logic and the runner’s
finish-reason/status conversion code to keep recovery behavior consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants