Skip to content

fix(gateway): soft-reset pre-turn gap recovery (v0.5.38)#143

Merged
royosherove merged 7 commits into
mainfrom
fix/soft-reset-pre-turn-gap
May 16, 2026
Merged

fix(gateway): soft-reset pre-turn gap recovery (v0.5.38)#143
royosherove merged 7 commits into
mainfrom
fix/soft-reset-pre-turn-gap

Conversation

@royosherove
Copy link
Copy Markdown
Member

Summary

Closes the v0.5.38 "soft-reset pre-turn gap": idle sessions that grow past the model's context window via background cron/boot/sub-agent work now recover transparently when the user's next message would overflow.

Root cause

Soft-reset infrastructure (v0.5.28–v0.5.32) only fired inside flushMemoryThenCompact's catch — i.e. when compact itself overflowed. If a session crossed the limit without ever tripping proactive soft/hard pressure thresholds, the next agent.prompt() threw prompt is too long. The gateway catch posted the raw provider error verbatim — no classification, no softReset call, no state update. Loop persisted indefinitely.

Design (codex-led)

Codex evaluated 4 alternatives, recommended D: shared reactive recoverFromContextOverflow helper used from both flushMemoryThenCompact (existing path) and the new gateway catch. Deferred-retry UX (no transparent re-prompt — unsafe with streaming + tools). pendingCompact="emergency" fallback only when agent.compact exists.

Full design doc: docs/design/v0.5.38-soft-reset-pre-turn-gap.md

Code review fixes (F1–F6)

After initial implementation, code review surfaced 6 findings; all addressed in commits 662f032 / 4df2d3d / 71c674e:

  • F1 (P1 blocker): model_error events from pi-ai streaming bypassed the gateway catch. Fix (codex-designed): classify overflow in streaming.ts and throw StreamModelOverflowError, routing into existing gateway catch. +7 tests for streaming overflow scenarios.
  • F2: Dropped dead 'cron' from TurnSource union (cron jobs run via cron/runner.ts, never handleAgentTurn).
  • F3: unsupported outcome now posts a clear "adapter doesn't support recovery" hint instead of raw provider error.
  • F4: Extracted appendCompactLog to memory/telemetry.ts (kills cross-domain coupling).
  • F5: Deduped MAX_ERROR_PREVIEW=200 between gateway.ts + overflow.ts.
  • F6: Named MAX_FAILURE_REASON_PREVIEW=100 constant for UI message slicing.

Tests

591 passing (+7 from F1 streaming integration tests). Coverage:

  • 9 gateway scenarios (overflow non-stream, overflow stream pre-text, overflow stream post-text, unsupported with/without compact, noop+failed armed-pending, non-overflow, background, concurrent-lock)
  • 7 helper-level scenarios (regression for v0.5.30 wrapped-cause + v0.5.32 non-Error throw)
  • F1 streaming-path tests for model_error → throw → gateway recovery

Verification

Bug currently live on Roy's main session (2.8MB, 211k tokens) for end-to-end validation post-merge. After ship + npm publish + gateway restart, send "hi" → expect ♻️ + ✅ recovery flow, not raw error.

Files

  • New: src/agents/shared/overflow-recovery.ts (helper, no state side effects)
  • New: src/gateway/overflow.ts (gateway catch path, free function for testability)
  • New: src/memory/telemetry.ts (extracted from lifecycle.ts)
  • New: docs/design/v0.5.38-soft-reset-pre-turn-gap.md
  • New: test/overflow-recovery.test.ts, test/gateway-overflow-recovery.test.ts
  • Edit: gateway.ts (catch wired to helper, turnSource plumbed for boot/subagent)
  • Edit: streaming.ts (model_error overflow → throw), lifecycle.ts (uses shared helper), CHANGELOG.md

~430 LOC src + ~520 LOC tests, 7 commits.

roundhouse-fix-bot and others added 7 commits May 16, 2026 20:20
Codex (gpt-5.1) recommended Alternative D: shared reactive overflow
recovery helper, deferred-retry UX, fallback to pendingCompact=emergency.

Full design + open-question resolutions in
docs/design/v0.5.38-soft-reset-pre-turn-gap.md.

Implements next: classify-in-catch + agent.softReset() + state write +
fallback when softReset is unavailable/fails.
…mpt catch (v0.5.38)

Closes the soft-reset pre-turn gap: when an idle session has already
grown past the provider context limit (e.g. via background cron/boot/
sub-agent activity that didn't trip soft/hard pressure), the next user
turn's agent.prompt() / agent.promptStream() throws 'prompt is too long'
before any pre-turn pressure handler can fire. v0.5.29-v0.5.32 only
recovered when *compact itself* overflowed; this catch path was
unprotected, so the gateway just posted the raw provider error and the
loop continued.

Changes:
- Extract recoverFromContextOverflow() into src/agents/shared/overflow-
  recovery.ts (file-private attemptSoftResetRecovery in lifecycle.ts is
  removed; lifecycle now imports the shared helper).
- Add src/gateway/overflow.ts with recoverFromAgentTurnOverflow():
  classify -> softReset -> persist memory state (forceInjectReason or
  pendingCompact='emergency' fallback when softReset is unavailable
  but compact isn't) -> append telemetry (level='gateway-overflow')
  -> post deferred-retry hint.
- Wire it into Gateway.handleAgentTurn's catch. Plumb turnSource
  ('user'|'boot'|'subagent') through so background turns get distinct
  copy ('Original work was not retried.' instead of 'please resend').
- handleStreaming() now returns hadVisibleText so the catch can
  distinguish 'please resend' from 'response was interrupted' on
  streaming overflows after partial text.
- export appendCompactLog from lifecycle so the gateway-side recovery
  uses the same compact-timing.jsonl schema.

UX: deferred retry only. No transparent same-turn replay because a
streamed turn that already emitted text or executed tools would
duplicate output / side effects on retry.

Design: docs/design/v0.5.38-soft-reset-pre-turn-gap.md (codex-cli
Alternative D).
19 new tests across two files (584 total, +19 net):

test/overflow-recovery.test.ts (8 tests) — direct unit tests on the
extracted recoverFromContextOverflow helper:
- non-overflow returns not-overflow without calling softReset
- adapter without softReset returns unsupported
- softReset success returns recovered with report + ✅ progress
- softReset reset:false returns noop with reason + ⚠️ progress
- softReset throws returns failed with message + ❌ progress
- softReset throws non-Error (string) — String() coerced, no TypeError
  (regression for v0.5.32 fix #4)
- overflow buried in .cause chain still classifies (v0.5.30 regression)
- onProgress optional — undefined callback handled

test/gateway-overflow-recovery.test.ts (11 tests) — gateway-level
integration tests on recoverFromAgentTurnOverflow with fake adapter +
fake thread, covering the brief's required surface:
- overflow during non-streaming prompt + softReset success → recovered
  hint + state.forceInjectReason='after-soft-reset' + cleared pendingCompact
- overflow + softReset undefined + no compact → sanitized error, no arming
- overflow + softReset throws + has compact → pendingCompact='emergency'
  armed + 'Recovery armed (failed: …)' hint
- overflow + softReset reset:false + has compact → pendingCompact armed
  + 'Recovery armed (noop: <reason>)' hint
- overflow + softReset fails + no compact → sanitized error, no arming
  (don't arm a flag the next turn can't honor)
- non-overflow error → sanitized error, no recovery, no state mutation
- streaming overflow before any text → '✅ Recovered. Please resend…'
- streaming overflow after partial text → '♻️ Response was interrupted'
- background turns (boot/subagent/cron) → distinct copy, no resend ask
- thread.post throws → recovery still updates state, doesn't propagate
- overflow in cause chain → still triggers recovery

All tests pass: 584 / 584.
…, v0.5.38)

pi-ai's streaming surfaces provider errors as model_error EVENTS, not
thrown exceptions. The v0.5.38 catch in Gateway.handleAgentTurn only
saw synchronous throws from agent.prompt()/promptStream(), so streamed
'prompt is too long' bypassed recoverFromAgentTurnOverflow entirely:
the model_error case posted the raw error inline, the for-await
continued to turn_end, and the function returned normally. On Telegram
(streaming-default), the v0.5.38 fix was effectively a no-op.

Per codex-cli design (see ~/.roundhouse/workspace/softreset-f1-codex-design.md):
classify the model_error message in streaming.ts via isContextOverflowError.
Non-overflow keeps the existing inline post + continue-loop. Overflow
flushes, suppresses the inline raw post, and throws a typed
StreamModelOverflowError carrying the provider message. The gateway
catch's existing recoverFromAgentTurnOverflow handles it identically to
synchronous-throw overflow — single recovery surface, no duplicate
posts, no flag plumbed through StreamResult.

7 new streaming-overflow tests:
- model_error overflow throws + suppresses raw post
- model_error non-overflow keeps inline post + continues loop (regression)
- text_delta then overflow throws with overflow message
- non-overflow inline post format unchanged
- end-to-end: overflow → recovery → 'please resend' wording
- end-to-end: partial text → recovery → 'interrupted' wording
- end-to-end: non-overflow → no recovery, raw post stands alone

591 tests passing (584 + 7).
…rflow hint (F2+F3)

F2: TurnSource included 'cron' but cron jobs run via cron/runner.ts in
their own session and never invoke Gateway.handleAgentTurn. Verified
via grep on this.handleAgentTurn — only user-message-handler,
fireBootTurn, and handleSubagentCompletion reach it. Removed 'cron'
from the union. If a future patch adds cron→main injection, add it
back at that point (YAGNI).

F3: pickUserMessage's 'unsupported' branch (no softReset on adapter,
no armed pending) returned the raw sanitized provider error
('⚠️ Error: prompt is too long: N > M'). User saw a wall-of-numbers
with no indication that recovery was unavailable. Now returns:
  '⚠️ Session full — adapter doesn't support automatic recovery.
   Run /compact manually or restart session.'

Updated:
- src/gateway/overflow.ts — TurnSource type, file header comment,
  pickUserMessage unsupported branch.
- test/gateway-overflow-recovery.test.ts — drop 'cron' from
  background-turn iteration; rewrite the 'unsupported-no-compact' test
  to assert the new clear-guidance copy and verify the raw provider
  error does NOT leak through.
- docs/design/v0.5.38-soft-reset-pre-turn-gap.md — match TurnSource shape.

591 tests passing.
F4 — appendCompactLog cross-domain import.
src/gateway/overflow.ts imported appendCompactLog from src/memory/lifecycle,
coupling gateway → memory/lifecycle (flagged as follow-up in the v0.5.38
design doc). Extracted appendCompactLog + CompactLogEntry to a new
src/memory/telemetry.ts (~43 LOC). Both lifecycle.ts and gateway/overflow.ts
now import from telemetry. lifecycle.ts re-exports the symbol for
backwards compatibility (no external callers found via grep).

F5 — MAX_ERROR_PREVIEW duplication.
The constant existed in both src/gateway/gateway.ts and src/gateway/overflow.ts.
gateway.ts had only the declaration (its sole consumer moved to overflow.ts
with the v0.5.38 catch refactor) — deleted as dead code. overflow.ts keeps
its single definition.

F6 — slice(0, 100) magic number.
pickUserMessage's failed-outcome reason truncation used a bare 100. Added
named constant MAX_FAILURE_REASON_PREVIEW = 100 with a comment explaining
why it's shorter than MAX_ERROR_PREVIEW (UI message bracketed by other
copy, not a standalone error preview).

591 tests passing.
Reflects the post-review changes:
- F1: streaming model_error overflow path now routed through the same
  recovery surface as synchronous-throw overflow (was the primary
  Telegram-path bug; pi-ai surfaces provider errors as events, not
  exceptions).
- F2: TurnSource cron removal.
- F3: clearer unsupported-overflow user message.
- F4-F6: telemetry extraction, MAX_ERROR_PREVIEW dedup, named constant
  for failure-reason preview length.
- Test count corrected to 591 (+26 net).
@royosherove royosherove merged commit 15c266e into main May 16, 2026
1 check passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9797640771

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/gateway/gateway.ts
Comment on lines 479 to +481
const streamResult = await this.handleStreaming(thread, agent.promptStream(agentThreadId, agentMessage), verboseThreads.has(agentThreadId), ac.signal);
turnUsedTools = streamResult.usedTools;
streamHadVisibleText = streamResult.hadVisibleText;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve visible-stream state when overflow aborts streaming

streamHadVisibleText is only updated from streamResult after handleStreaming(...) returns, but handleStreaming now throws StreamModelOverflowError on overflow model_error events. In the common case where some text_delta was already emitted and then overflow occurs, this assignment never runs, so the catch path always receives hadVisibleText: false and posts the "Please resend your last message" copy. That is misleading after partial output and can prompt duplicate side effects if the user resends a message whose tools already started running.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant