fix: Gemini ACP runs recorded as false 15m timeouts#43
Conversation
✅ Deploy Preview for axis-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
✅ Deploy Preview for axisproject ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
📝 WalkthroughWalkthrough
Timeout handling switches from a fixed adapter default to Tests add a Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/unit/adapters/acp-adapter.test.ts (1)
924-1071: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueConsider moving fake timer cleanup to
afterEachfor robustness.The tests correctly call
vi.useRealTimers()at the end, but if a test fails midway, fake timers could leak. Since these tests are grouped together and each starts withvi.useFakeTimers(), this is unlikely to cause issues in practice.♻️ Suggested approach
Either add cleanup to the existing
afterEach:afterEach(() => { processKillSpy.mockRestore(); + vi.useRealTimers(); });Or wrap individual tests with try/finally:
it("...", async () => { vi.useFakeTimers(); try { // test body } finally { vi.useRealTimers(); } });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/unit/adapters/acp-adapter.test.ts` around lines 924 - 1071, Move the vi.useRealTimers() cleanup call from the end of each test method to a centralized afterEach hook to ensure fake timers are properly reset even if a test fails before reaching the cleanup code. The affected tests are the ones starting with vi.useFakeTimers() such as the tests for timeout behavior, abort signal handling, and the timeout failure scenarios. This prevents fake timer state from leaking between tests when exceptions occur during test execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/unit/adapters/acp-adapter.test.ts`:
- Around line 924-1071: Move the vi.useRealTimers() cleanup call from the end of
each test method to a centralized afterEach hook to ensure fake timers are
properly reset even if a test fails before reaching the cleanup code. The
affected tests are the ones starting with vi.useFakeTimers() such as the tests
for timeout behavior, abort signal handling, and the timeout failure scenarios.
This prevents fake timer state from leaking between tests when exceptions occur
during test execution.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1f198963-f31c-4100-90ef-d29fad26846e
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (2)
src/adapters/base/acp-adapter.tstest/unit/adapters/acp-adapter.test.ts
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
netlify/blueprints(manual)
The ACP adapter used only its own 10m DEFAULT_TIMEOUT_MS and ignored `input.timeoutMs`, so Gemini scenarios with the default 15m limit were actually killed at 10m. It also ignored `input.signal` entirely, so per-scenario/overall token limits and Ctrl-C never tore down the child. Use `input.timeoutMs ?? default` for the timeout, wire up the abort signal (SIGTERM → SIGKILL) the way the base adapter does, and attribute any observed token usage on the timeout/abort paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the gemini CLI (or a subprocess it spawned, e.g. `node --test`) lingered after the agent finished, the run could be killed by the timeout/abort right after `connection.prompt()` returned a terminal `prompt_result`/`end_turn` — and the timeout path discarded the result and token usage, turning a success into a "timeout" failure with 0 tokens and no score. Track whether the prompt reached a genuine terminal result (any stop reason other than `cancelled`). If it did, record the run as completed with its real result and token usage even when the timeout or abort fired, and normalise the child's kill-induced non-zero exit code. Runs that never reach a terminal result still fail as timeouts/aborts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ACP agents shell out to subprocesses (e.g. `node --test`) that can outlive the CLI. Killing only the CLI left those grandchildren running, leaking resources and — when they held the child's stdio pipes open — delaying the `close` event that ends the job. Spawn the CLI detached so its pid doubles as a process-group id, and add killProcessTree() to signal the whole group (with a direct child.kill fallback) at every teardown site: timeout, abort, Ctrl-C cleanup, and the normal post-prompt shutdown. The acp-adapter unit tests now stub process.kill so the group signal can't reach the real OS during the run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds regression tests for the false-timeout bug: a run whose terminal prompt_result lands after the timeout or abort fires is recorded as a completed success (real result, real token usage, exit 0); a run that never reaches a terminal result still fails as a timeout; the per-run input.timeoutMs is honoured over the adapter default; and an abort before completion tears down the whole process group. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
695d46c to
98ffa1b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/adapters/base/acp-adapter.ts`:
- Around line 454-462: The `killedAfterCompletion` logic on line 456 does not
account for the SIGTERM sent by the finally block after clean completion,
causing clean completions with no assistant text to be incorrectly marked as
failed. Modify the error gate condition on line 459 (the check that includes
`exitCode !== 0 && !state.lastAssistantMessage`) to also gate on
`!completedCleanly` so that when the process completes cleanly, the non-zero
exit code from the finally block's SIGTERM is not treated as an error. This
ensures clean completions are not fabricating an error message just because of
the normal cleanup SIGTERM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f76c1aad-d89c-4f44-bf46-a98352a7a6ca
📒 Files selected for processing (2)
src/adapters/base/acp-adapter.tstest/unit/adapters/acp-adapter.test.ts
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
netlify/blueprints(manual)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/unit/adapters/acp-adapter.test.ts
A run that reached a terminal result (end_turn) but produced only tool calls — no final assistant text — left lastAssistantMessage null. The cleanup path closes stdin and SIGTERMs a lingering CLI, so the child exits non-zero, and the exit-code error gate then manufactured "Agent process exited with non-zero code", marking a successful run as failed. Gate that error (and the exit-code normalisation) on completedCleanly rather than only on a timeout/abort kill, so any non-zero/signal exit after a clean turn is treated as our own cleanup, not a failure. Addresses CodeRabbit review finding on PR #43. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A scenario time limit, token limit, or Ctrl-C is a hard cap. Previously the adapter preserved a run that completed in the same instant the timeout/abort fired as a success — but the runner already re-stamped aborts as failures, so time-limit and token-limit races resolved differently (pass vs. fail). Drop the completion-race preservation from the timeout/abort paths so a cutoff always fails, regardless of a last-instant terminal result. Token usage observed before the cut is still attributed. The original false-timeout bug stays fixed: those runs finish well within the limit (honouring input.timeoutMs) and are reaped via the process tree, so no cutoff fires for them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Some Gemini eval runs were recorded as "Scenario time limit reached (15m)" failures (exit 1, no score, 0 tokens) even though the agent completed the task — the transcripts end with a terminal
prompt_result/end_turnand real token usage.Root cause, all in the shared ACP adapter:
input.timeoutMs(the scenario limit) and used its own hardcoded 10‑minute default, so runs were killed at ~10m. The two affected runs both died at ~600s.node --test) lingering and holding the job open.Fix
input.timeoutMs(falling back to the adapter default) so runs get their full configured budget instead of a hidden 10m cap.prompt_resultis our own teardown, not a failure — even when the agent produced no final text. (Addresses the CodeRabbit review finding.)Tests
Adapter unit tests cover:
input.timeoutMsis honoured over the adapter default; an abort tears down the process group; timeout/abort cutoffs fail with token usage attributed; and a clean completion with no final text is a success despite a non‑zero exit.Builds on #44 (
isFailedRun), which scores a completed run as passed even when its process exited via signal.