Skip to content

fix: Gemini ACP runs recorded as false 15m timeouts#43

Merged
seancdavis merged 7 commits into
mainfrom
fix/gemini-acp-completion-timeout
Jun 23, 2026
Merged

fix: Gemini ACP runs recorded as false 15m timeouts#43
seancdavis merged 7 commits into
mainfrom
fix/gemini-acp-completion-timeout

Conversation

@seancdavis

@seancdavis seancdavis commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Problem

Some Gemini eval runs were recorded as "Scenario time limit reached (15m)" failures (exit 1, no score, 0 tokens) even though the agent completed the task — the transcripts end with a terminal prompt_result/end_turn and real token usage.

Root cause, all in the shared ACP adapter:

  • It ignored input.timeoutMs (the scenario limit) and used its own hardcoded 10‑minute default, so runs were killed at ~10m. The two affected runs both died at ~600s.
  • It never handled the runner's abort signal, and only killed the CLI process — leaving subprocesses it spawned (e.g. node --test) lingering and holding the job open.
  • A clean turn that produced only tool calls (no final assistant text) was marked failed: cleanup SIGTERMs the lingering CLI → non‑zero exit → a fabricated "exited with non-zero code" error.

Fix

  • Honour input.timeoutMs (falling back to the adapter default) so runs get their full configured budget instead of a hidden 10m cap.
  • Tear down the process tree: spawn the CLI detached and signal the whole process group on completion/timeout/abort, so lingering subprocesses are reaped and can't hold the job open.
  • Wire up the abort signal (SIGTERM → SIGKILL), matching the base adapter, so token/run limits and Ctrl‑C actually stop the agent.
  • Don't fail a clean completion for a cleanup exit code: a non‑zero/signal exit after a terminal prompt_result is our own teardown, not a failure — even when the agent produced no final text. (Addresses the CodeRabbit review finding.)
  • Treat a hard cutoff as a failure: a time limit, token limit, or Ctrl‑C fails the run even if a terminal result lands in the same instant — but the token usage observed before the cut is still attributed.

Tests

Adapter unit tests cover: input.timeoutMs is honoured over the adapter default; an abort tears down the process group; timeout/abort cutoffs fail with token usage attributed; and a clean completion with no final text is a success despite a non‑zero exit.

Builds on #44 (isFailedRun), which scores a completed run as passed even when its process exited via signal.

@netlify

netlify Bot commented Jun 22, 2026

Copy link
Copy Markdown

Deploy Preview for axis-docs ready!

Name Link
🔨 Latest commit 68f837e
🔍 Latest deploy log https://app.netlify.com/projects/axis-docs/deploys/6a3acc997ef33c0008dfd958
😎 Deploy Preview https://deploy-preview-43--axis-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@netlify

netlify Bot commented Jun 22, 2026

Copy link
Copy Markdown

Deploy Preview for axisproject ready!

Name Link
🔨 Latest commit 68f837e
🔍 Latest deploy log https://app.netlify.com/projects/axisproject/deploys/6a3acc99638c470008fee49e
😎 Deploy Preview https://deploy-preview-43--axisproject.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

createAcpBasedAdapter's run() now spawns ACP CLI child processes with detached: true to form a process group. A new killProcessTree helper signals the entire group via process.kill(-pid, signal) with a direct-child fallback, and the Ctrl-C cleanup path is updated to use it.

Timeout handling switches from a fixed adapter default to input.timeoutMs (with adapter default fallback). A new input.signal abort path is added, tracking abortedBySignal/abortReason and scheduling SIGTERM→SIGKILL. A completedCleanly flag (set when stopReason !== "cancelled") prevents timeout or abort teardown from overriding a result that already completed successfully. Exit codes and errors are normalized when the child was killed after clean completion.

Tests add a process.kill spy and four race-condition scenarios covering completion-before-timeout, completion-before-abort, timeout without completion, and abort-before-completion process-group signaling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main fix: resolving false 15-minute timeout recordings in Gemini ACP runs that had actually completed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly explains the problem, root causes, and the implemented fix with specific details about timeouts, process management, and result handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gemini-acp-completion-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/unit/adapters/acp-adapter.test.ts (1)

924-1071: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider moving fake timer cleanup to afterEach for robustness.

The tests correctly call vi.useRealTimers() at the end, but if a test fails midway, fake timers could leak. Since these tests are grouped together and each starts with vi.useFakeTimers(), this is unlikely to cause issues in practice.

♻️ Suggested approach

Either add cleanup to the existing afterEach:

  afterEach(() => {
    processKillSpy.mockRestore();
+   vi.useRealTimers();
  });

Or wrap individual tests with try/finally:

it("...", async () => {
  vi.useFakeTimers();
  try {
    // test body
  } finally {
    vi.useRealTimers();
  }
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/unit/adapters/acp-adapter.test.ts` around lines 924 - 1071, Move the
vi.useRealTimers() cleanup call from the end of each test method to a
centralized afterEach hook to ensure fake timers are properly reset even if a
test fails before reaching the cleanup code. The affected tests are the ones
starting with vi.useFakeTimers() such as the tests for timeout behavior, abort
signal handling, and the timeout failure scenarios. This prevents fake timer
state from leaking between tests when exceptions occur during test execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/unit/adapters/acp-adapter.test.ts`:
- Around line 924-1071: Move the vi.useRealTimers() cleanup call from the end of
each test method to a centralized afterEach hook to ensure fake timers are
properly reset even if a test fails before reaching the cleanup code. The
affected tests are the ones starting with vi.useFakeTimers() such as the tests
for timeout behavior, abort signal handling, and the timeout failure scenarios.
This prevents fake timer state from leaking between tests when exceptions occur
during test execution.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f198963-f31c-4100-90ef-d29fad26846e

📥 Commits

Reviewing files that changed from the base of the PR and between 51539fe and 695d46c.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (2)
  • src/adapters/base/acp-adapter.ts
  • test/unit/adapters/acp-adapter.test.ts
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • netlify/blueprints (manual)

seancdavis and others added 5 commits June 23, 2026 13:43
The ACP adapter used only its own 10m DEFAULT_TIMEOUT_MS and ignored
`input.timeoutMs`, so Gemini scenarios with the default 15m limit were
actually killed at 10m. It also ignored `input.signal` entirely, so
per-scenario/overall token limits and Ctrl-C never tore down the child.

Use `input.timeoutMs ?? default` for the timeout, wire up the abort
signal (SIGTERM → SIGKILL) the way the base adapter does, and attribute
any observed token usage on the timeout/abort paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the gemini CLI (or a subprocess it spawned, e.g. `node --test`)
lingered after the agent finished, the run could be killed by the
timeout/abort right after `connection.prompt()` returned a terminal
`prompt_result`/`end_turn` — and the timeout path discarded the result
and token usage, turning a success into a "timeout" failure with 0
tokens and no score.

Track whether the prompt reached a genuine terminal result (any stop
reason other than `cancelled`). If it did, record the run as completed
with its real result and token usage even when the timeout or abort
fired, and normalise the child's kill-induced non-zero exit code. Runs
that never reach a terminal result still fail as timeouts/aborts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ACP agents shell out to subprocesses (e.g. `node --test`) that can
outlive the CLI. Killing only the CLI left those grandchildren running,
leaking resources and — when they held the child's stdio pipes open —
delaying the `close` event that ends the job.

Spawn the CLI detached so its pid doubles as a process-group id, and add
killProcessTree() to signal the whole group (with a direct child.kill
fallback) at every teardown site: timeout, abort, Ctrl-C cleanup, and
the normal post-prompt shutdown.

The acp-adapter unit tests now stub process.kill so the group signal
can't reach the real OS during the run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds regression tests for the false-timeout bug: a run whose terminal
prompt_result lands after the timeout or abort fires is recorded as a
completed success (real result, real token usage, exit 0); a run that
never reaches a terminal result still fails as a timeout; the per-run
input.timeoutMs is honoured over the adapter default; and an abort
before completion tears down the whole process group.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@seancdavis seancdavis force-pushed the fix/gemini-acp-completion-timeout branch from 695d46c to 98ffa1b Compare June 23, 2026 17:45

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/adapters/base/acp-adapter.ts`:
- Around line 454-462: The `killedAfterCompletion` logic on line 456 does not
account for the SIGTERM sent by the finally block after clean completion,
causing clean completions with no assistant text to be incorrectly marked as
failed. Modify the error gate condition on line 459 (the check that includes
`exitCode !== 0 && !state.lastAssistantMessage`) to also gate on
`!completedCleanly` so that when the process completes cleanly, the non-zero
exit code from the finally block's SIGTERM is not treated as an error. This
ensures clean completions are not fabricating an error message just because of
the normal cleanup SIGTERM.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f76c1aad-d89c-4f44-bf46-a98352a7a6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 695d46c and 98ffa1b.

📒 Files selected for processing (2)
  • src/adapters/base/acp-adapter.ts
  • test/unit/adapters/acp-adapter.test.ts
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • netlify/blueprints (manual)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/unit/adapters/acp-adapter.test.ts

Comment thread src/adapters/base/acp-adapter.ts Outdated
seancdavis and others added 2 commits June 23, 2026 14:08
A run that reached a terminal result (end_turn) but produced only tool
calls — no final assistant text — left lastAssistantMessage null. The
cleanup path closes stdin and SIGTERMs a lingering CLI, so the child
exits non-zero, and the exit-code error gate then manufactured "Agent
process exited with non-zero code", marking a successful run as failed.

Gate that error (and the exit-code normalisation) on completedCleanly
rather than only on a timeout/abort kill, so any non-zero/signal exit
after a clean turn is treated as our own cleanup, not a failure.

Addresses CodeRabbit review finding on PR #43.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A scenario time limit, token limit, or Ctrl-C is a hard cap. Previously
the adapter preserved a run that completed in the same instant the
timeout/abort fired as a success — but the runner already re-stamped
aborts as failures, so time-limit and token-limit races resolved
differently (pass vs. fail).

Drop the completion-race preservation from the timeout/abort paths so a
cutoff always fails, regardless of a last-instant terminal result. Token
usage observed before the cut is still attributed. The original
false-timeout bug stays fixed: those runs finish well within the limit
(honouring input.timeoutMs) and are reaped via the process tree, so no
cutoff fires for them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@seancdavis seancdavis merged commit 3eacdba into main Jun 23, 2026
9 checks passed
@seancdavis seancdavis deleted the fix/gemini-acp-completion-timeout branch June 23, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant