Make Auto Review snapshot-aware and preserve superseded results

## Summary

Auto Review often finishes after the work it reviewed has already moved on or merged. The findings can then arrive in the active LLM session as if they apply to the current state, even when they target an older snapshot. We should make Auto Review snapshot-aware, preserve superseded review artifacts, and evaluate model/follow-up settings before changing defaults.

## Current Evidence

Local session catalog sample from `/Users/cbusillo/.code/sessions/index/catalog.jsonl`:

- 221 Auto Review sessions found.
- Median duration: about 5.3 minutes.
- 75th percentile: about 14.2 minutes.
- 90th percentile: about 39.5 minutes.
- Mean duration: about 16.2 minutes.
- 41 sessions exceeded 20 minutes.
- 18 sessions exceeded 60 minutes.
- Longest observed session was about 163.6 minutes.

Developer-message history also showed 280 background Auto Review completion notifications spread across 9 main sessions, with one session receiving 83 completion messages. This matches the observed behavior where findings can arrive repeatedly after the main work has already continued.

Current local config appears to use:

```toml
auto_review_model = "gpt-5.4-mini"
auto_review_model_reasoning_effort = "low"
auto_review_resolve_model = "gpt-5.4"
auto_review_resolve_model_reasoning_effort = "high"
auto_review_use_chat_model = false
auto_review_followup_attempts = 10
review_auto_resolve = true
```

This suggests initial review may be relatively cheap, while auto-resolve follow-up loops can become long, expensive sessions.

## Problem

Auto Review currently behaves like an asynchronous reviewer but reports into the active session as if its result is current. That creates several issues:

- Late findings can target a stale commit or branch state.
- The active LLM may not know whether Auto Review is still running.
- Superseded review work can still be useful, but it should not be treated as fresh instructions.
- Up to 10 auto-resolve follow-ups can turn a background review into a long-running workstream.
- Changing model/follow-up defaults without evidence could make review faster but worse.

## Desired Behavior

- Track Auto Review state by `(repo, branch/worktree, target snapshot, diff base)`.
- Expose active Auto Review state to the main agent/session.
- On completion, compare the review target snapshot to current HEAD or the current active work snapshot.
- If the result is current, report it normally.
- If the result is superseded, preserve the worktree/artifact and report it as stale/superseded instead of injecting it as active instructions.
- Keep superseded findings discoverable for manual adoption.
- Prevent duplicate reviews for the same snapshot from producing repeated active interruptions.

## Proposed Approach

1. Add snapshot metadata to Auto Review lifecycle records and completion messages.
2. Add a completion gate that classifies results as current or superseded before surfacing them to the main session.
3. Preserve superseded artifacts and worktrees, but route them to a review inbox/log instead of the active instruction stream.
4. Add status visible to the active agent: idle/running/current findings/superseded/cancelled, target snapshot, age, worktree path, and auto-resolve phase.
5. Build a replay/evaluation harness from past Auto Review sessions before changing model defaults or follow-up limits.

## Model And Follow-Up Study

Questions to answer before tuning defaults:

- Does `gpt-5.4-mini` at low reasoning produce noisy findings that trigger unnecessary auto-resolve loops?
- Would a stronger reviewer with fewer follow-ups reduce total wall time and interruptions while preserving quality?
- Is `gpt-5.4` high reasoning necessary for every resolve pass?
- What is the quality/speed tradeoff for one or two follow-ups versus the current limit of 10?

Suggested variants to compare on replayed historical cases:

- Current config.
- Same reviewer, one follow-up.
- Stronger reviewer, one follow-up.
- Stronger reviewer, no auto-resolve.
- Current reviewer, resolve at medium reasoning.

Score for finding correctness, duplicate/noisy findings, stale result handling, wall time, number of turns, and whether the final result materially improved after follow-ups.

## Acceptance Criteria

- [ ] Auto Review completion messages include target snapshot metadata and the current branch/head they were compared against.
- [ ] A review result is treated as stale/superseded when the active workspace branch/head or review epoch has advanced past the snapshot the review targeted before results are injected.
- [ ] Superseded Auto Review results are not injected as active instructions; stale findings are archived with snapshot metadata and remain discoverable as historical review output.
- [ ] If branch/head advances while Auto Review is in flight, completion marks the result superseded and archives it instead of injecting follow-up instructions.
- [ ] Active sessions can see whether Auto Review is running, what snapshot it targets, and whether newer local/remote changes supersede it.
- [ ] Duplicate/completed reviews for the same snapshot do not repeatedly interrupt the user or agent.
- [ ] Auto Review has bounded fix-train behavior: it stops or asks for help after a concrete retry/review depth limit instead of chaining indefinitely.
- [ ] Wrong-session delivery is prevented or detected: Auto Review results must identify the originating session/conversation and must not become active instructions in a different live session.
- [ ] The review model gets a defined minimum context package: reviewed diff, branch/head state, relevant current repo status, prior review/fix attempts for the same snapshot or fix train, and target session metadata.
- [ ] Auto Review latency is measured per invocation before model/follow-up defaults or async routing behavior are changed.
- [ ] Rollout files, session logs, or debug logs can be scanned to observe/prove stale review, wrong-session delivery, duplicate review, and fix-train cases before and after changes.
- [ ] Stale-result routing, context packaging, wrong-session routing, and fix-train limits are covered by targeted regression tests, exec-harness tests, VT100/UI tests, or replay fixtures as appropriate.
- [ ] Exec harness coverage is added where possible so Auto Review behavior can be tested outside an interactive TUI session.
- [ ] Local LLM / LM Studio experiments may be used for large rollout/log scanning, but findings that drive code changes must be backed by deterministic scripts, sampled artifacts, or reproducible queries.

## Current Status

State: Active, inventory started; do not execute broad fixes yet.

Immediate slice: prove and prevent stale Auto Review results from becoming active instructions. Ship snapshot identity, stale/superseded detection, wrong-session guards where needed, and result routing before broad visibility, latency, or model-default changes.

Inventory findings so far:

- TUI Auto Review launches from `ChatWidget::maybe_trigger_auto_review` / `launch_background_review`, then `run_background_review` captures a ghost snapshot and creates an AutoReview agent.
- `run_background_review` already passes useful metadata into the agent manager: owner session id, branch/batch id, snapshot id as `worktree_base`, and `AgentSourceKind::AutoReview`.
- `AgentInfo` status payloads currently drop some of that metadata before the TUI and exec tracker see completions: owner session id and snapshot/worktree base are not exposed in `AgentInfo`.
- The active-instruction injection point is concrete: TUI completion handling builds a developer note in `on_background_review_finished`, then `record_background_review_note` can send `Op::AddPostTurnDeveloperInput` when findings exist.
- Exec Auto Review has a separate `AutoReviewTracker` that emits `[auto-review] ... Merge <worktree>` summaries, but it tracks branch/path/summary/error rather than snapshot/session metadata.
- Existing TUI tests cover background review launch, duplicate suppression, completion notices, failure classification, pending ranges, and stale timeout reclamation, but not old/wrong-session terminal completion arriving after a newer review starts.
- Rollout files already contain evidence candidates. Recent `~/.code/sessions/2026/06/01` and `2026/05/31` rollouts include `Background auto-review`, `Snapshot:`, and `Merge /Users/cbusillo/.code/working/...` entries. Today’s `rollout-2026-06-01T10-33-35-...jsonl` includes multiple background auto-review entries.
- Always-on artifacts include `~/.code/state/review/repo-*/review.lock`, `snapshot.epoch`, and `~/.code/debug_logs/critical.log.*`; opt-in TUI session logs live under `~/.code/debug_logs/session-*.jsonl` only when `CODEX_TUI_RECORD_SESSION=1` is enabled.

Recommended first PR shape:

1. Extend Auto Review status metadata so completion handling can compare agent id, owner session id, and snapshot/worktree base against the live `BackgroundReviewState`.
2. Add a pure/session-safe stale-or-mismatched completion predicate and unit tests.
3. Gate `on_background_review_finished` / `record_background_review_note` so stale or wrong-session completions are archived/visible but never enqueue `AddPostTurnDeveloperInput` or update reviewed markers.
4. Add an exec-harness or exec-unit test for AutoReviewTracker metadata once the protocol exposes snapshot/session fields; otherwise add a focused TUI harness test first and document the exec gap.
5. Add a deterministic rollout/log scanner prototype or fixture that detects stale-review/fix-train signatures without relying on an LLM.

Evidence strategy: Before changing behavior, inspect recent rollout files, session logs, and debug logs for concrete stale-review, wrong-session, duplicate-review, and fix-train examples. Build deterministic scanners or replay fixtures for any case we intend to fix. Use local LLM / LM Studio experiments for broad artifact triage where useful, but require deterministic evidence before accepting conclusions.

Testing strategy: Prefer exec-harness coverage for lifecycle/result-routing behavior that does not require a live terminal. Use VT100/UI tests for user-visible review status, stale-result presentation, or interruption behavior. Add focused unit tests for pure snapshot/session matching and fix-train depth logic.

Suggested sequence:
1. Lifecycle and artifact inventory with agents. (started 2026-06-01)
2. Rollout/log scanner prototype for stale, duplicate, wrong-session, and fix-train signatures.
3. Snapshot identity and stale-result routing.
4. Wrong-session delivery guard.
5. Running/completed/superseded review visibility for active sessions.
6. Duplicate review suppression for the same snapshot.
7. Bounded fix-train depth and help/stop behavior.
8. Reviewer context completeness.
9. Latency measurement and async routing decisions.
10. Replay-backed model/follow-up default tuning.

Blocked by: None. Upstream Codex CLI alignment is intentionally parked until Auto Review can reliably avoid stale active instructions, wrong-session delivery, and unbounded fix trains.

Sub-issue decision: Wait until the lifecycle inventory maps real code boundaries. If a child issue is created before then, make it only the first slice: "Prevent stale Auto Review results from becoming active instructions."

Last verified: 2026-06-01 from live session feedback, recent stale Auto Review interruptions, agent lifecycle/test/artifact inventories, and bounded rollout/log scans.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Auto Review snapshot-aware and preserve superseded results #76

Summary

Current Evidence

Problem

Desired Behavior

Proposed Approach

Model And Follow-Up Study

Acceptance Criteria

Current Status

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Make Auto Review snapshot-aware and preserve superseded results #76

Description

Summary

Current Evidence

Problem

Desired Behavior

Proposed Approach

Model And Follow-Up Study

Acceptance Criteria

Current Status

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions