cortexkit · Zireael · May 24, 2026 · May 24, 2026 · May 24, 2026 · May 24, 2026
diff --git a/.alfonso/plans/codegraph-benchmark-replication.md b/.alfonso/plans/codegraph-benchmark-replication.md
@@ -0,0 +1,87 @@
+# CodeGraph benchmark replication plan for AFT
+
+## 1. Metrics to replicate
+
+Replicate the deterministic, no-LLM retrieval-quality eval from `codegraph/__tests__/evaluation/`:
+
+- **Recall** over expected symbols, using the same pass rule as CodeGraph (`recall >= 0.5`).
+- **MRR** from the first ranked result that matches an expected symbol or expected file.
+- **Precision@k** for k = 1, 5, 10. CodeGraph's current scorer does not expose P@k, but the user asked for it and it is compatible with the same ranked result list.
+- **Found/missed symbols** per case.
+- **Real wall-clock latency** around the actual tool dispatch. The report will include per-case latency samples plus median and p95 latency at driver summary level. With `--runs > 1`, each query gets per-query median/p95; with the default single run those values equal the single dispatch time.
+
+Keep CodeGraph's `nodeCount`, `edgeCount`, and `edgeDensity` fields optional. AFT's retrieval tools do not expose graph edge counts for `aft_search`, `grep`, or ripgrep, so those fields will remain absent instead of fabricated.
+
+## 2. Corpus choice
+
+Use three corpus sources:
+
+1. **`codegraph` (default for apples-to-apples AFT runs):** an AFT-side translation of CodeGraph's 12 test-case shapes. It preserves CodeGraph's split between exact symbol lookup (`searchNodes`) and broader context exploration (`findRelevantContext`), but rewrites Elasticsearch-specific symbols (`TransportService`, `RestController`, etc.) to equivalent symbols in this repository (`BinaryBridge`, `BridgeOptions`, `handle_semantic_search`, etc.). Each rewritten case records its `sourceCaseId` and a note explaining the substitution.
+2. **`codegraph-original`:** a JSON copy of the exact CodeGraph structured corpus. This is useful when someone points the harness at Elasticsearch or another checkout containing those symbols. It is expected to fail or be skipped on `opencode-aft`, so it is not the default run for this repo.
+3. **`aft`:** small AFT-native supplemental cases for tool-surface coverage that CodeGraph does not have one-to-one (outline/zoom/navigate-oriented cases). Custom corpus files can also be loaded by path with the same schema.
+
+This keeps the publishable comparison honest: `codegraph-original` is the literal upstream corpus; `codegraph` is the translated corpus used to run the same methodology against AFT itself.
+
+## 3. Tool mapping
+
+| CodeGraph eval API/tool | AFT equivalent in this harness | Notes |
+| --- | --- | --- |
+| `searchNodes(query, { limit, kinds })` | `aft_search` (`semantic_search` bridge command with `top_k`) | Use symbol/file/kind metadata from AFT hybrid results. `kinds` is retained as corpus metadata and reported, but AFT does not currently filter semantic search by kind. |
+| `findRelevantContext(query, { searchLimit, traversalDepth, maxNodes })` | `aft_search` by default; optional corpus cases may request `aft_outline`, `aft_zoom`, or `aft_navigate` | AFT has separate focused tools instead of one subgraph-returning context API. For apples-to-apples scoring, the ranked retrieval result is still normalized into the same item list. |
+| CodeGraph `node`/source inspection | `aft_zoom` | Only for cases with explicit `file` + `symbol`; not used for broad search scoring by default. |
+| CodeGraph `context`/file overview | `aft_outline` | Useful for AFT-specific supplemental cases. Outline text is normalized into file/symbol-ish result items when possible. |
+| CodeGraph `trace`/call graph | `aft_navigate` commands (`callers`, `call_tree`, `trace_to_symbol`, etc.) | Only measured for explicit navigate cases; graph edge density is not scored. |
+| Plain lexical baseline | AFT bridge `grep` and external `rg -F` | Both use real wall-clock dispatch and fixed-string lexical matching. |
+| Sanity baseline | List files only | Ranks file paths without looking at query text; proves the scorer is not trivially passing. |
+
+## 4. What will not be replicated
+
+- **Agent A/B matrix** (`scripts/agent-eval/`, tmux/Claude runs, token/cost/tool-call behavior): explicitly out of scope for this task and depends on harness machinery AFT does not have here.
+- **Graph edge metrics** (`edgeCount`, `edgeDensity`) for non-graph AFT drivers: AFT does not expose a CodeGraph-style returned subgraph for `aft_search`, AFT grep, ripgrep, or list-files. Reporting zero would be misleading, so those fields stay omitted.
+- **Kind-filtered semantic retrieval:** CodeGraph can pass `kinds` into `searchNodes`; AFT's semantic search does not accept a kind filter today. Kinds are used only for metadata/diagnostics.
+- **AFT `aft_search` vs CodeGraph on Elasticsearch in this commit:** the harness supports `codegraph-original`, but the verification run for this task is against `opencode-aft` because that is the indexed local target.
+
+## 5. Output format
+
+Emit JSON close to CodeGraph's `EvalReport`:
+
+```ts
+{
+  timestamp: string,
+  codebasePath: string,
+  codegraphSha: string,
+  aftSha?: string,
+  benchmark: "codegraph-replication",
+  corpus: string,
+  driver: string,
+  summary: {
+    total: number,
+    passed: number,
+    failed: number,
+    skipped: number,
+    meanRecall: number,
+    meanMRR: number,
+    meanPrecisionAt1: number,
+    meanPrecisionAt5: number,
+    meanPrecisionAt10: number,
+    latencyMsMedian: number,
+    latencyMsP95: number
+  },
+  results: EvalResult[]
+}
+```
+
+`EvalResult` keeps CodeGraph-compatible fields (`caseId`, `pass`, `recall`, `mrr`, `foundSymbols`, `missedSymbols`, `latencyMs`) and adds ranked `results`, `precisionAtK`, `driver`, `api`, and optional `skipReason`. A markdown summary with the same aggregate table and per-case rows will be written beside the JSON so results can be pasted into docs/README.
+
+## 6. code-review-graph patterns borrowed
+
+I also read `/Users/ufukaltinok/Work/OSS/code-review-graph/code_review_graph/eval/` for methodology inspiration. This benchmark will still replicate CodeGraph first, but borrows these low-cost patterns where they improve reproducibility without adding dependencies on that project:
+
+- **Pinned repo metadata shape:** corpus entries can carry repo name, URL, language, size category, and pinned commit fields, matching code-review-graph's `configs/*.yaml` discipline. v1 runs against `opencode-aft`, but this schema lets us add the reusable `fastapi`, `flask`, `gin`, `express`, `httpx`, and `code-review-graph` repos later without redesign.
+- **Separated task axes:** keep CodeGraph's `searchNodes` vs `findRelevantContext` API labels, but also tag cases with categories analogous to code-review-graph's `search_queries` and `multi_hop_tasks` so later reports can split symbol lookup, context exploration, and navigation/multi-hop retrieval.
+- **Deterministic reporting:** include corpus path, codebase SHA, AFT binary path, driver, top-k, and runs in every report. This mirrors code-review-graph's pinned-SHA/config-driven reproducibility while keeping the AFT harness simple.
+- **Real wall-clock timing per dispatch:** code-review-graph times build/search stages directly; AFT will time the actual bridge or process dispatch around each query and aggregate median/p95.
+- **Token accounting is deferred:** code-review-graph's tiktoken-calibrated token-efficiency axis is useful, but it belongs to a broader agent/context benchmark, not this no-LLM CodeGraph retrieval replication. v1 may record result payload sizes later, but will not mix token-efficiency scores into retrieval quality.
+
+Patterns intentionally not borrowed for v1: the six-axis suite (`impact_accuracy`, `multi_hop_retrieval`, `search_quality`, `token_efficiency`, `flow_completeness`, `build_performance`) and repository cloning/build orchestration. Those are valuable follow-on axes, but this deliverable stays focused on deterministic retrieval scoring against AFT's actual tool surface.
+
diff --git a/.alfonso/release-notes/v0.30.0.md b/.alfonso/release-notes/v0.30.0.md
@@ -0,0 +1,52 @@
+# PTY support — agents can now drive real terminals
+
+The headline of this release. `bash` now accepts `pty: true` (with `background: true`) to spawn commands inside a real PTY — every interactive program that needed a terminal is now reachable from an agent loop. Python and Node REPLs, `vim`, `htop`, `top`, `less`, `fzf`, build TUIs, even a nested `opencode` session — all work end-to-end.
+
+![Yo dawg, I heard you like OpenCode so I put an OpenCode inside your OpenCode](assets/ocinoc.png)
+
+Yes, really — `opencode` inside `opencode` works. PTY support means the agent can drive any TUI, including a full nested AFT-equipped OpenCode session, complete with sidebar, MCP servers, LSP status, and another agent answering prompts. Recursion all the way down.
+
+### How it works
+
+- **`bash({pty: true, background: true, ptyRows?, ptyCols?})`** — spawn a PTY-backed task. Defaults are 24×80; caps are 60×140 to keep `bash_status` snapshots bounded.
+- **`bash_status({taskId, outputMode})`** — read the terminal state.
+  - `"screen"` — vt100-rendered visible terminal (rows × cols characters)
+  - `"raw"` — uncompressed bytes including ANSI escape sequences
+  - `"both"` — separate fields for each
+- **`bash_write({taskId, input})`** — send keystrokes. Input is either a verbatim string or an array mixing strings and `{key: "..."}` objects for atomic text + control key sequences:
+
+  ```
+  bash_write({taskId, input: [
+    "iHello",
+    {key: "esc"},
+    ":wq",
+    {key: "enter"},
+  ]})
+  ```
+
+  Named keys cover `enter`/`return` (CR), `tab`, `space`, `backspace`, `esc`/`escape`, arrow keys, navigation keys, `delete`, `insert`, `f1`–`f12`, and `ctrl-a` through `ctrl-z`.
+
+PTY tasks run on Unix via `portable-pty` and on Windows via ConPTY.
+
+## bash_watch unifies pattern notifications and sync waits
+
+New `bash_watch` tool replaces ad-hoc wait flags on `bash_status`. Two modes:
+
+**Sync** — `bash_watch({taskId, pattern?, timeoutMs?})` blocks until the pattern matches, the task exits, or timeout. Without a pattern it waits for task exit. Returns the snapshot inline so the agent gets the result without a separate completion reminder.
+
+**Async** — `bash_watch({taskId, pattern, background: true})` registers a pattern watcher and returns immediately. When the pattern matches mid-stream or the task exits, a single `[BG BASH NOTIFY]` reminder fires with the matched line. The default `[BACKGROUND BASH COMPLETED]` reminder is suppressed for that task.
+
+`bash_status` is now a pure snapshot tool — wait/watch semantics live in `bash_watch`.
+
+## URL fetches no longer hang on slow servers
+
+`aft_outline` and `aft_zoom` URL targets now abort with a clear stall error after 15 seconds without a chunk. Previously a slow or stalled server could hang the bridge indefinitely while waiting on `reader.read()`.
+
+## Other
+
+- `bash` schema rejects `pty: true` without `background: true` and `ptyRows`/`ptyCols` without `pty: true`.
+- OpenCode subagent sessions silently convert `background: true` to foreground bash unless `bash.subagent_background = true` in config.
+- `bash_status` and `bash_kill` are always registered when `bash` is registered (no longer gated on `experimental.bash.background`).
+- Background bash completion delivery now persists `completion_delivered` across plugin restarts, so previously-delivered tasks no longer replay as fresh reminders after restart.
+- Async `bash_watch` exit notifications render as `task X exited` instead of the prior `matched "exited (exit 0)"` framing.
+- The release script blocks minor-version releases when the in-plugin `ANNOUNCEMENT_VERSION` is stale relative to the release tag.
diff --git a/.alfonso/release-notes/v0.30.1.md b/.alfonso/release-notes/v0.30.1.md
@@ -0,0 +1,38 @@
+# v0.30.1
+
+Patch release. Three classes of user-facing fixes: bash PTY parameter handling, LSP failure diagnostics, and Windows plugin auto-update.
+
+## Bash — PTY parameter handling
+
+Agents that defensively included `ptyRows` or `ptyCols` on regular (non-PTY) bash calls were hitting a strict validation error. Some models tried to "fix" it by adding `pty: true` to non-interactive commands, which auto-promoted them to background and broke inline output.
+
+- `ptyRows` and `ptyCols` are now soft-ignored when `pty` is unset or false. The dimensions are only applied when a PTY is actually requested.
+- `pty: true` now implies `background: true`. The two flags no longer have to be set together.
+- Out-of-range or non-integer values return a clean error naming the allowed bounds (e.g. `ptyRows must be an integer between 1 and 60`).
+- Tool descriptions for `ptyRows`/`ptyCols` clarify they apply only when `pty: true`.
+
+## Plugin tool schemas
+
+All optional numeric parameters across the OpenCode plugin (bash, read, aft_search, aft_navigate, aft_zoom, aft_outline, refactor, lsp_diagnostics) now use a JSON-Schema-representable bounded integer schema. Empty sentinels (null, empty string, zero) are rejected at validation with a clear message instead of silently being coerced or — as in an earlier internal build — causing the plugin to fail to load.
+
+A schema-conversion regression test now covers every registered tool, so any future change that introduces an unrepresentable shape will fail before release.
+
+## LSP — failure visibility
+
+When an LSP server fails to start, AFT's response now surfaces stderr output captured from the child process. Previously, broken language-server shims (such as a `typescript-language-server` whose `cli.mjs` was missing) returned opaque `spawn_failed` errors without context.
+
+- Stderr from LSP children is captured in a bounded ring buffer and included in failure responses.
+- When stderr contains `MODULE_NOT_FOUND`, the response adds a hint pointing at the likely fix (reinstall the package-manager binary, or check the `lsp.servers.<name>.binary` path).
+- Clients that crash after a successful initialize are now marked as failed so subsequent file requests stop re-issuing pulls against the dead pipe.
+
+## Auto-update on Windows
+
+Plugin self-update used `spawn("npm")` directly, which fails on Windows because the binary is `npm.cmd`. The auto-update path now resolves `npm.cmd` on Windows (same fix shape as the v0.28.2 LSP install correction).
+
+- `npm install` stderr is captured on failure for diagnostic visibility.
+- `--ignore-scripts` is now passed to the install (matches the LSP install hardening).
+
+## Other
+
+- `aft_outline`/`aft_zoom` URL fetch keeps the 15-second body-stall safety net that landed in v0.30.0.
+- Subagent sessions continue to silently convert `background: true` bash to foreground (introduced in v0.30.0), because subagents have no completion-reminder mechanism.