38 commits
66669a6
chore: add docker rust validation workflow
Zireael May 24, 2026
50a7e65
aft-t6p.7: provider capabilities — config profiles, dimension pass-th…
Zireael May 24, 2026
34073be
aft-t6p.1: embedding query/document prompt-template support
Zireael May 24, 2026
f60a2a9
aft-t6p.15: semantic config trust boundary — TypeScript schema, warni…
Zireael May 24, 2026
0f640ca
aft-t6p.8: semantic index lifecycle — immutable snapshots, stale-vect…
Zireael May 25, 2026
54377d9
chore: add testuser non-root runner to docker-rust.ps1, update benchm…
Zireael May 25, 2026
0c60fcc
aft-t6p.9: semantic fingerprint — config matrix, diff engine, V6→V7 u…
Zireael May 25, 2026
63c8319
aft-t6p.10: file policy, docs chunker, fingerprint matrix
Zireael May 25, 2026
a6fb00c
feat(aft-t6p.11): non-blocking cold start index with cancellation, pr…
Zireael May 25, 2026
fa95b5e
feat(semantic): contextualized document-chunk embedding (aft-t6p.23)
Zireael May 27, 2026
0a683f1
feat(semantic-index): VectorStore abstraction — extract flat store in…
Zireael May 28, 2026
6138adb
fix(downloader): separate binary replacement from temp cleanup
Zireael May 28, 2026
02973c4
feat(semantic): V8 serialization with file manifest and chunk_hash
Zireael May 28, 2026
2e4ccb9
feat(semantic): typed vector representation with storage strategy, no…
Zireael May 29, 2026
134aa04
fix(semantic): case-insensitive Content-Length parsing in mock server…
Zireael May 30, 2026
8d0a976
aft-t6p.22: native binary packed-vector storage and Hamming search
Zireael May 30, 2026
945cef2
chore: bead tracking, architecture docs, and biome config
Zireael May 30, 2026
3810ac3
aft-t6p.3: search pipeline metrics and diagnostics core
Zireael May 30, 2026
656df81
aft-t6p.13: JSONL semantic diagnostics logging
Zireael May 30, 2026
f0bf72d
aft-t6p.16: DiagnosticsOutputMode — configurable verbosity in aft_sea…
Zireael May 30, 2026
0195bd2
feat(aft-t6p.15): add reranking pipeline for semantic search
Zireael May 31, 2026
6e4c862
test(aft-t6p.6.1): add config, profile, and typed-vector tests
Zireael May 31, 2026
a7ff8e4
test(aft-t6p.6.2): add fingerprint diff matrix tests
Zireael May 31, 2026
e5d427b
test(aft-t6p.6.3): add file policy, docs chunking, and manifest tests
Zireael May 31, 2026
7eade04
test(aft-t6p.6.4): add VectorStore, binary packed-vector, and Hamming…
Zireael May 31, 2026
0bd2b64
test(aft-t6p.6.5): add lifecycle, snapshot, and pruning tests
Zireael May 31, 2026
91e31e1
test(aft-t6p.6.6): add search pipeline, metrics, and diagnostics tests
Zireael May 31, 2026
779770f
test(aft-t6p.6.7): add concurrency and race condition tests
Zireael May 31, 2026
51f8a4d
test(aft-t6p.6.8): add security trust boundary tests
Zireael May 31, 2026
37a980a
fix: add missing source_vector_kind to validate_compatible test
Zireael May 31, 2026
a5473f6
feat(aft-t6p.14): add semantic eval harness
Zireael Jun 1, 2026
09690ff
feat(aft-t6p.17): add semantic doctor health-check command
Zireael Jun 1, 2026
8f5cf53
feat(aft-t6p.4): extend status with semantic health metrics
Zireael Jun 1, 2026
b008fae
test(aft-t6p.2.1): add reranking tests and behavior fixes
Zireael Jun 2, 2026
45f4ed0
chore: remove local agent tooling dirs from PR and gitignore
Zireael Jun 2, 2026
349332b
chore: remove remaining non-source files from PR
Zireael Jun 2, 2026
603115c
chore: restore upstream .alfonso, keep other junk removed
Zireael Jun 2, 2026
95ea25c
fix: address greptile and qubic review comments on PR #87
Zireael Jun 2, 2026
87 changes: 87 additions & 0 deletions .alfonso/plans/codegraph-benchmark-replication.md
@@ -0,0 +1,87 @@
# CodeGraph benchmark replication plan for AFT

## 1. Metrics to replicate

Replicate the deterministic, no-LLM retrieval-quality eval from `codegraph/__tests__/evaluation/`:

- **Recall** over expected symbols, using the same pass rule as CodeGraph (`recall >= 0.5`).
- **MRR** from the first ranked result that matches an expected symbol or expected file.
- **Precision@k** for k = 1, 5, 10. CodeGraph's current scorer does not expose P@k, but the user asked for it and it is compatible with the same ranked result list.
- **Found/missed symbols** per case.
- **Real wall-clock latency** around the actual tool dispatch. The report will include per-case latency samples plus median and p95 latency at driver summary level. With `--runs > 1`, each query gets per-query median/p95; with the default single run those values equal the single dispatch time.
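
The scoring rules above can be sketched as a few pure functions. This is a minimal illustration, not the harness code: the `RankedItem` and `Expected` shapes, the function names, and the handling of empty expectations are assumptions made for the example.

```ts
// Illustrative shapes only; the real harness types may differ.
interface RankedItem {
  symbol?: string;
  file?: string;
}

interface Expected {
  symbols: string[];
  files?: string[];
}

// A ranked item is relevant if it matches an expected symbol or an expected file.
function isRelevant(item: RankedItem, expected: Expected): boolean {
  const symbolHit = item.symbol !== undefined && expected.symbols.includes(item.symbol);
  const fileHit = item.file !== undefined && (expected.files ?? []).includes(item.file);
  return symbolHit || fileHit;
}

// Recall over expected symbols, plus the found/missed split reported per case.
function scoreRecall(results: RankedItem[], expected: Expected) {
  const returned = new Set<string>();
  for (const r of results) {
    if (r.symbol !== undefined) returned.add(r.symbol);
  }
  const found = expected.symbols.filter((s) => returned.has(s));
  const missed = expected.symbols.filter((s) => !returned.has(s));
  // Assumption: a case with no expected symbols scores 0 instead of dividing by zero.
  const recall = expected.symbols.length === 0 ? 0 : found.length / expected.symbols.length;
  return { recall, found, missed, pass: recall >= 0.5 };
}

// Reciprocal rank of the first relevant result; 0 when nothing relevant is returned.
// Averaging this value over all cases gives the reported mean MRR.
function reciprocalRank(results: RankedItem[], expected: Expected): number {
  const index = results.findIndex((r) => isRelevant(r, expected));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Precision@k: relevant items in the top k, divided by k.
// Assumption: the denominator stays k even when fewer than k results come back.
function precisionAtK(results: RankedItem[], expected: Expected, k: number): number {
  const hits = results.slice(0, k).filter((r) => isRelevant(r, expected)).length;
  return hits / k;
}
```

With one of two expected symbols found at rank 2, this sketch yields recall 0.5 (a pass under the `recall >= 0.5` rule) and a reciprocal rank of 0.5.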

Keep CodeGraph's `nodeCount`, `edgeCount`, and `edgeDensity` fields optional. AFT's retrieval tools do not expose graph edge counts for `aft_search`, `grep`, or ripgrep, so those fields will remain absent instead of fabricated.

## 2. Corpus choice

Use three corpus sources:

1. **`codegraph` (default for apples-to-apples AFT runs):** an AFT-side translation of CodeGraph's 12 test-case shapes. It preserves CodeGraph's split between exact symbol lookup (`searchNodes`) and broader context exploration (`findRelevantContext`), but rewrites Elasticsearch-specific symbols (`TransportService`, `RestController`, etc.) to equivalent symbols in this repository (`BinaryBridge`, `BridgeOptions`, `handle_semantic_search`, etc.). Each rewritten case records its `sourceCaseId` and a note explaining the substitution.
2. **`codegraph-original`:** a JSON copy of the exact CodeGraph structured corpus. This is useful when someone points the harness at Elasticsearch or another checkout containing those symbols. It is expected to fail or be skipped on `opencode-aft`, so it is not the default run for this repo.
3. **`aft`:** a small set of AFT-native supplemental cases covering tool surfaces that have no one-to-one CodeGraph equivalent (outline/zoom/navigate-oriented cases). Custom corpus files can also be loaded by path, provided they use the same schema.

This keeps the publishable comparison honest: `codegraph-original` is the literal upstream corpus; `codegraph` is the translated corpus used to run the same methodology against AFT itself.
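
A translated corpus entry might look like the sketch below. Only `sourceCaseId` and the substitution note come from the plan; every other field name, the case IDs, the pairing of `TransportService` with `BinaryBridge`, and the `validateCase` helper are hypothetical and shown only to make the shape concrete.

```ts
// Hypothetical corpus-case schema; only sourceCaseId and note are named in the plan.
type EvalApi = "searchNodes" | "findRelevantContext";

interface CorpusCase {
  id: string;
  api: EvalApi;
  query: string;
  expectedSymbols: string[];
  expectedFiles?: string[];
  kinds?: string[];        // kept as metadata; AFT does not filter by kind
  sourceCaseId?: string;   // upstream CodeGraph case this entry was translated from
  note?: string;           // explanation of the symbol substitution
}

// Returns a list of problems; an empty list means the case is usable.
function validateCase(c: CorpusCase): string[] {
  const problems: string[] = [];
  if (c.query.trim() === "") problems.push("query is empty");
  if (c.expectedSymbols.length === 0 && (c.expectedFiles ?? []).length === 0) {
    problems.push("case expects neither symbols nor files");
  }
  // A translated case must explain what was substituted and why.
  if (c.sourceCaseId !== undefined && (c.note ?? "").trim() === "") {
    problems.push("translated case is missing its substitution note");
  }
  return problems;
}

// Illustrative entry: IDs and the exact symbol pairing are made up for this example.
const exampleCase: CorpusCase = {
  id: "aft-exact-01",
  api: "searchNodes",
  query: "BinaryBridge",
  expectedSymbols: ["BinaryBridge"],
  kinds: ["class"],
  sourceCaseId: "codegraph-exact-01",
  note: "Replaces the Elasticsearch symbol TransportService with an AFT symbol.",
};
```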

## 3. Tool mapping

| CodeGraph eval API/tool | AFT equivalent in this harness | Notes |
| --- | --- | --- |
| `searchNodes(query, { limit, kinds })` | `aft_search` (`semantic_search` bridge command with `top_k`) | Use symbol/file/kind metadata from AFT hybrid results. `kinds` is retained as corpus metadata and reported, but AFT does not currently filter semantic search by kind. |
| `findRelevantContext(query, { searchLimit, traversalDepth, maxNodes })` | `aft_search` by default; optional corpus cases may request `aft_outline`, `aft_zoom`, or `aft_navigate` | AFT has separate focused tools instead of one subgraph-returning context API. For apples-to-apples scoring, the ranked retrieval result is still normalized into the same item list. |
| CodeGraph `node`/source inspection | `aft_zoom` | Only for cases with explicit `file` + `symbol`; not used for broad search scoring by default. |
| CodeGraph `context`/file overview | `aft_outline` | Useful for AFT-specific supplemental cases. Outline text is normalized into file/symbol-ish result items when possible. |
| CodeGraph `trace`/call graph | `aft_navigate` commands (`callers`, `call_tree`, `trace_to_symbol`, etc.) | Only measured for explicit navigate cases; graph edge density is not scored. |
| Plain lexical baseline | AFT bridge `grep` and external `rg -F` | Both use real wall-clock dispatch and fixed-string lexical matching. |
| Sanity baseline | List files only | Ranks file paths without looking at query text; proves the scorer is not trivially passing. |
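
The sanity baseline in the last row is simple enough to sketch. The version below assumes alphabetical ordering and a hypothetical `listFilesBaseline` name; the point it illustrates is that the query never influences the ranking.

```ts
// Sanity-baseline sketch: rank file paths without ever reading the query.
// Assumption: alphabetical order; any fixed, query-independent order would do.
function listFilesBaseline(files: string[], _query: string, topK: number): { file: string }[] {
  return [...files]
    .sort()
    .slice(0, topK)
    .map((file) => ({ file }));
}
```

Because the output is identical for every query, any case this driver passes does so by chance, which is what makes it useful as a floor for the other drivers.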

## 4. What will not be replicated

- **Agent A/B matrix** (`scripts/agent-eval/`, tmux/Claude runs, token/cost/tool-call behavior): explicitly out of scope for this task and depends on harness machinery AFT does not have here.
- **Graph edge metrics** (`edgeCount`, `edgeDensity`) for non-graph AFT drivers: AFT does not expose a CodeGraph-style returned subgraph for `aft_search`, AFT grep, ripgrep, or list-files. Reporting zero would be misleading, so those fields stay omitted.
- **Kind-filtered semantic retrieval:** CodeGraph can pass `kinds` into `searchNodes`; AFT's semantic search does not accept a kind filter today. Kinds are used only for metadata/diagnostics.
- **Head-to-head `aft_search` vs CodeGraph run on Elasticsearch:** not part of this commit. The harness supports `codegraph-original`, but the verification run for this task is against `opencode-aft` because that is the indexed local target.

## 5. Output format

Emit JSON close to CodeGraph's `EvalReport`:

```ts
{
  timestamp: string,
  codebasePath: string,
  codegraphSha: string,
  aftSha?: string,
  benchmark: "codegraph-replication",
  corpus: string,
  driver: string,
  summary: {
    total: number,
    passed: number,
    failed: number,
    skipped: number,
    meanRecall: number,
    meanMRR: number,
    meanPrecisionAt1: number,
    meanPrecisionAt5: number,
    meanPrecisionAt10: number,
    latencyMsMedian: number,
    latencyMsP95: number
  },
  results: EvalResult[]
}
```

`EvalResult` keeps CodeGraph-compatible fields (`caseId`, `pass`, `recall`, `mrr`, `foundSymbols`, `missedSymbols`, `latencyMs`) and adds ranked `results`, `precisionAtK`, `driver`, `api`, and optional `skipReason`. A markdown summary with the same aggregate table and per-case rows will be written beside the JSON so results can be pasted into docs/README.
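
As a rough sketch of how the count and mean fields in `summary` could be derived from per-case results: the `CaseResult` shape below is a trimmed stand-in for `EvalResult`, and excluding skipped cases from the means is an assumption of this example, not something the plan specifies.

```ts
// Trimmed stand-in for EvalResult; field names follow the plan where it gives them.
interface CaseResult {
  caseId: string;
  pass: boolean;
  recall: number;
  mrr: number;
  precisionAtK: { 1: number; 5: number; 10: number };
  skipReason?: string;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(results: CaseResult[]) {
  // Assumption: skipped cases count toward `total` but not toward pass/fail or the means.
  const scored = results.filter((r) => r.skipReason === undefined);
  return {
    total: results.length,
    passed: scored.filter((r) => r.pass).length,
    failed: scored.filter((r) => !r.pass).length,
    skipped: results.length - scored.length,
    meanRecall: mean(scored.map((r) => r.recall)),
    meanMRR: mean(scored.map((r) => r.mrr)),
    meanPrecisionAt1: mean(scored.map((r) => r.precisionAtK[1])),
    meanPrecisionAt5: mean(scored.map((r) => r.precisionAtK[5])),
    meanPrecisionAt10: mean(scored.map((r) => r.precisionAtK[10])),
  };
}
```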

## 6. code-review-graph patterns borrowed

I also read `/Users/ufukaltinok/Work/OSS/code-review-graph/code_review_graph/eval/` for methodology inspiration. This benchmark will still replicate CodeGraph first, but borrows these low-cost patterns where they improve reproducibility without adding dependencies on that project:

- **Pinned repo metadata shape:** corpus entries can carry repo name, URL, language, size category, and pinned commit fields, matching code-review-graph's `configs/*.yaml` discipline. v1 runs against `opencode-aft`, but this schema lets us add the reusable `fastapi`, `flask`, `gin`, `express`, `httpx`, and `code-review-graph` repos later without redesign.
- **Separated task axes:** keep CodeGraph's `searchNodes` vs `findRelevantContext` API labels, but also tag cases with categories analogous to code-review-graph's `search_queries` and `multi_hop_tasks` so later reports can split symbol lookup, context exploration, and navigation/multi-hop retrieval.
- **Deterministic reporting:** include corpus path, codebase SHA, AFT binary path, driver, top-k, and runs in every report. This mirrors code-review-graph's pinned-SHA/config-driven reproducibility while keeping the AFT harness simple.
- **Real wall-clock timing per dispatch:** code-review-graph times build/search stages directly; AFT will time the actual bridge or process dispatch around each query and aggregate median/p95.
- **Token accounting is deferred:** code-review-graph's tiktoken-calibrated token-efficiency axis is useful, but it belongs to a broader agent/context benchmark, not this no-LLM CodeGraph retrieval replication. v1 may record result payload sizes later, but will not mix token-efficiency scores into retrieval quality.
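
The per-dispatch timing and median/p95 aggregation mentioned above could look roughly like this. The helper names are hypothetical, and the nearest-rank percentile method is an assumption; the plan does not say which percentile definition the harness uses.

```ts
// Time one dispatch with a monotonic clock. `dispatch` stands in for the real
// bridge or process call, which is not shown here.
async function timeDispatch<T>(dispatch: () => Promise<T>): Promise<{ value: T; latencyMs: number }> {
  const start = performance.now();
  const value = await dispatch();
  return { value, latencyMs: performance.now() - start };
}

// Nearest-rank percentile over an ascending-sorted, non-empty sample list.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no latency samples");
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index] as number;
}

function summarizeLatency(samplesMs: number[]): { latencyMsMedian: number; latencyMsP95: number } {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    latencyMsMedian: percentile(sorted, 50),
    latencyMsP95: percentile(sorted, 95),
  };
}
```

With a single sample both values collapse to that sample, which matches the single-run behaviour described in section 1.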

Patterns intentionally not borrowed for v1: the six-axis suite (`impact_accuracy`, `multi_hop_retrieval`, `search_quality`, `token_efficiency`, `flow_completeness`, `build_performance`) and repository cloning/build orchestration. Those are valuable follow-on axes, but this deliverable stays focused on deterministic retrieval scoring against AFT's actual tool surface.

52 changes: 52 additions & 0 deletions .alfonso/release-notes/v0.30.0.md
@@ -0,0 +1,52 @@
# PTY support — agents can now drive real terminals

The headline of this release. `bash` now accepts `pty: true` (with `background: true`) to spawn commands inside a real PTY — every interactive program that needed a terminal is now reachable from an agent loop. Python and Node REPLs, `vim`, `htop`, `top`, `less`, `fzf`, build TUIs, even a nested `opencode` session — all work end-to-end.

![Yo dawg, I heard you like OpenCode so I put an OpenCode inside your OpenCode](assets/ocinoc.png)

Yes, really — `opencode` inside `opencode` works. PTY support means the agent can drive any TUI, including a full nested AFT-equipped OpenCode session, complete with sidebar, MCP servers, LSP status, and another agent answering prompts. Recursion all the way down.

## How it works

- **`bash({pty: true, background: true, ptyRows?, ptyCols?})`** — spawn a PTY-backed task. Defaults are 24×80; caps are 60×140 to keep `bash_status` snapshots bounded.
- **`bash_status({taskId, outputMode})`** — read the terminal state.
  - `"screen"`: vt100-rendered visible terminal (rows × cols characters)
  - `"raw"`: uncompressed bytes including ANSI escape sequences
  - `"both"`: separate fields for each
- **`bash_write({taskId, input})`** — send keystrokes. Input is either a verbatim string or an array mixing strings and `{key: "..."}` objects for atomic text + control key sequences:

```
bash_write({taskId, input: [
  "iHello",
  {key: "esc"},
  ":wq",
  {key: "enter"},
]})
```

Named keys cover `enter`/`return` (CR), `tab`, `space`, `backspace`, `esc`/`escape`, arrow keys, navigation keys, `delete`, `insert`, `f1`–`f12`, and `ctrl-a` through `ctrl-z`.
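
To make the mixed string/key input concrete, here is a sketch of how such an array could be flattened into the characters written to the PTY. It is not AFT's encoder: it covers only a subset of the named keys, the `backspace` mapping to DEL (0x7f) is an assumption, and arrow, navigation, and function keys are left out because their escape sequences depend on terminal mode.

```ts
type InputPart = string | { key: string };

// Subset of named keys with well-known single-character encodings.
const NAMED_KEYS: Record<string, string> = {
  enter: "\r",
  return: "\r",
  tab: "\t",
  space: " ",
  esc: "\x1b",
  escape: "\x1b",
  backspace: "\x7f", // assumption: DEL, as most modern terminals send
};

function encodeKey(name: string): string {
  const lower = name.toLowerCase();
  const named = NAMED_KEYS[lower];
  if (named !== undefined) return named;
  // ctrl-a through ctrl-z map to control codes 0x01 through 0x1a.
  const ctrl = /^ctrl-([a-z])$/.exec(lower);
  if (ctrl !== null && ctrl[1] !== undefined) {
    return String.fromCharCode(ctrl[1].charCodeAt(0) - 96);
  }
  throw new Error(`key not covered by this sketch: ${name}`);
}

// Strings pass through verbatim; {key} objects are expanded in place, so text
// and control keys arrive as one ordered sequence.
function encodeInput(input: string | InputPart[]): string {
  if (typeof input === "string") return input;
  return input.map((part) => (typeof part === "string" ? part : encodeKey(part.key))).join("");
}
```

Under these assumptions the `vim` example above encodes to `iHello`, ESC, `:wq`, CR.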

PTY tasks run on Unix via `portable-pty` and on Windows via ConPTY.

## bash_watch unifies pattern notifications and sync waits

New `bash_watch` tool replaces ad-hoc wait flags on `bash_status`. Two modes:

**Sync** — `bash_watch({taskId, pattern?, timeoutMs?})` blocks until the pattern matches, the task exits, or timeout. Without a pattern it waits for task exit. Returns the snapshot inline so the agent gets the result without a separate completion reminder.

**Async** — `bash_watch({taskId, pattern, background: true})` registers a pattern watcher and returns immediately. When the pattern matches mid-stream or the task exits, a single `[BG BASH NOTIFY]` reminder fires with the matched line. The default `[BACKGROUND BASH COMPLETED]` reminder is suppressed for that task.

`bash_status` is now a pure snapshot tool — wait/watch semantics live in `bash_watch`.
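
The sync-mode rules (pattern match, task exit, or timeout) can be modelled as a single decision function. The sketch below is an assumption-laden model, not the AFT implementation: the `TaskSnapshot` shape, treating `pattern` as a regular expression, and the precedence of a match over an exit are all choices made for illustration.

```ts
interface TaskSnapshot {
  lines: string[];          // output lines seen so far
  exitCode: number | null;  // null while the task is still running
}

type WatchOutcome =
  | { kind: "matched"; line: string }
  | { kind: "exited"; exitCode: number }
  | { kind: "timeout" }
  | { kind: "pending" };

// Decide whether a sync watch should return now or keep waiting.
// Pass a RegExp without the `g` flag: `test` is stateful on global patterns.
function evaluateWatch(
  snapshot: TaskSnapshot,
  pattern: RegExp | undefined,
  elapsedMs: number,
  timeoutMs: number | undefined,
): WatchOutcome {
  if (pattern !== undefined) {
    const line = snapshot.lines.find((l) => pattern.test(l));
    if (line !== undefined) return { kind: "matched", line };
  }
  // Without a pattern (or without a match) the watch resolves on task exit.
  if (snapshot.exitCode !== null) return { kind: "exited", exitCode: snapshot.exitCode };
  if (timeoutMs !== undefined && elapsedMs >= timeoutMs) return { kind: "timeout" };
  return { kind: "pending" };
}
```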

## URL fetches no longer hang on slow servers

`aft_outline` and `aft_zoom` URL targets now abort with a clear stall error after 15 seconds without a chunk. Previously a slow or stalled server could hang the bridge indefinitely while waiting on `reader.read()`.
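
One common way to add this kind of stall guard is to race each `read()` against a timer. The sketch below shows that general pattern under assumed names (`StallError`, `ChunkReader`, `readWithStallTimeout`); it is not taken from the AFT source, and a real caller would also cancel the reader after a stall.

```ts
class StallError extends Error {
  constructor(stallMs: number) {
    super(`no chunk received for ${stallMs} ms`);
    this.name = "StallError";
  }
}

interface ReadResult<T> {
  done: boolean;
  value?: T;
}

// Minimal reader shape, compatible with what a stream reader's read() returns.
interface ChunkReader<T> {
  read(): Promise<ReadResult<T>>;
}

// Resolve with the next chunk, or reject with StallError if none arrives in time.
function readWithStallTimeout<T>(reader: ChunkReader<T>, stallMs: number): Promise<ReadResult<T>> {
  return new Promise<ReadResult<T>>((resolve, reject) => {
    const timer = setTimeout(() => reject(new StallError(stallMs)), stallMs);
    reader.read().then(
      (result) => {
        clearTimeout(timer); // a chunk arrived, so the stall clock is cancelled
        resolve(result);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

Calling `readWithStallTimeout(reader, 15_000)` in the read loop restarts the clock for every chunk, so a slow but steadily progressing download is not cut off.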

## Other

- `bash` schema rejects `pty: true` without `background: true` and `ptyRows`/`ptyCols` without `pty: true`.
- OpenCode subagent sessions silently convert `background: true` to foreground bash unless `bash.subagent_background = true` in config.
- `bash_status` and `bash_kill` are always registered when `bash` is registered (no longer gated on `experimental.bash.background`).
- Background bash completion delivery now persists `completion_delivered` across plugin restarts, so previously-delivered tasks no longer replay as fresh reminders after restart.
- Async `bash_watch` exit notifications render as `task X exited` instead of the prior `matched "exited (exit 0)"` framing.
- The release script blocks minor-version releases when the in-plugin `ANNOUNCEMENT_VERSION` is stale relative to the release tag.
38 changes: 38 additions & 0 deletions .alfonso/release-notes/v0.30.1.md
@@ -0,0 +1,38 @@
# v0.30.1

Patch release. Three classes of user-facing fixes: bash PTY parameter handling, LSP failure diagnostics, and Windows plugin auto-update.

## Bash — PTY parameter handling

Agents that defensively included `ptyRows` or `ptyCols` on regular (non-PTY) bash calls were hitting a strict validation error. Some models tried to "fix" it by adding `pty: true` to non-interactive commands, which auto-promoted them to background and broke inline output.

- `ptyRows` and `ptyCols` are now soft-ignored when `pty` is unset or false. The dimensions are only applied when a PTY is actually requested.
- `pty: true` now implies `background: true`. The two flags no longer have to be set together.
- Out-of-range or non-integer values return a clean error naming the allowed bounds (e.g. `ptyRows must be an integer between 1 and 60`).
- Tool descriptions for `ptyRows`/`ptyCols` clarify they apply only when `pty: true`.
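
Taken together, the rules above amount to a small normalization step. The following sketch assumes a hypothetical `resolveBashArgs` helper and reuses the 24×80 defaults and 60×140 caps from the v0.30.0 notes; the lower bound of 1 for `ptyCols` is assumed by analogy with the `ptyRows` error message.

```ts
interface BashArgs {
  command: string;
  pty?: boolean;
  background?: boolean;
  ptyRows?: number;
  ptyCols?: number;
}

interface ResolvedBash {
  pty: boolean;
  background: boolean;
  rows?: number;
  cols?: number;
}

function boundedInt(name: string, value: number, max: number): number {
  if (!Number.isInteger(value) || value < 1 || value > max) {
    throw new Error(`${name} must be an integer between 1 and ${max}`);
  }
  return value;
}

function resolveBashArgs(args: BashArgs): ResolvedBash {
  // Without a PTY request, dimensions are soft-ignored, not validated.
  if (args.pty !== true) {
    return { pty: false, background: args.background === true };
  }
  // pty: true implies background: true, whatever the caller passed.
  return {
    pty: true,
    background: true,
    rows: boundedInt("ptyRows", args.ptyRows ?? 24, 60),
    cols: boundedInt("ptyCols", args.ptyCols ?? 80, 140),
  };
}
```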

## Plugin tool schemas

All optional numeric parameters across the OpenCode plugin (bash, read, aft_search, aft_navigate, aft_zoom, aft_outline, refactor, lsp_diagnostics) now use a JSON-Schema-representable bounded integer schema. Empty sentinels (null, empty string, zero) are rejected at validation with a clear message instead of silently being coerced or — as in an earlier internal build — causing the plugin to fail to load.

A schema-conversion regression test now covers every registered tool, so any future change that introduces an unrepresentable shape will fail before release.

## LSP — failure visibility

When an LSP server fails to start, AFT's response now surfaces stderr output captured from the child process. Previously, broken language-server shims (such as a `typescript-language-server` whose `cli.mjs` was missing) returned opaque `spawn_failed` errors without context.

- Stderr from LSP children is captured in a bounded ring buffer and included in failure responses.
- When stderr contains `MODULE_NOT_FOUND`, the response adds a hint pointing at the likely fix (reinstall the package-manager binary, or check the `lsp.servers.<name>.binary` path).
- Clients that crash after a successful initialize are now marked as failed so subsequent file requests stop re-issuing pulls against the dead pipe.
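
The stderr capture and hint logic can be illustrated with a bounded tail buffer, a simple stand-in for a ring buffer that keeps only the most recent output. Class, function, and field names here are hypothetical, as is the exact hint wording; only the `spawn_failed` code, the `MODULE_NOT_FOUND` trigger, and the `lsp.servers.<name>.binary` path come from the notes above.

```ts
// Keeps at most `maxChars` characters of the most recent stderr output.
class StderrTail {
  private buffer = "";
  private readonly maxChars: number;

  constructor(maxChars: number) {
    if (maxChars <= 0) throw new Error("maxChars must be positive");
    this.maxChars = maxChars;
  }

  push(chunk: string): void {
    // Append, then drop the oldest characters beyond the bound.
    this.buffer = (this.buffer + chunk).slice(-this.maxChars);
  }

  text(): string {
    return this.buffer;
  }
}

interface SpawnFailure {
  code: "spawn_failed";
  stderr: string;
  hint?: string;
}

function describeSpawnFailure(tail: StderrTail, serverName: string): SpawnFailure {
  const stderr = tail.text();
  const failure: SpawnFailure = { code: "spawn_failed", stderr };
  if (stderr.includes("MODULE_NOT_FOUND")) {
    failure.hint = `Reinstall the language server package, or check lsp.servers.${serverName}.binary`;
  }
  return failure;
}
```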

## Auto-update on Windows

Plugin self-update used `spawn("npm")` directly, which fails on Windows because the binary is `npm.cmd`. The auto-update path now resolves `npm.cmd` on Windows (same fix shape as the v0.28.2 LSP install correction).

- `npm install` stderr is captured on failure for diagnostic visibility.
- `--ignore-scripts` is now passed to the install (matches the LSP install hardening).
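
The platform-dependent command resolution is small enough to show directly. This is a sketch with hypothetical helper names; the platform string is passed in as a parameter (in Node it would come from `process.platform`) so the logic can be exercised on any OS.

```ts
// On Windows the npm launcher is a .cmd shim, so spawning a bare "npm" fails.
function npmCommand(platform: string): string {
  return platform === "win32" ? "npm.cmd" : "npm";
}

// Install arguments with lifecycle scripts disabled.
function npmInstallArgs(packageSpec: string): string[] {
  return ["install", "--ignore-scripts", packageSpec];
}
```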

## Other

- `aft_outline`/`aft_zoom` URL fetch keeps the 15-second body-stall safety net that landed in v0.30.0.
- Subagent sessions continue to silently convert `background: true` bash to foreground (introduced in v0.30.0), because subagents have no completion-reminder mechanism.