poc: WASM/wazero tree-sitter backend (speed + stability vs cgo PR #80) #81
Draft
dvcdsys wants to merge 1 commit into
Alternative to feat/chunker-cgo-treesitter: the official tree-sitter C runtime + TypeScript grammar compiled to a standalone wasm32-wasi reactor module (build.sh, via zig cc) and driven from Go through wazero. No cgo, no JS, no third-party parser. Only the wazero host (wasmts.go) is bespoke; the parser is unmodified upstream C. wasm_store.c is gated by TREE_SITTER_FEATURE_WASM (we don't define it), so the stock amalgamation compiles to wasi with no stubs.

Measured on the same 852-file vscode TypeScript corpus (full-tree walk):

| backend | wall | files/s | ERROR trees | editorOptions.ts |
| --- | --- | --- | --- | --- |
| gotreesitter (pure-Go) | 13.83s | 62 | 13 | 8.77s -> ERROR |
| WASM (wazero, pure-Go) | ~2.5s | ~330 | 0 | 49ms |
| cgo (native) | 1.26s | 675 | 0 | 17ms |

- WASM is ~2x slower than cgo, ~5x faster than gotreesitter, and correct (0 errors).
- Overhead is the per-node host<->guest call boundary (~3 calls/node x 2.68M nodes), not memory: slot-pooling barely moved it. A batched "serialize subtree" export would close most of the gap (future work).
- Stability: tree-sitter is robust on adversarial input under both backends; WASM additionally CONTAINS faults (resource/guest trap -> recoverable Go error, host alive) where cgo would SIGSEGV the whole process. Insurance vs unknown C bugs, not a fix for an observed crash.

Trade-off vs cgo: ~2x parse cost (largely invisible end-to-end since embeddings dominate) in exchange for CGO_ENABLED=0 builds, crash-isolation, and a likely smaller binary; the cost is the engineering effort to build/bundle all 31 grammars and flesh out the node API. README.md has the full comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
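Draft / PoC for comparison, not for merge. Alternative to the cgo backend in #80, to decide direction.

Official tree-sitter C runtime + TypeScript grammar → standalone wasm32-wasi module (zig cc), driven from Go via wazero. No cgo, no JS, no third-party parser; only the wazero host (poc/wasm-treesitter/wasmts.go) is ours.

Speed: same 852-file vscode TS corpus, full-tree walk

| backend | wall | files/s | ERROR trees | editorOptions.ts |
| --- | --- | --- | --- | --- |
| gotreesitter (pure-Go) | 13.83s | 62 | 13 | 8.77s → ERROR |
| WASM (wazero, pure-Go) | ~2.5s | ~330 | 0 | 49ms |
| cgo (native) | 1.26s | 675 | 0 | 17ms |

WASM is ~2× slower than cgo and ~5× faster than gotreesitter, with correct output (0 ERROR trees). The overhead is the per-node host↔guest call boundary (mitigable with a batched subtree export).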
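To make the batched subtree export idea concrete, here is a minimal host-side sketch of decoding a whole subtree from one buffer instead of making several boundary calls per node. The wire format (12-byte little-endian records in pre-order), the `Node` type, and the `decodeSubtree` and `encodeRecord` helpers are all hypothetical and not part of this PoC; in a real host the buffer would be copied out of guest memory after a single exported call.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// recordSize is the size of one node record in a HYPOTHETICAL wire format:
// symbol(u16) childCount(u16) startByte(u32) endByte(u32), little-endian,
// with nodes laid out in pre-order. This format is an assumption.
const recordSize = 12

// Node is an illustrative host-side node, not the PoC's node API.
type Node struct {
	Symbol    uint16
	StartByte uint32
	EndByte   uint32
	Children  []*Node
}

var errTruncated = errors.New("truncated subtree buffer")

// decodeSubtree rebuilds a tree from one flat buffer and returns the root
// plus the number of nodes decoded. It recurses per child, so a production
// version would want an explicit stack for very deep trees.
func decodeSubtree(buf []byte) (*Node, int, error) {
	pos, count := 0, 0
	var read func() (*Node, error)
	read = func() (*Node, error) {
		if len(buf)-pos < recordSize {
			return nil, errTruncated
		}
		rec := buf[pos : pos+recordSize]
		pos += recordSize
		count++
		n := &Node{
			Symbol:    binary.LittleEndian.Uint16(rec[0:2]),
			StartByte: binary.LittleEndian.Uint32(rec[4:8]),
			EndByte:   binary.LittleEndian.Uint32(rec[8:12]),
		}
		kids := int(binary.LittleEndian.Uint16(rec[2:4]))
		for i := 0; i < kids; i++ {
			child, err := read()
			if err != nil {
				return nil, err
			}
			n.Children = append(n.Children, child)
		}
		return n, nil
	}
	root, err := read()
	if err != nil {
		return nil, 0, err
	}
	return root, count, nil
}

// encodeRecord appends one record; it stands in for what a guest-side
// export would write into linear memory.
func encodeRecord(buf []byte, symbol, childCount uint16, start, end uint32) []byte {
	buf = binary.LittleEndian.AppendUint16(buf, symbol)
	buf = binary.LittleEndian.AppendUint16(buf, childCount)
	buf = binary.LittleEndian.AppendUint32(buf, start)
	return binary.LittleEndian.AppendUint32(buf, end)
}

func main() {
	var buf []byte
	buf = encodeRecord(buf, 1, 2, 0, 10) // root with two children
	buf = encodeRecord(buf, 2, 0, 0, 4)
	buf = encodeRecord(buf, 3, 0, 5, 10)

	root, n, err := decodeSubtree(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(n, len(root.Children), root.Children[1].StartByte) // prints "3 2 5"
}
```

With a format along these lines, the walk would cost one boundary crossing plus one memory copy per subtree rather than a few crossings per node; whether that closes most of the gap to cgo is untested here.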
Stability
tree-sitter is robust on adversarial input under both backends. WASM additionally contains guest faults (resource exhaustion or trap → recoverable Go error, host stays alive), where the same fault under cgo would SIGSEGV the whole process. This is insurance against unknown C bugs, not a fix for an observed crash.
Decision framing
~2× parse cost (largely invisible end-to-end, since embeddings dominate) in exchange for CGO_ENABLED=0 builds, crash isolation, and a likely smaller binary. Cost: engineering effort to build/bundle all 31 grammars and flesh out the node API. Full write-up in poc/wasm-treesitter/README.md.

🤖 Generated with Claude Code