fix: indexing OOM on large repos + resume/reliability + faster embedding by dvcdsys · Pull Request #78 · dvcdsys/code-index

dvcdsys · 2026-06-06T21:28:39Z

Summary

Indexing a large external repo (reproduced with github.com/microsoft/vscode) drove cix-server RSS to 9–100 GB and stalled at 0 files processed, taking the host down with it. This PR fixes the OOM and bundles the related indexing reliability/throughput work — including a follow-up round of fixes for a session-leak + embedding-backpressure deadlock surfaced while indexing vscode under the Voyage provider.

Root cause (the OOM)

Heap profiling (-inuse_space) showed the live heap was almost entirely gotreesitter parser tables, node arenas and GLR scratch. Parse's error-recovery/snippet machinery calls NewParser repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for TypeScript) every call; the copies piled up in process-global parser pools faster than the GC could reclaim them. The isolated chunker never reproduced it (~1.8 GB) — only the live server's runtime did.

The pinned gotreesitter predated the upstream fixes for exactly this. Bumping to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Root cause (the stuck-`error` deadlock)

While indexing vscode under Voyage, a single transient "embedding queue saturated, retry after 5s" error (ErrBusy) failed the whole run. Two compounding defects turned a retryable backpressure signal into a multi-minute outage:

Session leak — repoindexer.IndexDir released the indexer session only on success (FinishIndexing) or explicit force-stop. Any mid-run error left the session active until the idle-timeout reaped it, so every queue retry and every manual Sync bounced off ErrSessionConflict ("session already active") in the meantime. The failure that triggered the retry also blocked it.
ErrBusy treated as fatal — the HTTP/CLI path honours ErrBusy via 503 + Retry-After, but the in-process driver propagated it as a permanent walk failure with no retry.

Fixed by releasing the session on abort (FailIndexing) and by riding out ErrBusy with hinted backoff in-process. A related Voyage planning-headroom fix removes the bisect noise that aggravated the saturation.

What's in here (commit-by-commit)

fix(deps) — bump gotreesitter v0.20.2 (the OOM fix).
feat(config) — dashboard-tunable embed batch size (index_embed_batch_chunks).
feat(indexer) — parallel cross-file embedding pipeline; idle-based 10-min session TTL (no more silent wall-clock kill of long indexes); resume-on-restart via stored file SHAs + ForceFull; orphaned-job recovery + ReconcileStuckProjects; cancellable jobs context for clean shutdown.
feat(server) — opt-in localhost pprof endpoint (CIX_PPROF_ADDR).
fix(indexer) — release session when an in-process run aborts mid-walk. New Service.FailIndexing(projectPath, runID) removes the active session immediately and marks the run failed, without flipping projects.status (repojobs owns the terminal state) or setting a user-cancel tombstone. IndexDir defers it on every abort path; idempotent on success / already-removed sessions. Retries now re-enter via reconcile-resume instead of ErrSessionConflict.
fix(indexer) — ride out embedding-queue backpressure instead of failing. flush() now backs off the ErrBusy Retry-After hint (capped 30s, total bounded 5m) and retries the same batch; non-busy errors still return at once. Voyage defaultMaxTokensPerBatch 100K → 80K: estimateTokens divides bytes by 2 but dense code runs ~1.4 bytes/token (a ~43% undercount), so the old 17%-headroom target shipped estimated-95K batches as real-122K POSTs that tripped Voyage's 120K cap and forced adaptive bisect+retry on every hot file. 80K keeps the worst case (~114K) under the cap.

Verification

vscode index: ~2.5 GB stable (was 9–100 GB / stalled).
Full server test suite: 40 packages pass, 0 failures.
New regression tests: TestFailIndexing_ReleasesSessionForRetry (orphaned session released for retry, run marked failed, idempotent) and TestIndexDir_RetriesEmbeddingQueueSaturation (rides out an ErrBusy-then-OK embedder and still indexes the file).

⚠️ Notes for reviewer / deploy

Re-index required — the newer grammars change chunk output for some languages.
Known caveat: gotreesitter ≥ v0.19.0 has a C-grammar regression where an enum corrupts the surrounding parse, so functions in such files degrade from function chunks to generic module chunks (content still indexed/searchable). Documented + tracked via skipped TestChunkFile_C_EnumRegression; worth filing upstream.

🤖 Generated with Claude Code

Indexing a large repo (e.g. github.com/microsoft/vscode) drove cix-server RSS to 9-100 GB and stalled at 0 files processed. Profiling (inuse_space) showed the live heap was almost entirely gotreesitter parser tables, arenas and GLR scratch: Parse's error-recovery/snippet machinery calls NewParser repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for TypeScript) every call; the copies accumulated in process-global parser pools faster than the GC could reclaim them.

The old pinned version predates the upstream fixes for exactly this. Bumping to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Known caveat: gotreesitter >= v0.19.0 has a C-grammar regression where an `enum` corrupts the surrounding parse, so functions in such files degrade from `function` chunks to generic `module` chunks (content still indexed). Documented and tracked via skipped TestChunkFile_C_EnumRegression.

Re-index required: the newer grammars change chunk output for some languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add index_embed_batch_chunks to the runtime config layer (DB -> env -> recommended) so operators can tune cross-file embedding batch size from the dashboard Advanced section instead of only via CIX_INDEX_EMBED_BATCH_CHUNKS.

Wires the field through config, runtimecfg (Snapshot/Patch/Recommended/Get/Set/ApplyTo), the runtime_settings table (schema + idempotent migration), the admin runtime-config API (payloads + validation), the OpenAPI spec + generated stubs, and the dashboard UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
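The DB -> env -> recommended precedence can be illustrated with a small resolver. This is a sketch under assumptions: `resolveBatchChunks` is a hypothetical helper, the recommended value of 64 is a placeholder rather than the project's real default, and treating non-positive values as unset is a guess about validation.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveBatchChunks picks the first usable value in the order
// DB override, environment variable, recommended default.
// dbValue is nil when no row is stored; envValue is "" when unset.
func resolveBatchChunks(dbValue *int, envValue string, recommended int) int {
	if dbValue != nil && *dbValue > 0 {
		return *dbValue
	}
	if n, err := strconv.Atoi(envValue); err == nil && n > 0 {
		return n
	}
	return recommended
}

func main() {
	const recommended = 64 // placeholder default, not taken from the project
	fromDB := 128
	fmt.Println(resolveBatchChunks(&fromDB, "32", recommended)) // DB wins
	fmt.Println(resolveBatchChunks(nil, "32", recommended))     // env wins
	fmt.Println(resolveBatchChunks(nil, "", recommended))       // fallback
}
```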

Overhauls the indexing pipeline for correctness on restart and throughput on large repos:

- Parallel, cross-file-batched embedding: ProcessFilesStreaming splits into PREPARE (sequential chunk) / EMBED (parallel worker pool, cross-file batched via planEmbedGroups) / WRITE (serial vector-store + per-file DB tx), driven by a tunable concurrency + batch size (SetEmbedTuningLookup).
- Idle-based session TTL (10 min): sessions are reaped only after no file has been processed for the TTL window (measured against lastActivity, bumped per file), so an actively-progressing multi-hour index is never aborted. Replaces the old wall-clock cap that silently killed long indexes.
- Resume on restart: repoindexer reconciles against stored file SHAs and skips unchanged files instead of re-scanning from zero. IndexDir gains an explicit `wipe` flag; ClonePayload.ForceFull drives the dashboard "full reindex".
- Honest stuck-state recovery: recoverOrphanedJobs requeues 'running' jobs on boot; ReconcileStuckProjects flips externally-driven projects left in 'indexing' with no job to 'error' so the operator can Sync. A gone-tombstone (ConsumeGoneReason) lets the job layer distinguish user-cancel from failure.
- Clean shutdown: jobs run under a cancellable context cancelled before Stop, ending the SQLITE_INTERRUPT ("interrupted (9)") log flood and Killed-on-hang.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a localhost-only net/http/pprof listener, started only when CIX_PPROF_ADDR is set (off by default). Used to diagnose the indexing OOM (heap inuse_space profiling pinpointed gotreesitter as the allocator); kept as a standing diagnostic for future memory/CPU investigations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
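An opt-in pprof listener of this kind might look like the sketch below. Only the CIX_PPROF_ADDR variable name comes from the PR; the `pprofMux` and `isLoopback` helpers, and the choice to reject non-loopback addresses outright, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

// pprofMux wires the pprof handlers onto a dedicated mux that is served only
// by the diagnostic listener. pprof.Index also serves the named profiles,
// such as /debug/pprof/heap.
func pprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// isLoopback reports whether addr (host:port) binds to a loopback host.
func isLoopback(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	addr := os.Getenv("CIX_PPROF_ADDR")
	if addr == "" {
		fmt.Println("pprof disabled")
		return
	}
	if !isLoopback(addr) {
		fmt.Println("refusing non-loopback pprof address:", addr)
		return
	}
	// Blocks until the listener fails; a real server would run this in a goroutine.
	fmt.Println(http.ListenAndServe(addr, pprofMux()))
}
```

With the listener up, `go tool pprof -inuse_space http://<addr>/debug/pprof/heap` fetches the kind of heap profile used in the diagnosis above.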

A mid-run failure in repoindexer.IndexDir (e.g. a transient "embedding queue saturated" backpressure error, which is by-design retryable) left the indexer session status="active" until ttlCleanup reaped it after the idle timeout. In that window every queue retry and every manual Sync called BeginIndexing, found the orphaned session, and failed with ErrSessionConflict ("session already active"). The failure that triggered the retry also blocked it, stranding the project in 'error' for minutes until the idle reap.

Add Service.FailIndexing(projectPath, runID): releases the in-memory session immediately and marks the run 'failed', without flipping projects.status (repojobs owns the terminal state) and without a "user-cancel" tombstone (so a later ErrNoSession still reads as an involuntary loss → retry/resume). IndexDir now defers it on every abort path, idempotently no-opping on success and on already-removed sessions (force-stop / idle reap). The retry then re-enters via reconcile-resume instead of bouncing off the conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Two changes that stop a transient "embedding queue saturated" from killing a server-side repo index.

1. repoindexer: treat ErrBusy as retryable backpressure. ErrBusy ("retry after Ns") is the queue asking the caller to slow down. The HTTP/CLI path honours it via 503 + Retry-After, but the in-process driver propagated it as a permanent walk failure. Combined with the (now-fixed) session leak, one transient saturation stranded the project in 'error' for minutes. flush() now backs off the hinted delay (capped at 30s, total bounded at 5m) and retries the SAME batch; ProcessFiles bumps lastActivity in its prepare stage so the waits don't trip the idle reaper. Any non-busy error still returns at once.

2. voyage: restore real planning headroom (defaultMaxTokensPerBatch 100K → 80K). estimateTokens divides bytes by 2, but dense code runs as hot as ~1.4 bytes/token, so a POST can carry ~1.43x the estimated tokens. The old 100K target (17% headroom under Voyage's 120K cap) couldn't cover that ~43% undercount, so estimated-95K batches shipped as real-122K POSTs and tripped the cap, forcing adaptive bisect+retry on every hot file. 80K keeps the worst case (~114K) under the cap; bisecting drops back to a true outlier safety net. Cost: ~20% smaller batches on prose.

Tests: IndexDir rides out an ErrBusy-then-OK embedder and still indexes the file; voyage TPM/burst comments updated for the 80K budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
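The headroom arithmetic can be checked directly. The sketch below uses only the figures quoted above (2 bytes/token assumed by the estimator, ~1.4 bytes/token for dense code, a 120K cap); `worstCaseTokens` is a hypothetical helper, not a function from the codebase.

```go
package main

import "fmt"

// worstCaseTokens scales an estimated token count by the ratio between the
// estimator's assumed bytes/token and the real density of the content.
func worstCaseTokens(estimated, assumedBytesPerToken, realBytesPerToken float64) float64 {
	return estimated * assumedBytesPerToken / realBytesPerToken
}

func main() {
	const capTokens = 120_000.0
	for _, budget := range []float64{100_000, 80_000} {
		actual := worstCaseTokens(budget, 2.0, 1.4)
		fmt.Printf("budget %.0f -> worst case %.0f, under cap: %v\n",
			budget, actual, actual <= capTokens)
	}
}
```

A full 100K-estimate batch of dense code can therefore reach roughly 143K real tokens, while a full 80K-estimate batch tops out near 114K, which is the margin the commit relies on.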

dvcdsys · 2026-06-06T21:56:41Z

Pushed two follow-up commits (963df02, d507e17) fixing a session-leak + embedding-backpressure deadlock found while indexing vscode under Voyage:

Session leak — a mid-run abort (e.g. transient embedding queue saturated) left the indexer session active until the idle-timeout, so every retry/Sync bounced off ErrSessionConflict. IndexDir now releases the session on abort via FailIndexing.
ErrBusy was fatal in-process — flush() now backs off the Retry-After hint and retries the batch (the HTTP/CLI path already did this via 503).
Voyage headroom — defaultMaxTokensPerBatch 100K → 80K so the byte→token estimator's ~43% undercount on dense code stops tripping Voyage's 120K cap and forcing bisect+retry.

Two new regression tests; full suite still 40 pkgs / 0 failures. Body updated with the details.

Two comment-only fixes surfaced in review of the consolidated PR:

- gitrepos.ReindexProject: the comment claimed clearing indexed_sha routes the clone handler "through the full-reindex branch". Since the reconcile rework, an empty IndexedSHA routes to reconcile (resume); what actually forces a full wipe is ClonePayload.ForceFull, checked first in handleClone's mode switch. Clarify that the indexed_sha clear is purely for dashboard "uncommitted" immediacy, not mode determination.
- indexer.embedPrepared: document that cross-file batching couples a group's fate on a NON-fatal embed error (one file's failure skips the whole group this pass), why it's acceptable (reconcile retries skipped files), and the residual edge (a persistently-failing file can poison its grouped neighbours).

No behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dvcdsys and others added 6 commits June 6, 2026 22:26

dvcdsys merged commit a793202 into develop Jun 6, 2026
1 check passed

dvcdsys deleted the fix/indexing-oom-resume-speed branch June 6, 2026 22:12

This was referenced Jun 6, 2026

release: indexing OOM fix + resume/reliability + faster embedding #79

Merged

feat(chunker): migrate to official tree-sitter via cgo #80

Open

dvcdsys merged 7 commits into develop from fix/indexing-oom-resume-speed
