release: indexing OOM fix + resume/reliability + faster embedding by dvcdsys · Pull Request #79 · dvcdsys/code-index

dvcdsys · 2026-06-06T22:15:12Z

Promote develop → main for a server release. Bundles PR #78 (the only delta over main).

What's in here

fix(deps): bump gotreesitter v0.20.2 — fixes the 9–100 GB RSS OOM on large external repos (e.g. vscode); index now stable at ~2.5 GB.
feat(indexer): parallel cross-file embedding pipeline (PREPARE → EMBED → WRITE); idle-based 10-min session TTL (no wall-clock kill of long indexes); resume-on-restart via stored file SHAs + ForceFull; orphaned-job recovery + ReconcileStuckProjects; cancellable jobs context for clean shutdown.
feat(config): dashboard-tunable embed batch size (index_embed_batch_chunks), wired through config → DB (migration 15) → runtimecfg → admin API → OpenAPI → dashboard.
feat(server): opt-in localhost pprof endpoint (CIX_PPROF_ADDR, off by default).
fix(indexer): release session on mid-walk abort (FailIndexing) so retries re-enter via reconcile-resume instead of ErrSessionConflict; ride out embedding-queue ErrBusy backpressure with bounded backoff; Voyage planning headroom 100K→80K to stop bisect churn on dense code.
docs: review follow-ups — corrected force-full routing comment; documented cross-file-batch group coupling.

Verification

go build ./..., go vet, go test -race on all changed packages — green.
make openapi-check — generated stubs in sync.
Full server suite: 40 packages pass (per PR #78, "fix: indexing OOM on large repos + resume/reliability + faster embedding").

⚠️ Deploy notes

Re-index required — newer grammars change chunk output for some languages.
Known caveat: gotreesitter ≥ v0.19.0 C-enum regression degrades function chunks to module in C files containing an enum (content still searchable). Tracked via skipped TestChunkFile_C_EnumRegression.

🤖 Generated with Claude Code

Indexing a large repo (e.g. github.com/microsoft/vscode) drove cix-server RSS to 9-100 GB and stalled at 0 files processed.

Profiling (inuse_space) showed the live heap was almost entirely gotreesitter parser tables, arenas and GLR scratch: Parse's error-recovery/snippet machinery calls NewParser repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for TypeScript) on every call; the copies accumulated in process-global parser pools faster than the GC could reclaim them. The old pinned version predates the upstream fixes for exactly this.

Bumping to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Known caveat: gotreesitter >= v0.19.0 has a C-grammar regression where an `enum` corrupts the surrounding parse, so functions in such files degrade from `function` chunks to generic `module` chunks (content still indexed). Documented and tracked via skipped TestChunkFile_C_EnumRegression.

Re-index required: the newer grammars change chunk output for some languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add index_embed_batch_chunks to the runtime config layer (DB -> env -> recommended) so operators can tune cross-file embedding batch size from the dashboard Advanced section instead of only via CIX_INDEX_EMBED_BATCH_CHUNKS.

Wires the field through config, runtimecfg (Snapshot/Patch/Recommended/Get/Set/ApplyTo), the runtime_settings table (schema + idempotent migration), the admin runtime-config API (payloads + validation), the OpenAPI spec + generated stubs, and the dashboard UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
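The DB -> env -> recommended precedence described above can be sketched as follows. This is a minimal illustration, not the runtimecfg code: `resolveEmbedBatchChunks`, the `*int` representation of the stored value, and the placeholder default of 64 are all assumptions; only the environment variable name comes from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// recommendedEmbedBatchChunks is a placeholder for illustration only;
// the real recommended value is not stated in this PR.
const recommendedEmbedBatchChunks = 64

// resolveEmbedBatchChunks applies the DB -> env -> recommended precedence.
// dbValue is nil when nothing has been saved from the dashboard (assumed shape).
func resolveEmbedBatchChunks(dbValue *int, getenv func(string) string) int {
	// 1. A value stored in the DB wins.
	if dbValue != nil && *dbValue > 0 {
		return *dbValue
	}
	// 2. Otherwise fall back to the environment variable, if it parses.
	if raw := getenv("CIX_INDEX_EMBED_BATCH_CHUNKS"); raw != "" {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n
		}
	}
	// 3. Otherwise use the recommended default.
	return recommendedEmbedBatchChunks
}

func main() {
	// With no DB value, the result depends on the process environment.
	fmt.Println(resolveEmbedBatchChunks(nil, os.Getenv))
}
```

Passing `getenv` as a parameter keeps the precedence logic testable without mutating the process environment.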

Overhauls the indexing pipeline for correctness on restart and throughput on large repos:

- Parallel, cross-file-batched embedding: ProcessFilesStreaming splits into PREPARE (sequential chunk) / EMBED (parallel worker pool, cross-file batched via planEmbedGroups) / WRITE (serial vector-store + per-file DB tx), driven by a tunable concurrency + batch size (SetEmbedTuningLookup).
- Idle-based session TTL (10 min): sessions are reaped only after no file has been processed for the TTL window (measured against lastActivity, bumped per file), so an actively-progressing multi-hour index is never aborted. Replaces the old wall-clock cap that silently killed long indexes.
- Resume on restart: repoindexer reconciles against stored file SHAs and skips unchanged files instead of re-scanning from zero. IndexDir gains an explicit `wipe` flag; ClonePayload.ForceFull drives the dashboard "full reindex".
- Honest stuck-state recovery: recoverOrphanedJobs requeues 'running' jobs on boot; ReconcileStuckProjects flips externally-driven projects left in 'indexing' with no job to 'error' so the operator can Sync. A gone-tombstone (ConsumeGoneReason) lets the job layer distinguish user-cancel from failure.
- Clean shutdown: jobs run under a cancellable context cancelled before Stop, ending the SQLITE_INTERRUPT ("interrupted (9)") log flood and Killed-on-hang.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a localhost-only net/http/pprof listener, started only when CIX_PPROF_ADDR is set (off by default). Used to diagnose the indexing OOM (heap inuse_space profiling pinpointed gotreesitter as the allocator); kept as a standing diagnostic for future memory/CPU investigations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
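One way to wire an opt-in, loopback-only pprof listener is sketched below. How cix-server actually enforces "localhost-only" is not shown in this PR, so `loopbackOnly` and `startPprof` are hypothetical helpers; the `net/http/pprof` handlers and the CIX_PPROF_ADDR variable are the only parts taken from the standard library and the commit message respectively.

```go
package main

import (
	"errors"
	"log"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

// loopbackOnly rejects listen addresses that are not bound to loopback
// (assumed policy; an empty host such as ":6060" is rejected too).
func loopbackOnly(addr string) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return err
	}
	if host == "localhost" {
		return nil
	}
	ip := net.ParseIP(host)
	if ip == nil || !ip.IsLoopback() {
		return errors.New("pprof address must be loopback: " + addr)
	}
	return nil
}

// startPprof returns (nil, nil) when addr is empty, i.e. off by default.
func startPprof(addr string) (*http.Server, error) {
	if addr == "" {
		return nil, nil
	}
	if err := loopbackOnly(addr); err != nil {
		return nil, err
	}
	// A dedicated mux keeps the profiling routes off the main API server.
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	srv := &http.Server{Handler: mux}
	go func() {
		if err := srv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("pprof listener stopped: %v", err)
		}
	}()
	return srv, nil
}

func main() {
	if _, err := startPprof(os.Getenv("CIX_PPROF_ADDR")); err != nil {
		log.Fatal(err)
	}
}
```

With such a listener on, say, 127.0.0.1:6060 (an example address), a live-heap profile can be pulled with `go tool pprof -sample_index=inuse_space http://127.0.0.1:6060/debug/pprof/heap`.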

A mid-run failure in repoindexer.IndexDir (e.g. a transient "embedding queue saturated" backpressure error, which is by-design retryable) left the indexer session status="active" until ttlCleanup reaped it after the idle timeout. In that window every queue retry and every manual Sync called BeginIndexing, found the orphaned session, and failed with ErrSessionConflict ("session already active"). The failure that triggered the retry also blocked it, stranding the project in 'error' for minutes until the idle reap.

Add Service.FailIndexing(projectPath, runID): releases the in-memory session immediately and marks the run 'failed', without flipping projects.status (repojobs owns the terminal state) and without a "user-cancel" tombstone (so a later ErrNoSession still reads as an involuntary loss → retry/resume). IndexDir now defers it on every abort path, idempotently no-opping on success and on already-removed sessions (force-stop / idle reap). The retry then re-enters via reconcile-resume instead of bouncing off the conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Two changes that stop a transient "embedding queue saturated" from killing a server-side repo index.

1. repoindexer: treat ErrBusy as retryable backpressure. ErrBusy ("retry after Ns") is the queue asking the caller to slow down. The HTTP/CLI path honours it via 503 + Retry-After, but the in-process driver propagated it as a permanent walk failure. Combined with the (now-fixed) session leak, one transient saturation stranded the project in 'error' for minutes. flush() now backs off the hinted delay (capped at 30s, total bounded at 5m) and retries the SAME batch; ProcessFiles bumps lastActivity in its prepare stage so the waits don't trip the idle reaper. Any non-busy error still returns at once.

2. voyage: restore real planning headroom (defaultMaxTokensPerBatch 100K → 80K). estimateTokens divides bytes by 2, but dense code runs as hot as ~1.4 bytes/token, so a POST can carry ~1.43x the estimated tokens. The old 100K target (17% headroom under Voyage's 120K cap) couldn't cover that ~43% undercount, so estimated-95K batches shipped as real-122K POSTs and tripped the cap, forcing adaptive bisect+retry on every hot file. 80K keeps the worst case (~114K) under the cap; bisecting drops back to a true outlier safety net. Cost: ~20% smaller batches on prose.

Tests: IndexDir rides out an ErrBusy-then-OK embedder and still indexes the file; voyage TPM/burst comments updated for the 80K budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Two comment-only fixes surfaced in review of the consolidated PR:

- gitrepos.ReindexProject: the comment claimed clearing indexed_sha routes the clone handler "through the full-reindex branch". Since the reconcile rework, an empty IndexedSHA routes to reconcile (resume); what actually forces a full wipe is ClonePayload.ForceFull, checked first in handleClone's mode switch. Clarify that the indexed_sha clear is purely for dashboard "uncommitted" immediacy, not mode determination.
- indexer.embedPrepared: document that cross-file batching couples a group's fate on a NON-fatal embed error (one file's failure skips the whole group this pass), why it's acceptable (reconcile retries skipped files), and the residual edge (a persistently-failing file can poison its grouped neighbours).

No behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix: indexing OOM on large repos + resume/reliability + faster embedding

dvcdsys and others added 8 commits June 6, 2026 22:26

Merge pull request #78 from dvcdsys/fix/indexing-oom-resume-speed

a793202

fix: indexing OOM on large repos + resume/reliability + faster embedding

dvcdsys merged commit 47df3f5 into main Jun 6, 2026
9 checks passed
