
fix: indexing OOM on large repos + resume/reliability + faster embedding #78

Merged
dvcdsys merged 7 commits into develop from fix/indexing-oom-resume-speed
Jun 6, 2026
@dvcdsys dvcdsys commented Jun 6, 2026

Summary

Indexing a large external repo (reproduced with github.com/microsoft/vscode) drove cix-server RSS to 9–100 GB and stalled at 0 files processed, taking the host down with it. This PR fixes the OOM and bundles the related indexing reliability/throughput work — including a follow-up round of fixes for a session-leak + embedding-backpressure deadlock surfaced while indexing vscode under the Voyage provider.

Root cause (the OOM)

Heap profiling (-inuse_space) showed the live heap was almost entirely gotreesitter parser tables, node arenas and GLR scratch. Parse's error-recovery/snippet machinery calls NewParser repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for TypeScript) every call; the copies piled up in process-global parser pools faster than the GC could reclaim them. The isolated chunker never reproduced it (~1.8 GB) — only the live server's runtime did.

The pinned gotreesitter predated the upstream fixes for exactly this. Bumping to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Root cause (the stuck-error deadlock)

While indexing vscode under Voyage, a single transient "embedding queue saturated, retry after 5s" error (ErrBusy) failed the whole run. Two compounding defects turned a retryable backpressure signal into a multi-minute outage:

  1. Session leak: repoindexer.IndexDir released the indexer session only on success (FinishIndexing) or explicit force-stop. Any mid-run error left the session active until the idle-timeout reaped it, so every queue retry and every manual Sync bounced off ErrSessionConflict ("session already active") in the meantime. The failure that triggered the retry also blocked it.
  2. ErrBusy treated as fatal: the HTTP/CLI path honours ErrBusy via 503 + Retry-After, but the in-process driver propagated it as a permanent walk failure with no retry.

Fixed by releasing the session on abort (FailIndexing) and by riding out ErrBusy with hinted backoff in-process. A related Voyage planning-headroom fix removes the bisect noise that aggravated the saturation.

What's in here (commit-by-commit)

  1. fix(deps) — bump gotreesitter v0.20.2 (the OOM fix).
  2. feat(config) — dashboard-tunable embed batch size (index_embed_batch_chunks).
  3. feat(indexer) — parallel cross-file embedding pipeline; idle-based 10-min session TTL (no more silent wall-clock kill of long indexes); resume-on-restart via stored file SHAs + ForceFull; orphaned-job recovery + ReconcileStuckProjects; cancellable jobs context for clean shutdown.
  4. feat(server) — opt-in localhost pprof endpoint (CIX_PPROF_ADDR).
  5. fix(indexer) — release session when an in-process run aborts mid-walk. New Service.FailIndexing(projectPath, runID) removes the active session immediately and marks the run failed, without flipping projects.status (repojobs owns the terminal state) or setting a user-cancel tombstone. IndexDir defers it on every abort path; idempotent on success / already-removed sessions. Retries now re-enter via reconcile-resume instead of ErrSessionConflict.
  6. fix(indexer) — ride out embedding-queue backpressure instead of failing. flush() now backs off the ErrBusy Retry-After hint (capped 30s, total bounded 5m) and retries the same batch; non-busy errors still return at once. Voyage defaultMaxTokensPerBatch 100K → 80K: estimateTokens divides bytes by 2 but dense code runs ~1.4 bytes/token (a ~43% undercount), so the old 17%-headroom target shipped estimated-95K batches as real-122K POSTs that tripped Voyage's 120K cap and forced adaptive bisect+retry on every hot file. 80K keeps the worst case (~114K) under the cap.
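
The headroom arithmetic in item 6 can be checked with a few lines of Go. The three constants are the figures quoted in this PR; the function name `worstCaseTokens` is hypothetical, and the result is the planning worst case at 1.4 bytes/token, not a measured batch size.

```go
package main

import "fmt"

const (
	estimatorBytesPerToken = 2.0     // estimateTokens divides bytes by 2 (per the PR)
	denseBytesPerToken     = 1.4     // dense code, per the PR
	voyageCap              = 120_000 // Voyage per-request token cap, per the PR
)

// worstCaseTokens returns the real token count a batch may carry when it was
// planned at `budget` estimated tokens and every byte is dense code.
func worstCaseTokens(budget int) float64 {
	return float64(budget) * estimatorBytesPerToken / denseBytesPerToken
}

func main() {
	for _, budget := range []int{100_000, 80_000} {
		worst := worstCaseTokens(budget)
		fmt.Printf("budget=%d worst=%.0f fits=%t\n", budget, worst, worst <= voyageCap)
	}
	// budget=100000 worst=142857 fits=false
	// budget=80000 worst=114286 fits=true
}
```

The ratio 2 / 1.4 ≈ 1.43 is the ~43% undercount: a 100K plan can exceed the cap on dense input, while an 80K plan stays about 5% below it.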

Verification

  • vscode index: ~2.5 GB stable (was 9–100 GB / stalled).
  • Full server test suite: 40 packages pass, 0 failures.
  • New regression tests: TestFailIndexing_ReleasesSessionForRetry (orphaned session released for retry, run marked failed, idempotent) and TestIndexDir_RetriesEmbeddingQueueSaturation (rides out an ErrBusy-then-OK embedder and still indexes the file).

⚠️ Notes for reviewer / deploy

  • Re-index required — the newer grammars change chunk output for some languages.
  • Known caveat: gotreesitter ≥ v0.19.0 has a C-grammar regression where an enum corrupts the surrounding parse, so functions in such files degrade from function chunks to generic module chunks (content still indexed/searchable). Documented + tracked via skipped TestChunkFile_C_EnumRegression; worth filing upstream.

🤖 Generated with Claude Code

dvcdsys and others added 6 commits June 6, 2026 22:26
Indexing a large repo (e.g. github.com/microsoft/vscode) drove cix-server
RSS to 9-100 GB and stalled at 0 files processed. Profiling (inuse_space)
showed the live heap was almost entirely gotreesitter parser tables, arenas
and GLR scratch: Parse's error-recovery/snippet machinery calls NewParser
repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for
TypeScript) every call; the copies accumulated in process-global parser pools
faster than the GC could reclaim them.

The old pinned version predates the upstream fixes for exactly this. Bumping
to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge
caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Known caveat: gotreesitter >= v0.19.0 has a C-grammar regression where an
`enum` corrupts the surrounding parse, so functions in such files degrade
from `function` chunks to generic `module` chunks (content still indexed).
Documented and tracked via skipped TestChunkFile_C_EnumRegression.

Re-index required: the newer grammars change chunk output for some languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add index_embed_batch_chunks to the runtime config layer (DB -> env ->
recommended) so operators can tune cross-file embedding batch size from the
dashboard Advanced section instead of only via CIX_INDEX_EMBED_BATCH_CHUNKS.

Wires the field through config, runtimecfg (Snapshot/Patch/Recommended/Get/Set/
ApplyTo), the runtime_settings table (schema + idempotent migration), the admin
runtime-config API (payloads + validation), the OpenAPI spec + generated stubs,
and the dashboard UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Overhauls the indexing pipeline for correctness on restart and throughput on
large repos:

- Parallel, cross-file-batched embedding: ProcessFilesStreaming splits into
  PREPARE (sequential chunk) / EMBED (parallel worker pool, cross-file batched
  via planEmbedGroups) / WRITE (serial vector-store + per-file DB tx), driven by
  a tunable concurrency + batch size (SetEmbedTuningLookup).

- Idle-based session TTL (10 min): sessions are reaped only after no file has
  been processed for the TTL window (measured against lastActivity, bumped per
  file), so an actively-progressing multi-hour index is never aborted. Replaces
  the old wall-clock cap that silently killed long indexes.

- Resume on restart: repoindexer reconciles against stored file SHAs and skips
  unchanged files instead of re-scanning from zero. IndexDir gains an explicit
  `wipe` flag; ClonePayload.ForceFull drives the dashboard "full reindex".

- Honest stuck-state recovery: recoverOrphanedJobs requeues 'running' jobs on
  boot; ReconcileStuckProjects flips externally-driven projects left in
  'indexing' with no job to 'error' so the operator can Sync. A gone-tombstone
  (ConsumeGoneReason) lets the job layer distinguish user-cancel from failure.

- Clean shutdown: jobs run under a cancellable context cancelled before Stop,
  ending the SQLITE_INTERRUPT ("interrupted (9)") log flood and Killed-on-hang.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a localhost-only net/http/pprof listener, started only when CIX_PPROF_ADDR
is set (off by default). Used to diagnose the indexing OOM (heap inuse_space
profiling pinpointed gotreesitter as the allocator); kept as a standing
diagnostic for future memory/CPU investigations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
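
An opt-in, loopback-only pprof listener of the kind described above might look like this. It is a sketch built on the standard net/http/pprof package; only the CIX_PPROF_ADDR variable comes from the commit message, and the helper names `pprofMux` and `isLoopback` are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

// pprofMux builds a dedicated mux so the profiling handlers live on their
// own listener rather than on the main API server.
func pprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// isLoopback reports whether addr (host:port) names a loopback host.
func isLoopback(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	addr := os.Getenv("CIX_PPROF_ADDR")
	if addr == "" {
		return // off by default
	}
	if !isLoopback(addr) {
		fmt.Fprintln(os.Stderr, "refusing non-loopback pprof address:", addr)
		return
	}
	if err := http.ListenAndServe(addr, pprofMux()); err != nil {
		fmt.Fprintln(os.Stderr, "pprof listener:", err)
	}
}
```

Assuming the variable is set to something like 127.0.0.1:6060 (the port is an arbitrary choice here), a live-heap profile can then be pulled with `go tool pprof -inuse_space http://127.0.0.1:6060/debug/pprof/heap`.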
A mid-run failure in repoindexer.IndexDir (e.g. a transient "embedding
queue saturated" backpressure error, which is by-design retryable) left
the indexer session status="active" until ttlCleanup reaped it after the
idle timeout. In that window every queue retry and every manual Sync
called BeginIndexing, found the orphaned session, and failed with
ErrSessionConflict ("session already active") — the failure that
triggered the retry also blocked it, stranding the project in 'error'
for minutes until the idle reap.

Add Service.FailIndexing(projectPath, runID): releases the in-memory
session immediately and marks the run 'failed', without flipping
projects.status (repojobs owns the terminal state) and without a
"user-cancel" tombstone (so a later ErrNoSession still reads as an
involuntary loss → retry/resume). IndexDir now defers it on every abort
path, idempotently no-opping on success and on already-removed sessions
(force-stop / idle reap). The retry then re-enters via reconcile-resume
instead of bouncing off the conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two changes that stop a transient "embedding queue saturated" from
killing a server-side repo index.

1. repoindexer: treat ErrBusy as retryable backpressure.
   ErrBusy ("retry after Ns") is the queue asking the caller to slow
   down — the HTTP/CLI path honours it via 503 + Retry-After, but the
   in-process driver propagated it as a permanent walk failure. Combined
   with the (now-fixed) session leak, one transient saturation stranded
   the project in 'error' for minutes. flush() now backs off the hinted
   delay (capped at 30s, total bounded at 5m) and retries the SAME batch;
   ProcessFiles bumps lastActivity in its prepare stage so the waits
   don't trip the idle reaper. Any non-busy error still returns at once.

2. voyage: restore real planning headroom (defaultMaxTokensPerBatch
   100K → 80K). estimateTokens divides bytes by 2, but dense code runs
   as hot as ~1.4 bytes/token, so a POST can carry ~1.43x the estimated
   tokens. The old 100K target (17% headroom under Voyage's 120K cap)
   couldn't cover that ~43% undercount, so estimated-95K batches shipped
   as real-122K POSTs and tripped the cap, forcing adaptive bisect+retry
   on every hot file. 80K keeps the worst case (~114K) under the cap;
   bisecting drops back to a true outlier safety net. Cost: ~20% smaller
   batches on prose.

Tests: IndexDir rides out an ErrBusy-then-OK embedder and still indexes
the file; voyage TPM/burst comments updated for the 80K budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dvcdsys commented Jun 6, 2026

Pushed two follow-up commits (963df02, d507e17) fixing a session-leak + embedding-backpressure deadlock found while indexing vscode under Voyage:

  • Session leak: a mid-run abort (e.g. a transient "embedding queue saturated" error) left the indexer session active until the idle-timeout, so every retry/Sync bounced off ErrSessionConflict. IndexDir now releases the session on abort via FailIndexing.
  • ErrBusy was fatal in-process: flush() now backs off for the Retry-After hint and retries the batch (the HTTP/CLI path already did this via 503).
  • Voyage headroom: defaultMaxTokensPerBatch 100K → 80K so the byte→token estimator's ~43% undercount on dense code stops tripping Voyage's 120K cap and forcing bisect+retry.

Two new regression tests; full suite still 40 pkgs / 0 failures. Body updated with the details.

Two comment-only fixes surfaced in review of the consolidated PR:

- gitrepos.ReindexProject: the comment claimed clearing indexed_sha routes
  the clone handler "through the full-reindex branch". Since the reconcile
  rework, an empty IndexedSHA routes to reconcile (resume); what actually
  forces a full wipe is ClonePayload.ForceFull, checked first in handleClone's
  mode switch. Clarify that the indexed_sha clear is purely for dashboard
  "uncommitted" immediacy, not mode determination.

- indexer.embedPrepared: document that cross-file batching couples a group's
  fate on a NON-fatal embed error (one file's failure skips the whole group
  this pass), why it's acceptable (reconcile retries skipped files), and the
  residual edge (a persistently-failing file can poison its grouped neighbours).

No behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dvcdsys dvcdsys merged commit a793202 into develop Jun 6, 2026
1 check passed
@dvcdsys dvcdsys deleted the fix/indexing-oom-resume-speed branch June 6, 2026 22:12