release: indexing OOM fix + resume/reliability + faster embedding #79

Merged: dvcdsys merged 8 commits into main from develop on Jun 6, 2026

Conversation

dvcdsys (Owner) commented on Jun 6, 2026

Promote develop → main for a server release. Bundles PR #78 (the only delta over main).

What's in here

  • fix(deps): bump gotreesitter v0.20.2 — fixes the 9–100 GB RSS OOM on large external repos (e.g. vscode); index now stable at ~2.5 GB.
  • feat(indexer): parallel cross-file embedding pipeline (PREPARE → EMBED → WRITE); idle-based 10-min session TTL (no wall-clock kill of long indexes); resume-on-restart via stored file SHAs + ForceFull; orphaned-job recovery + ReconcileStuckProjects; cancellable jobs context for clean shutdown.
  • feat(config): dashboard-tunable embed batch size (index_embed_batch_chunks), wired through config → DB (migration 15) → runtimecfg → admin API → OpenAPI → dashboard.
  • feat(server): opt-in localhost pprof endpoint (CIX_PPROF_ADDR, off by default).
  • fix(indexer): release session on mid-walk abort (FailIndexing) so retries re-enter via reconcile-resume instead of ErrSessionConflict; ride out embedding-queue ErrBusy backpressure with bounded backoff; Voyage planning headroom 100K→80K to stop bisect churn on dense code.
  • docs: review follow-ups — corrected force-full routing comment; documented cross-file-batch group coupling.

Verification

⚠️ Deploy notes

  • Re-index required — newer grammars change chunk output for some languages.
  • Known caveat: gotreesitter ≥ v0.19.0 C-enum regression degrades function chunks to module in C files containing an enum (content still searchable). Tracked via skipped TestChunkFile_C_EnumRegression.

🤖 Generated with Claude Code

dvcdsys and others added 8 commits June 6, 2026 22:26
Indexing a large repo (e.g. github.com/microsoft/vscode) drove cix-server
RSS to 9-100 GB and stalled at 0 files processed. Profiling (inuse_space)
showed the live heap was almost entirely gotreesitter parser tables, arenas
and GLR scratch: Parse's error-recovery/snippet machinery calls NewParser
repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for
TypeScript) every call; the copies accumulated in process-global parser pools
faster than the GC could reclaim them.

The old pinned version predates the upstream fixes for exactly this. Bumping
to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge
caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.

Known caveat: gotreesitter >= v0.19.0 has a C-grammar regression where an
`enum` corrupts the surrounding parse, so functions in such files degrade
from `function` chunks to generic `module` chunks (content still indexed).
Documented and tracked via skipped TestChunkFile_C_EnumRegression.

Re-index required: the newer grammars change chunk output for some languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add index_embed_batch_chunks to the runtime config layer (DB -> env ->
recommended) so operators can tune cross-file embedding batch size from the
dashboard Advanced section instead of only via CIX_INDEX_EMBED_BATCH_CHUNKS.

Wires the field through config, runtimecfg (Snapshot/Patch/Recommended/Get/Set/
ApplyTo), the runtime_settings table (schema + idempotent migration), the admin
runtime-config API (payloads + validation), the OpenAPI spec + generated stubs,
and the dashboard UI.
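
The DB -> env -> recommended precedence described above can be sketched roughly as follows. This is a minimal illustration, not the project's runtimecfg code: `resolveEmbedBatchChunks` and the default value of 64 are hypothetical, and the "must be positive" validation rule is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// recommendedEmbedBatchChunks is a placeholder default chosen for this sketch;
// the real recommended value is not stated in the commit message.
const recommendedEmbedBatchChunks = 64

// resolveEmbedBatchChunks picks the effective batch size: a value stored in
// the DB wins, then the environment variable, then the recommended default.
// dbValue is nil when the dashboard has not set anything.
func resolveEmbedBatchChunks(dbValue *int, envValue string) (value int, source string) {
	if dbValue != nil && *dbValue > 0 {
		return *dbValue, "db"
	}
	if n, err := strconv.Atoi(envValue); err == nil && n > 0 {
		return n, "env"
	}
	return recommendedEmbedBatchChunks, "recommended"
}

func main() {
	dashboard := 128
	fmt.Println(resolveEmbedBatchChunks(&dashboard, "32")) // DB value wins
	fmt.Println(resolveEmbedBatchChunks(nil, "32"))        // falls back to env
	fmt.Println(resolveEmbedBatchChunks(nil, ""))          // falls back to default
}
```

In the real server the `envValue` argument would presumably come from `CIX_INDEX_EMBED_BATCH_CHUNKS`; it is passed in explicitly here to keep the function easy to test.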

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Overhauls the indexing pipeline for correctness on restart and throughput on
large repos:

- Parallel, cross-file-batched embedding: ProcessFilesStreaming splits into
  PREPARE (sequential chunk) / EMBED (parallel worker pool, cross-file batched
  via planEmbedGroups) / WRITE (serial vector-store + per-file DB tx), driven by
  a tunable concurrency + batch size (SetEmbedTuningLookup).

- Idle-based session TTL (10 min): sessions are reaped only after no file has
  been processed for the TTL window (measured against lastActivity, bumped per
  file), so an actively-progressing multi-hour index is never aborted. Replaces
  the old wall-clock cap that silently killed long indexes.

- Resume on restart: repoindexer reconciles against stored file SHAs and skips
  unchanged files instead of re-scanning from zero. IndexDir gains an explicit
  `wipe` flag; ClonePayload.ForceFull drives the dashboard "full reindex".

- Honest stuck-state recovery: recoverOrphanedJobs requeues 'running' jobs on
  boot; ReconcileStuckProjects flips externally-driven projects left in
  'indexing' with no job to 'error' so the operator can Sync. A gone-tombstone
  (ConsumeGoneReason) lets the job layer distinguish user-cancel from failure.

- Clean shutdown: jobs run under a cancellable context cancelled before Stop,
  ending the SQLITE_INTERRUPT ("interrupted (9)") log flood and Killed-on-hang.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a localhost-only net/http/pprof listener, started only when CIX_PPROF_ADDR
is set (off by default). Used to diagnose the indexing OOM (heap inuse_space
profiling pinpointed gotreesitter as the allocator); kept as a standing
diagnostic for future memory/CPU investigations.
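
An opt-in pprof listener of this kind might look like the sketch below. The helper names (`pprofMux`, `loopbackOnly`, `startPprof`) are hypothetical, and rejecting non-loopback addresses is an assumption about how "localhost-only" could be enforced; only the `CIX_PPROF_ADDR` variable and the off-by-default behaviour come from the commit message.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

// pprofMux builds a dedicated mux so the profiling handlers are served only
// on the diagnostic listener, never on the public router.
func pprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// loopbackOnly rejects any address whose host is not a loopback address.
func loopbackOnly(addr string) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return err
	}
	if host == "localhost" {
		return nil
	}
	if ip := net.ParseIP(host); ip == nil || !ip.IsLoopback() {
		return fmt.Errorf("pprof address %q is not a loopback address", addr)
	}
	return nil
}

// startPprof returns (nil, nil) when addr is empty, which is the default.
func startPprof(addr string) (net.Listener, error) {
	if addr == "" {
		return nil, nil
	}
	if err := loopbackOnly(addr); err != nil {
		return nil, err
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	go http.Serve(ln, pprofMux()) // returns once the listener is closed
	return ln, nil
}

func main() {
	ln, err := startPprof(os.Getenv("CIX_PPROF_ADDR"))
	switch {
	case err != nil:
		fmt.Println("pprof not started:", err)
	case ln == nil:
		fmt.Println("pprof disabled (CIX_PPROF_ADDR unset)")
	default:
		defer ln.Close()
		fmt.Println("pprof listening on", ln.Addr())
	}
}
```

With the listener running, a heap profile like the one used in the OOM investigation can be fetched with `go tool pprof -sample_index=inuse_space http://<addr>/debug/pprof/heap`, where `<addr>` is whatever `CIX_PPROF_ADDR` was set to.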

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A mid-run failure in repoindexer.IndexDir (e.g. a transient "embedding
queue saturated" backpressure error, which is by-design retryable) left
the indexer session status="active" until ttlCleanup reaped it after the
idle timeout. In that window every queue retry and every manual Sync
called BeginIndexing, found the orphaned session, and failed with
ErrSessionConflict ("session already active") — the failure that
triggered the retry also blocked it, stranding the project in 'error'
for minutes until the idle reap.

Add Service.FailIndexing(projectPath, runID): releases the in-memory
session immediately and marks the run 'failed', without flipping
projects.status (repojobs owns the terminal state) and without a
"user-cancel" tombstone (so a later ErrNoSession still reads as an
involuntary loss → retry/resume). IndexDir now defers it on every abort
path, idempotently no-opping on success and on already-removed sessions
(force-stop / idle reap). The retry then re-enters via reconcile-resume
instead of bouncing off the conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two changes that stop a transient "embedding queue saturated" from
killing a server-side repo index.

1. repoindexer: treat ErrBusy as retryable backpressure.
   ErrBusy ("retry after Ns") is the queue asking the caller to slow
   down — the HTTP/CLI path honours it via 503 + Retry-After, but the
   in-process driver propagated it as a permanent walk failure. Combined
   with the (now-fixed) session leak, one transient saturation stranded
   the project in 'error' for minutes. flush() now backs off the hinted
   delay (capped at 30s, total bounded at 5m) and retries the SAME batch;
   ProcessFiles bumps lastActivity in its prepare stage so the waits
   don't trip the idle reaper. Any non-busy error still returns at once.

2. voyage: restore real planning headroom (defaultMaxTokensPerBatch
   100K → 80K). estimateTokens divides bytes by 2, but dense code runs
   as hot as ~1.4 bytes/token, so a POST can carry ~1.43x the estimated
   tokens. The old 100K target (17% headroom under Voyage's 120K cap)
   couldn't cover that ~43% undercount, so estimated-95K batches shipped
   as real-122K POSTs and tripped the cap, forcing adaptive bisect+retry
   on every hot file. 80K keeps the worst case (~114K) under the cap;
   bisecting drops back to a true outlier safety net. Cost: ~20% smaller
   batches on prose.
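
The headroom arithmetic in item 2 can be checked with a few lines of Go. The constant and function names are made up for the sketch; the numbers (2 bytes/token estimate, ~1.4 bytes/token for dense code, 120K cap) are the ones quoted above.

```go
package main

import "fmt"

const (
	voyageCapTokens       = 120_000 // hard per-request cap quoted above
	estimateBytesPerToken = 2.0     // what the planner assumes
	denseBytesPerToken    = 1.4     // what dense code can actually reach
)

// worstCaseTokens converts a planned (estimated) token budget into the real
// token count if every byte in the batch were dense code.
func worstCaseTokens(estimated int) float64 {
	bytes := float64(estimated) * estimateBytesPerToken
	return bytes / denseBytesPerToken
}

// fitsCap reports whether the worst case still stays under the cap.
func fitsCap(estimated int) bool {
	return worstCaseTokens(estimated) <= voyageCapTokens
}

func main() {
	for _, budget := range []int{100_000, 80_000} {
		fmt.Printf("budget=%d worst-case=%.0f fits=%v\n",
			budget, worstCaseTokens(budget), fitsCap(budget))
	}
}
```

The ratio 2 / 1.4 ≈ 1.43 is the undercount factor: a 100K plan can reach roughly 143K real tokens, while an 80K plan tops out near 114K, which is under the 120K cap.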

Tests: IndexDir rides out an ErrBusy-then-OK embedder and still indexes
the file; voyage TPM/burst comments updated for the 80K budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two comment-only fixes surfaced in review of the consolidated PR:

- gitrepos.ReindexProject: the comment claimed clearing indexed_sha routes
  the clone handler "through the full-reindex branch". Since the reconcile
  rework, an empty IndexedSHA routes to reconcile (resume); what actually
  forces a full wipe is ClonePayload.ForceFull, checked first in handleClone's
  mode switch. Clarify that the indexed_sha clear is purely for dashboard
  "uncommitted" immediacy, not mode determination.

- indexer.embedPrepared: document that cross-file batching couples a group's
  fate on a NON-fatal embed error (one file's failure skips the whole group
  this pass), why it's acceptable (reconcile retries skipped files), and the
  residual edge (a persistently-failing file can poison its grouped neighbours).

No behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix: indexing OOM on large repos + resume/reliability + faster embedding
@dvcdsys dvcdsys merged commit 47df3f5 into main Jun 6, 2026
9 checks passed