Indexing a large repo (e.g. github.com/microsoft/vscode) drove cix-server RSS to 9–100 GB and stalled at 0 files processed.
Profiling (inuse_space) showed the live heap was almost entirely gotreesitter parser tables, arenas and GLR scratch: Parse's error-recovery/snippet machinery calls NewParser repeatedly, and NewParser rebuilds the grammar's full LR tables (~175 MB for TypeScript) on every call; the copies accumulated in process-global parser pools faster than the GC could reclaim them. The old pinned version predates the upstream fixes for exactly this.
Bumping to v0.20.2 (bounded recovery sub-parses, recovery-parser pooling, GLR merge caps, arena reuse) keeps the vscode index stable at ~2.5 GB and progressing.
Known caveat: gotreesitter >= v0.19.0 has a C-grammar regression where an `enum` corrupts the surrounding parse, so functions in such files degrade from `function` chunks to generic `module` chunks (content still indexed). Documented and tracked via skipped TestChunkFile_C_EnumRegression.
Re-index required: the newer grammars change chunk output for some languages.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add index_embed_batch_chunks to the runtime config layer (DB -> env -> recommended) so operators can tune cross-file embedding batch size from the dashboard Advanced section instead of only via CIX_INDEX_EMBED_BATCH_CHUNKS.
Wires the field through config, runtimecfg (Snapshot/Patch/Recommended/Get/Set/ApplyTo), the runtime_settings table (schema + idempotent migration), the admin runtime-config API (payloads + validation), the OpenAPI spec + generated stubs, and the dashboard UI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
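The DB -> env -> recommended precedence can be pictured with a minimal sketch. The helper name `resolveBatchChunks`, its signature, and the rule that non-positive values are ignored are assumptions for illustration, not the runtimecfg API.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveBatchChunks is a hypothetical helper (not the real runtimecfg code).
// It returns the first usable value in the order:
// DB override -> environment variable -> recommended default.
func resolveBatchChunks(dbValue *int, envValue string, recommended int) (int, string) {
	if dbValue != nil && *dbValue > 0 {
		return *dbValue, "db"
	}
	if n, err := strconv.Atoi(envValue); err == nil && n > 0 {
		return n, "env"
	}
	return recommended, "recommended"
}

func main() {
	dashboard := 256
	fmt.Println(resolveBatchChunks(&dashboard, "128", 64)) // dashboard value wins
	fmt.Println(resolveBatchChunks(nil, "128", 64))        // falls back to env
	fmt.Println(resolveBatchChunks(nil, "", 64))           // falls back to recommended
}
```

In this sketch a value saved from the dashboard shadows the environment variable, and the environment variable only applies when no DB row exists.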
Overhauls the indexing pipeline for correctness on restart and throughput on
large repos:
- Parallel, cross-file-batched embedding: ProcessFilesStreaming splits into
PREPARE (sequential chunk) / EMBED (parallel worker pool, cross-file batched
via planEmbedGroups) / WRITE (serial vector-store + per-file DB tx), driven by
a tunable concurrency + batch size (SetEmbedTuningLookup).
- Idle-based session TTL (10 min): sessions are reaped only after no file has
been processed for the TTL window (measured against lastActivity, bumped per
file), so an actively-progressing multi-hour index is never aborted. Replaces
the old wall-clock cap that silently killed long indexes.
- Resume on restart: repoindexer reconciles against stored file SHAs and skips
unchanged files instead of re-scanning from zero. IndexDir gains an explicit
`wipe` flag; ClonePayload.ForceFull drives the dashboard "full reindex".
- Honest stuck-state recovery: recoverOrphanedJobs requeues 'running' jobs on
boot; ReconcileStuckProjects flips externally-driven projects left in
'indexing' with no job to 'error' so the operator can Sync. A gone-tombstone
(ConsumeGoneReason) lets the job layer distinguish user-cancel from failure.
- Clean shutdown: jobs run under a cancellable context cancelled before Stop,
ending the SQLITE_INTERRUPT ("interrupted (9)") log flood and Killed-on-hang.
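A rough sketch of the PREPARE / EMBED split described in the first bullet, assuming a simple fixed-size grouping rule. `groupChunks` and `embedAll` are hypothetical stand-ins; the real planEmbedGroups may plan groups differently.

```go
package main

import (
	"fmt"
	"sync"
)

type chunk struct {
	File string
	Text string
}

// groupChunks flattens chunks from all files in order and cuts a new group
// whenever the current one reaches batchSize, so a group may span files.
func groupChunks(perFile [][]chunk, batchSize int) [][]chunk {
	if batchSize < 1 {
		batchSize = 1
	}
	var groups [][]chunk
	var cur []chunk
	for _, chunks := range perFile {
		for _, c := range chunks {
			cur = append(cur, c)
			if len(cur) == batchSize {
				groups = append(groups, cur)
				cur = nil
			}
		}
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}

// embedAll runs embed over the groups with a bounded worker pool and
// returns one error slot per group (nil on success).
func embedAll(groups [][]chunk, workers int, embed func([]chunk) error) []error {
	if workers < 1 {
		workers = 1
	}
	errs := make([]error, len(groups))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				errs[i] = embed(groups[i]) // each index is written by one worker only
			}
		}()
	}
	for i := range groups {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return errs
}

func main() {
	perFile := [][]chunk{
		{{"a.go", "a1"}, {"a.go", "a2"}},
		{{"b.go", "b1"}},
		{{"c.go", "c1"}, {"c.go", "c2"}, {"c.go", "c3"}},
	}
	groups := groupChunks(perFile, 4)
	errs := embedAll(groups, 2, func(g []chunk) error { return nil })
	fmt.Println(len(groups), len(errs))
}
```

The WRITE stage would then consume the results serially, which is why the sketch keeps errors indexed by group rather than streaming them.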
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a localhost-only net/http/pprof listener, started only when CIX_PPROF_ADDR is set (off by default).
Used to diagnose the indexing OOM (heap inuse_space profiling pinpointed gotreesitter as the allocator); kept as a standing diagnostic for future memory/CPU investigations.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
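One way such an opt-in listener could be wired, as a sketch only. The `isLoopback` check and the dedicated mux are assumptions about how "localhost-only" might be enforced; the commit does not describe the actual mechanism.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

// pprofMux registers the standard pprof handlers on a private mux so they
// are not attached to the main API router.
func pprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// isLoopback reports whether addr ("host:port") names a loopback host.
// Hypothetical guard: an empty host (":6060") is rejected because it
// would bind every interface.
func isLoopback(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	addr := os.Getenv("CIX_PPROF_ADDR")
	if addr == "" {
		return // off by default
	}
	if !isLoopback(addr) {
		log.Fatalf("refusing non-loopback pprof address %q", addr)
	}
	log.Fatal(http.ListenAndServe(addr, pprofMux()))
}
```

With `CIX_PPROF_ADDR=127.0.0.1:6060` set, a heap profile like the one used for the OOM diagnosis could be pulled with `go tool pprof -sample_index=inuse_space http://127.0.0.1:6060/debug/pprof/heap`.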
A mid-run failure in repoindexer.IndexDir (e.g. a transient "embedding
queue saturated" backpressure error, which is by-design retryable) left
the indexer session status="active" until ttlCleanup reaped it after the
idle timeout. In that window every queue retry and every manual Sync
called BeginIndexing, found the orphaned session, and failed with
ErrSessionConflict ("session already active") — the failure that
triggered the retry also blocked it, stranding the project in 'error'
for minutes until the idle reap.
Add Service.FailIndexing(projectPath, runID): releases the in-memory
session immediately and marks the run 'failed', without flipping
projects.status (repojobs owns the terminal state) and without a
"user-cancel" tombstone (so a later ErrNoSession still reads as an
involuntary loss → retry/resume). IndexDir now defers it on every abort
path, idempotently no-opping on success and on already-removed sessions
(force-stop / idle reap). The retry then re-enters via reconcile-resume
instead of bouncing off the conflict.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two changes that stop a transient "embedding queue saturated" from
killing a server-side repo index.
1. repoindexer: treat ErrBusy as retryable backpressure.
ErrBusy ("retry after Ns") is the queue asking the caller to slow
down — the HTTP/CLI path honours it via 503 + Retry-After, but the
in-process driver propagated it as a permanent walk failure. Combined
with the (now-fixed) session leak, one transient saturation stranded
the project in 'error' for minutes. flush() now backs off the hinted
delay (capped at 30s, total bounded at 5m) and retries the SAME batch;
ProcessFiles bumps lastActivity in its prepare stage so the waits
don't trip the idle reaper. Any non-busy error still returns at once.
2. voyage: restore real planning headroom (defaultMaxTokensPerBatch
100K → 80K). estimateTokens divides bytes by 2, but dense code runs
as hot as ~1.4 bytes/token, so a POST can carry ~1.43x the estimated
tokens. The old 100K target (17% headroom under Voyage's 120K cap)
couldn't cover that ~43% undercount, so estimated-95K batches shipped
as real-122K POSTs and tripped the cap, forcing adaptive bisect+retry
on every hot file. 80K keeps the worst case (~114K) under the cap;
bisecting drops back to a true outlier safety net. Cost: ~20% smaller
batches on prose.
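The headroom arithmetic in item 2 can be checked with a few lines. The constants are taken from the text above; `worstCaseTokens` is a hypothetical helper, not part of the voyage package.

```go
package main

import "fmt"

const (
	estimateBytesPerToken = 2.0     // estimateTokens divides bytes by 2
	denseBytesPerToken    = 1.4     // dense code, per the commit message
	voyageCap             = 120_000 // Voyage's per-request token cap
)

// worstCaseTokens converts an estimated-token budget into the real token
// count if the whole batch were dense code.
func worstCaseTokens(budget int) float64 {
	bytes := float64(budget) * estimateBytesPerToken
	return bytes / denseBytesPerToken
}

func main() {
	for _, budget := range []int{100_000, 80_000} {
		real := worstCaseTokens(budget)
		fmt.Printf("budget=%d worst-case=%.0f under-cap=%v\n",
			budget, real, real <= voyageCap)
	}
}
```

The ratio 2 / 1.4 is about 1.43, so an 80K budget tops out near 114K real tokens while a 100K budget can reach roughly 143K, well over the cap.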
Tests: IndexDir rides out an ErrBusy-then-OK embedder and still indexes
the file; voyage TPM/burst comments updated for the 80K budget.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two comment-only fixes surfaced in review of the consolidated PR:
- gitrepos.ReindexProject: the comment claimed clearing indexed_sha routes the clone handler "through the full-reindex branch". Since the reconcile rework, an empty IndexedSHA routes to reconcile (resume); what actually forces a full wipe is ClonePayload.ForceFull, checked first in handleClone's mode switch. Clarify that the indexed_sha clear is purely for dashboard "uncommitted" immediacy, not mode determination.
- indexer.embedPrepared: document that cross-file batching couples a group's fate on a NON-fatal embed error (one file's failure skips the whole group this pass), why it's acceptable (reconcile retries skipped files), and the residual edge (a persistently-failing file can poison its grouped neighbours).
No behaviour change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix: indexing OOM on large repos + resume/reliability + faster embedding
Promote develop → main for a server release. Bundles PR #78 (the only delta over main).
What's in here
- gotreesitter `v0.20.2`: fixes the 9–100 GB RSS OOM on large external repos (e.g. vscode); index now stable at ~2.5 GB.
- … `ForceFull`; orphaned-job recovery + `ReconcileStuckProjects`; cancellable jobs context for clean shutdown.
- … (`index_embed_batch_chunks`), wired through config → DB (migration 15) → runtimecfg → admin API → OpenAPI → dashboard.
- … (`CIX_PPROF_ADDR`, off by default).
- … (`FailIndexing`) so retries re-enter via reconcile-resume instead of `ErrSessionConflict`; ride out embedding-queue `ErrBusy` backpressure with bounded backoff; Voyage planning headroom 100K→80K to stop bisect churn on dense code.
Verification
- `go build ./...`, `go vet`, `go test -race` on all changed packages: green.
- `make openapi-check`: generated stubs in sync.
- … `function` chunks to `module` in C files containing an `enum` (content still searchable). Tracked via skipped `TestChunkFile_C_EnumRegression`.
🤖 Generated with Claude Code