Skip to content

Extract vector encoding queue#25

Open
mariusvniekerk wants to merge 7 commits into
mainfrom
codex/extract-vector-encoding-queue
Open

Extract vector encoding queue#25
mariusvniekerk wants to merge 7 commits into
mainfrom
codex/extract-vector-encoding-queue

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Contributor

Vector encoding queues were previously only available inside msgvault, which made the crash-safe claim/release/complete mechanics hard to reuse in other Kenn tools. This PR moves that SQL-backed task queue into kit as an app-neutral helper while keeping schema ownership with callers.

The queue validates configured table and column identifiers before interpolating SQL, supports transactional bulk enqueue for callers that need to select groups and insert work atomically, and preserves msgvault's existing pending_embeddings table shape as the default schema.

Validation: go test ./...; go vet ./...

generated by a clanker

Vector embedding queues were only available inside msgvault, which made other Kenn tools copy or reimplement the same crash-safe claim/release mechanics. Moving the SQL-backed task queue into kit gives callers a shared, app-neutral helper while preserving msgvault's existing table shape.

The package keeps schema ownership with callers and validates SQL identifiers before interpolating table or column names, so future consumers can adapt the queue to their own storage without importing msgvault internals.

Validation: go test ./...; go vet ./...

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (e07897b)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 2m25s), codex_security (codex/security, done, 1m25s) | Total: 3m50s

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

how does this relate to kenn-io/msgvault#411?

@mariusvniekerk

Copy link
Copy Markdown
Contributor Author

doesn't yet? i guess we can reframe it based on that?

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

yeah, let met take a closer look at that and see if it makes sense to merge and then we can reassess. I agree it doesn't make sense for every project to have its own vector embedding pipeline management system

The initial extraction lifted msgvault's pending_embeddings claim queue, but that is storage and scheduling policy a consumer owns, and msgvault is already moving off it toward scan-and-fill (kenn-io/msgvault#411). Shipping it would have given kit a reusable package msgvault no longer uses.

The surface that is genuinely shared across callers — msgvault and kata both embed for search — is chunking, model-generation identity, batched encoding, and merging results across generations. This adds those as pure transforms, plus a Store[K,G] contract and Fill/Search flows that own the scan-and-fill and query-and-merge orchestration so a backend supplies only SQL.

Document and generation identity stay opaque (msgvault int64, kata UUID); storage and query construction live in backend subpackages. vector/sqlitevec is the reference backend on the same sqlite-vec binding msgvault uses, with per-generation vec0 tables so a model migration across differing dimensions still serves a union of live generations. This also fixes the go.mod tidy drift that was failing Go hygiene.

Validation: go build/vet/test ./... with CGO; vec0 exercised hermetically via the bundled sqlite-vec extension.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (061a0a3)

Medium-risk issues remain in the vector toolkit changes.

Medium

  • vector/sqlitevec/store.go:93 - SaveVectors ignores RowsAffected when stamping the source document. If the document was deleted between scan and save, or a caller passes a missing key, the transaction can commit orphan vector rows that QueryGeneration may later return. Check the update result and roll back when no source row was stamped.

  • vector/generation.go:42 - Fingerprint serializes params as key=value\n without escaping or length delimiters, so different configs can hash the same preimage, such as a value containing \nother=value. Use length-prefixed fields or a structured sorted encoding before hashing.


Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m27s), codex_security (codex/security, done, 2m2s) | Total: 6m35s

Two medium findings from the roborev panel on 061a0a3. Generation.Fingerprint built its preimage by joining params as key=value lines, so a value containing the separator could hash identically to a different param set — distinct vector spaces could share a fingerprint and silently skip a needed re-embed. It now hashes the JSON encoding, which escapes values and sorts map keys, so the preimage is unambiguous.

SaveVectors stamped the source document without checking the update hit a row, so if a document was deleted between scan and save (or a caller passed a missing key) the transaction committed vector rows with no backing document, which QueryGeneration would later return as orphan hits. It now checks RowsAffected and rolls back when nothing was stamped.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (792a47d)

Medium-risk issue found; no Critical or High findings.

Medium

  • vector/sqlitevec/store.go:180 - QueryGeneration joins KNN results only to the chunk map, so vectors for documents deleted after embedding can still be returned as hits. Filter results against Schema.DocsTable on the configured id column, or add source-row deletion cleanup; add a test that embeds a document, deletes it, and verifies search no longer returns it.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 2m16s) | Total: 5m39s

@mariusvniekerk

Copy link
Copy Markdown
Contributor Author

The comment by roborev is mostly worthless for the last one since thats more or less just an exampe

The fingerprint is persisted by callers and compared to decide whether to re-embed, so its stability has to survive future edits to Generation. The prior version hand-listed the fields to hash, which has a dangerous failure mode: a field added later would be silently excluded, letting two distinct vector spaces share a fingerprint and skip a needed re-embed.

Fingerprint now encodes the struct itself, so any field added later participates automatically, then re-encodes through a generic value. encoding/json sorts object keys at every level, so neither struct field order nor map insertion order affects the hash; UseNumber preserves numeric precision; and omitempty keeps an unused new field from shifting existing fingerprints. A pinned canonical-encoding test and a reflection tripwire on the field set make any change to the type or encoding fail CI rather than silently re-fingerprint every stored vector.

Validation: go vet/test ./vector/... with CGO.

Generated with Claude Code (Opus 4.8)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (2a9fde1)

PR has two medium findings; no high or critical issues were reported.

Medium

  • vector/encode.go:83: EncodeBatched checks failed() before blocking on the semaphore. If an in-flight batch sets firstErr while the loop is waiting for capacity, the next batch can still launch after the semaphore is released, causing extra encoder/provider calls despite the documented “stops launching work at the first error” behavior.
    Fix: Re-check firstErr and ctx.Err() immediately after acquiring the semaphore, release the slot, and stop dispatching if work should stop.

  • vector/sqlitevec/store.go:180: QueryGeneration joins KNN results only to the chunk map, not back to the caller’s documents table. If a document is deleted after vectors are saved, stale chunk rows can still return hits for a document that no longer exists.
    Fix: Join the configured documents table on IDColumn when returning hits, or add cascading cleanup for deleted documents, with a regression test for deleted documents not being returned.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m15s), codex_security (codex/security, done, 15s) | Total: 3m38s

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

Codex posting on behalf of Wes.

I like the direction this PR landed on after the pivot away from extracting the old pending_embeddings queue. That queue is no longer the durable shared abstraction for msgvault, especially with kenn-io/msgvault#411 moving embedding work to scan-and-fill coverage.

The reusable pieces here look like the right layer for kit:

  • generation identity/fingerprinting for model + dimension + embedding-affecting params
  • chunk splitting and batched encode orchestration
  • a backend-neutral Store[K,G] boundary
  • scan-and-fill flow over that store
  • live-generation search fanout plus merge semantics during active/building migrations
  • sqlitevec as a reference/default backend for new SQLite-backed projects

For msgvault specifically, I would not try to consume this immediately before merging kenn-io/msgvault#411. Msgvault still has extra production semantics around mutable message CAS, watermark/backstop recovery, skip/delete handling, lifecycle gates, stats, filters, and hybrid FTS/vector search. But those feel like product/backend-specific layers above this package, not arguments against the extraction.

One caveat for future evolution: the current Fill contract only passes {Doc, Content} through Pending, so projects with mutable source rows need to hide content-version/CAS checks inside their store implementation. If we want this to become the common production fill loop for msgvault-like systems, we may eventually want an explicit version token or CAS-aware save shape.

Overall: this seems like a good foundation for reusable vector infrastructure, and msgvault#411 is useful context for why scan-and-fill/generation merge is the right shared direction rather than a reusable pending queue.

Windows CI exposed a portability gap in the reference sqlitevec backend: the
sqlite-vec cgo binding assumes SQLite development headers are available, but
hosted Windows runners do not provide sqlite3.h. That made the package fail at
build time before any tests could run.

Using modernc everywhere would solve the portability problem, but it would
make the common Unix+cgo path slower for the workload this backend exists to
support. The benchmark added here keeps that tradeoff explicit: on this
darwin/arm64 run, mattn was roughly 4x faster for KNN queries and modestly
faster for vector replacement.

Keep the faster cgo-backed sqlite-vec path where it is viable, and use
modernc's headerless sqlite-vec support only where cgo is unavailable or
unsuitable. That gives Windows CI a buildable backend without regressing the
normal Unix performance path.
@mariusvniekerk mariusvniekerk force-pushed the codex/extract-vector-encoding-queue branch from 9c36745 to 005d19c Compare June 26, 2026 13:54
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (005d19c)

Summary verdict: two Medium issues need attention; no High or Critical findings were reported.

Medium

  • vector/sqlitevec/store.go:177
    QueryGeneration maps vector hits through the chunk table only, so stale chunk rows can still return document IDs after the source row is deleted or after the document is re-stamped to a different generation.
    Fix: Join the caller’s docs table in the query and require the document to still exist and match the queried generation stamp, with tests for deleted and re-stamped documents.

  • vector/sqlitevec/sqlitevec.go:128
    EnsureGeneration silently accepts an existing generation key with a different dimension, updates its state, and leaves the old vec table dimension in place. Later saves/searches can then fail against a generation that was just marked live.
    Fix: On conflict, compare the stored dimension to model.Dimensions and return an error if they differ.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 5m45s), codex_security (codex/security, done, 1m43s) | Total: 7m35s

wesm and others added 2 commits July 4, 2026 12:52
Fill previously stamped documents unconditionally, so an edit landing
between scan and save was silently marked embedded with stale vectors,
and a single permanently-failing document aborted every future run.

Pending now carries an opaque revision token; SaveVectors stamps only
if it is unchanged and returns ErrStale otherwise, leaving the document
pending. Fill treats stale documents as next-run work and consults a
new OnEncodeError hook to stamp-drop documents a model permanently
rejects. sqlitevec implements the conditional stamp behind an optional
Schema.RevisionColumn.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@wesm

wesm commented Jul 4, 2026

Copy link
Copy Markdown
Member

Comparison with msgvault's embedding foundation

For the record: this package was extracted alongside a redesign of msgvault's embedding pipeline (msgvault/internal/vector, culminating in msgvault #411, which replaced the pending_embeddings queue table with a scan-and-fill design). This comment records how the two codebases relate, what this PR adopted, and what it deliberately left behind.

What the two share

Both are built around the same core ideas:

  • Scan-and-fill instead of a queue table. Work is defined by a predicate over the source table (embed_gen IS NULL OR embed_gen <> target), and stamping embed_gen is what removes a document from the work set. Changing the model creates a new generation id, which instantly makes every document "pending" again with no explicit requeue. msgvault: internal/store/embed_gen.go + internal/vector/embed/worker.go. Here: Store.PendingForGeneration / SaveVectors and the Fill flow.
  • Generations as first-class records with a building → active → retired lifecycle, per-generation dimensions, and a fingerprint that identifies the vector space so a config change forces a rebuild rather than silently mixing spaces.
  • Chunked documents keyed by (generation, document, chunk index), with search rolling chunk hits up to one hit per document.
  • Vectors in sqlite-vec vec0 tables, one per dimension/generation, cosine distance.

What this PR adopted from msgvault after comparing (added in the latest commits)

  • Optimistic stamping. msgvault's worker reads a last_modified token with the content and stamps via compare-and-swap (SetEmbedGenIfUnchanged), so an edit landing between scan and save leaves the document pending instead of being marked embedded with stale vectors. The Store contract now mirrors this: Pending.Revision is an opaque token, SaveVectors stamps only if it is unchanged and returns ErrStale otherwise, and Fill leaves stale documents for the next run so an actively edited document cannot starve the loop. In sqlitevec this is the optional Schema.RevisionColumn.
  • Poison-document handling. msgvault classifies permanent 4xx failures and stamp-drops messages a model deterministically rejects, so one bad input cannot wedge every future run. Here that is FillOptions.OnEncodeError: returning true stamps the document with no vectors (a stamp-only save) and the fill continues; nil/false aborts, which remains the right default for transient failures.

Where the designs intentionally differ

  • This package is generic; msgvault's is message-typed. Document keys (K) and generation ids (G) are opaque type parameters here; msgvault hard-codes MessageID int64, an email-specific Filter, and SQL against its messages/message_bodies schema. The sqlitevec.Schema (caller-named table/columns) replaces msgvault's baked-in queries.
  • Cross-generation search is richer here. msgvault resolves a single generation per query (ResolveActiveForFingerprint, returning ErrIndexStale/ErrIndexBuilding). This package queries every live generation — each with its own encoder — and merges with union coverage, so mid-migration searches see both the building and active generations. Three merge strategies (normalized score, raw, RRF) are provided; msgvault has no cross-generation merge.
  • Fingerprints are structural here, versioned strings there. msgvault concatenates model:dimension:preprocess:... with hand-bumped version constants; Generation.Fingerprint() here canonicalizes the struct through JSON and hashes it, so new fields participate automatically.
  • Encoding concurrency lives in the flow here. EncodeBatched batches and parallelizes encode calls; msgvault's worker is sequential and gets concurrency from the daemon scheduler.

msgvault mechanics deliberately not ported

  • The watermark table and backstop job. msgvault's forward-scan cursor is an optimization for its very large tables and cross-database (msgvault.db / vectors.db) split; Fill here rescans the pending predicate, which is itself the backstop. A backend that needs a cursor can add one privately behind PendingForGeneration.
  • Retry/backoff and the OpenAI-compatible HTTP client. EncodeFunc owns retries because retryability is provider-specific. msgvault's embed/client.go (429/Retry-After, permanent-4xx classification) is generic, though — a candidate for a sibling vector/… package later so callers stop rewriting it.
  • Email preprocessing, hybrid FTS+ANN fusion, scoped builds, and the one-time schema migrations — all msgvault-specific.

Known gaps to follow up (not in this PR)

  • No Delete: removing a source document leaves orphan vectors, and retiring a generation does not drop its vec0 table. msgvault's Backend.Delete / retire semantics cover this.
  • Split is a fixed-window cut; msgvault's ChunkText (paragraph → sentence → word soft breaks, char offsets for highlighting) is pure string logic and could replace it nearly verbatim.
  • The generation lifecycle here records state but does not enforce it; msgvault enforces ≤1 active / ≤1 building via partial unique indexes and has an atomic activate with a coverage gate.

msgvault does not yet import this package (it uses only kit/daemon and kit/selfupdate), so this is the intended future home for the generic parts of its pipeline, not a live dependency.

@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (bf57c0d)

Returns stale vector hits from deleted or no-longer-active documents; needs filtering or cleanup before merge.

Medium

  • vector/sqlitevec/store.go:223 - QueryGeneration searches vec0 hits and joins only the managed chunk map, not the caller’s source documents table or embedding generation stamp. As a result, vectors for deleted, edited, pending, revoked, or migrated documents can still be returned. These stale hits may leak deleted content signals and can consume per-generation result limits, hiding active-only documents during migration.

    Suggested fix: Join query results back to DocsTable on IDColumn and require the source row to exist with an embedding generation matching gen, or enforce cleanup/cascade deletion of stale vectors when documents are deleted or marked pending. Add tests for deleted rows, cleared generation stamps, and active-only hits being hidden behind stale migrated hits.


Reviewers: 2 done | Synthesis: codex, 7s | Total: 5m43s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants