Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models (e.g. bge-m3)#90

Open
unverbraucht wants to merge 2 commits into SearchSavior:main from KIntegrated:feat/embedding-pool-dispatch

Conversation

@unverbraucht

Summary

Two related changes for supporting more embedding/reranker models on OpenArc:

Both are independent and low-risk; bundled in one PR per the maintainer's OK.

Embedding pool dispatch

Optimum_EMB.generate_embeddings previously hardcoded last_token_pool. That is correct for Qwen3-Embedding-* (a decoder model pooled at the last token) but silently wrong for the much larger family of sentence-transformers encoders:

| Model family | Correct pool | Old behavior |
| --- | --- | --- |
| Qwen3-Embedding-* | last | ✅ correct |
| BAAI/bge-* | cls | ❌ silently wrong |
| sentence-transformers/*, intfloat/multilingual-e5-* | mean | ❌ silently wrong |

The fix matches sentence-transformers' own precedence:

  1. runtime_config.pool_mode override ("cls" | "mean" | "last"). Unknown values raise at load time so typos don't fall back silently.
  2. <model_path>/1_Pooling/config.json auto-detect (pooling_mode_cls_token → cls, pooling_mode_mean_tokens → mean).
  3. Default "last" — preserves Qwen3-Embedding behavior. No-op for existing users.
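The precedence above can be sketched roughly as follows. All names here (`cls_pool`, `mean_pool`, `last_token_pool`, `resolve_pool_mode`) are illustrative, not the actual identifiers in Optimum_EMB; the point is the dispatch order and the loud rejection of unknown modes:

```python
import json
import pathlib

import numpy as np

def cls_pool(hidden, mask):
    # CLS pooling: take the first token's hidden state.
    return hidden[:, 0]

def mean_pool(hidden, mask):
    # Mean pooling: average hidden states over non-padding tokens only.
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1).clip(min=1e-9)

def last_token_pool(hidden, mask):
    # Last-token pooling: pick each sequence's final non-padding token
    # (scanning the mask from the right handles left and right padding).
    last_idx = mask.shape[1] - 1 - np.argmax(mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]

POOLS = {"cls": cls_pool, "mean": mean_pool, "last": last_token_pool}

def resolve_pool_mode(runtime_config, model_path):
    # 1. Explicit runtime_config override; typos raise instead of
    #    silently falling back.
    mode = (runtime_config or {}).get("pool_mode")
    if mode is not None:
        if mode not in POOLS:
            raise ValueError(f"unknown pool_mode {mode!r}")
        return mode
    # 2. sentence-transformers 1_Pooling/config.json auto-detect.
    cfg = pathlib.Path(model_path) / "1_Pooling" / "config.json"
    if cfg.is_file():
        st = json.loads(cfg.read_text())
        if st.get("pooling_mode_cls_token"):
            return "cls"
        if st.get("pooling_mode_mean_tokens"):
            return "mean"
    # 3. Default: preserve Qwen3-Embedding behavior.
    return "last"
```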

Verification

  • 14 unit tests covering each pool, the autodetect fallback chain, the runtime override, and unknown-mode rejection.
  • Integration test loads the converted bge-m3 OV IR, asserts pool_mode == "cls" and a 1024-dim unit-normed vector.
  • End-to-end against the PyTorch reference: cos(ov, pt) > 0.999 for bge-m3.
  • Live serving on GPU via /v1/embeddings returning correct vectors for bge-m3 and /v1/rerank for qwen3-4b-reranker.
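The cos(ov, pt) check in the last two bullets is plain cosine similarity between the OpenVINO and PyTorch embeddings. A minimal sketch with made-up toy vectors (the real bge-m3 vectors are 1024-dim and unit-normed, so for those cosine reduces to a dot product):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the OV and PyTorch embeddings of the same input.
ov = np.array([0.6, 0.8, 0.0])
pt = np.array([0.6, 0.8, 1e-4])
assert cosine(ov, pt) > 0.999
```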

Qwen3-Reranker conversion docs

docs/openvino_qwen3.md walks through converting Qwen/Qwen3-Reranker-{0.6B,4B} to OpenVINO IR with INT8 weight compression via the OVModelForCausalLM Python API, which is the only path that produced a usable model end-to-end. The CLI path (optimum-cli export openvino) currently exits 0 while writing a 0-byte openvino_model.xml and a stub .bin; happy to file that separately upstream if useful.
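A condensed sketch of that Python-API path (see docs/openvino_qwen3.md for the full walkthrough; the output directory name is illustrative, and running this requires optimum[openvino] plus a multi-GB model download, so it is not meant to run as-is):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-Reranker-0.6B"    # or Qwen/Qwen3-Reranker-4B
out_dir = "qwen3-reranker-0.6b-ov-int8"  # illustrative output path

# export=True converts the checkpoint to OpenVINO IR on load;
# the quantization config applies INT8 weight compression.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```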

Test plan

  • pytest src/tests/test_optimum_emb_unit.py — all 14 pass
  • pytest src/tests/test_optimum_emb_integration.py — passes when bge-m3 / Qwen3-Embedding OV IRs are present, skips cleanly otherwise
  • openarc serve with qwen3-4b-reranker and bge-m3 registered, exercised via /v1/embeddings and /v1/rerank against a live GPU
  • Cosine match vs PyTorch reference on bge-m3 (>0.999)
  • No-op confirmation on existing Qwen3-Embedding-* deployments — requires a maintainer with that exact model loaded; default path is unchanged so behavior should be identical.

unverbraucht and others added 2 commits April 24, 2026 21:40
Covers the working Python-API path (optimum.intel.OVModelForCausalLM)
after hitting a silent-truncate bug in optimum-cli export openvino
for the Qwen3-Reranker-4B + int8 path. Documents prerequisites,
step-by-step conversion, verification, and how to wire the resulting
IR into openarc_config.json as a rerank model under the optimum engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optimum_EMB previously always applied last_token_pool to the encoder
output, which is correct for Qwen3-Embedding but wrong for encoder-style
sentence-transformers models (CLS pooling) or mean-pooled ones. Load now
inspects 1_Pooling/config.json and picks cls / mean / last accordingly,
defaulting to last when the file is absent so existing Qwen3-Embedding
deployments keep their behavior.

runtime_config may set "pool_mode" to pin the choice explicitly and
protect against upgrade regressions on models whose shipped ST config
would otherwise change pooling:

  "runtime_config": {"pool_mode": "last"}

Unknown values raise ValueError on load rather than silently falling
through to last-token.

Tests: 14 unit tests cover each pool fn, the auto-detect ladder, the
runtime-config override and its validation. Two integration tests (added
to the existing bge-m3-local-path pattern) load a real bge-m3 IR and
verify the CLS auto-detect + override behavior end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
