Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models (e.g. bge-m3) #90
Open
unverbraucht wants to merge 2 commits into SearchSavior:main
Conversation
Covers the working Python-API path (optimum.intel.OVModelForCausalLM) after hitting a silent-truncate bug in optimum-cli export openvino for the Qwen3-Reranker-4B + int8 path. Documents prerequisites, step-by-step conversion, verification, and how to wire the resulting IR into openarc_config.json as a rerank model under the optimum engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
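For orientation, a minimal sketch of the Python-API path the doc describes, assuming default export settings; the output directory name is illustrative and the full, tested steps live in docs/openvino_qwen3.md.

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-Reranker-4B"
out_dir = "qwen3-reranker-4b-ov-int8"  # illustrative output directory

# export=True converts the PyTorch checkpoint to OpenVINO IR on load;
# load_in_8bit=True applies INT8 weight-only compression during export.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```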
Optimum_EMB previously always applied last_token_pool to the encoder
output, which is correct for Qwen3-Embedding but wrong for encoder-style
sentence-transformers models (CLS pooling) or mean-pooled ones. Load now
inspects 1_Pooling/config.json and picks cls / mean / last accordingly,
defaulting to last when the file is absent so existing Qwen3-Embedding
deployments keep their behavior.
runtime_config may set "pool_mode" to pin the choice explicitly and
protect against upgrade regressions on models whose shipped ST config
would otherwise change pooling:
"runtime_config": {"pool_mode": "last"}
Unknown values raise ValueError on load rather than silently falling
through to last-token.
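Roughly, the detection ladder looks like the following sketch; resolve_pool_mode and the variable names are illustrative, not the exact code in Optimum_EMB.

```python
import json
from pathlib import Path

VALID_POOL_MODES = {"cls", "mean", "last"}

def resolve_pool_mode(model_path: str, runtime_config: dict | None) -> str:
    # 1) Explicit runtime_config override wins; unknown values fail loudly.
    override = (runtime_config or {}).get("pool_mode")
    if override is not None:
        if override not in VALID_POOL_MODES:
            raise ValueError(f"Unknown pool_mode: {override!r}")
        return override

    # 2) Auto-detect from the sentence-transformers pooling config, if shipped.
    cfg_path = Path(model_path) / "1_Pooling" / "config.json"
    if cfg_path.is_file():
        cfg = json.loads(cfg_path.read_text())
        if cfg.get("pooling_mode_cls_token"):
            return "cls"
        if cfg.get("pooling_mode_mean_tokens"):
            return "mean"

    # 3) Default keeps existing Qwen3-Embedding (last-token) behavior.
    return "last"
```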
Tests: 14 unit tests cover each pool fn, the auto-detect ladder, the
runtime-config override and its validation. Two integration tests (added
to the existing bge-m3-local-path pattern) load a real bge-m3 IR and
verify the CLS auto-detect + override behavior end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
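As an illustration of the style of unit coverage described above (not the actual test file), a self-contained mean-pooling check might look like this; mean_pool is a hypothetical helper standing in for the real one.

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Average hidden states over non-pad positions only.
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def test_mean_pool_ignores_padding():
    hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]]])  # last position is pad
    mask = np.array([[1, 1, 0]])
    np.testing.assert_allclose(mean_pool(hidden, mask), [[2.0, 2.0]])
```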
Summary
Two related changes for using more embedding/reranker models on OpenArc:
- feat(embed): dispatch pooling mode from sentence-transformers metadata so encoder-style embedders (bge-m3, e5, MiniLM, etc.) return correct vectors instead of the hidden state at the final non-pad position. Closes #89 (Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models, e.g. bge-m3).
- docs: add a working OpenVINO IR conversion recipe for Qwen3-Reranker that documents the Python-API path (the `optimum-cli export openvino` route silently truncates the XML to 0 bytes for this model).

Both are independent and low-risk; bundling per maintainer OK.
Embedding pool dispatch
`Optimum_EMB.generate_embeddings` previously hardcoded `last_token_pool`. That's right for `Qwen3-Embedding-*` (decoder, last-token) but silently wrong for the much larger family of sentence-transformers encoders: `BAAI/bge-*`, `sentence-transformers/*`, `intfloat/multilingual-e5-*`. The fix matches sentence-transformers' own precedence:
1. `runtime_config.pool_mode` override (`"cls"` | `"mean"` | `"last"`). Unknown values raise at load time so typos don't fall back silently.
2. `<model_path>/1_Pooling/config.json` auto-detect (`pooling_mode_cls_token` → cls, `pooling_mode_mean_tokens` → mean).
3. Default `"last"`: preserves Qwen3-Embedding behavior. No-op for existing users.
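For concreteness, a minimal numpy sketch of what the selected pooling modes compute over the encoder output; helper names are illustrative, not the exact ones in Optimum_EMB, and right-padding is assumed for the last-token case.

```python
import numpy as np

def cls_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Encoder-style models (bge-m3, e5, MiniLM): hidden state of the first token.
    return hidden[:, 0]

def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Decoder-style models (Qwen3-Embedding): hidden state at the final non-pad
    # position of each row (assumes right padding).
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]
```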
Verification

- bge-m3 loads with `pool_mode == "cls"` and produces a 1024-dim unit-normed vector.
- `cos(ov, pt) > 0.999` for bge-m3 (OpenVINO IR vs. the PyTorch reference).
- `/v1/embeddings` returns correct vectors for bge-m3 and `/v1/rerank` works for `qwen3-4b-reranker`.
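The cosine check above can be reproduced with a sketch like the following, assuming bge-m3 is exported on the fly and compared against a sentence-transformers reference; OpenArc's actual loading path differs.

```python
import numpy as np
from optimum.intel import OVModelForFeatureExtraction
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

text = "OpenVINO and PyTorch should agree on this sentence."

# PyTorch reference: sentence-transformers applies bge-m3's own CLS pooling + L2 norm.
pt_vec = SentenceTransformer("BAAI/bge-m3").encode(text, normalize_embeddings=True)

# OpenVINO path: raw encoder output, CLS pooling and normalization done by hand.
tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ov_model = OVModelForFeatureExtraction.from_pretrained("BAAI/bge-m3", export=True)
out = ov_model(**tok(text, return_tensors="pt"))
cls = np.asarray(out.last_hidden_state)[0, 0]  # hidden state of the CLS token
ov_vec = cls / np.linalg.norm(cls)

print(float(np.dot(ov_vec, pt_vec)))  # expect > 0.999
```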
Qwen3-Reranker conversion docs

`docs/openvino_qwen3.md` walks through converting `Qwen/Qwen3-Reranker-{0.6B,4B}` to OpenVINO IR with INT8 weight compression via the `OVModelForCausalLM` Python API, which is the only path that produced a usable model end-to-end. The CLI path (`optimum-cli export openvino`) currently exits 0 while writing a 0-byte `openvino_model.xml` and a stub `.bin`; happy to file that separately upstream if useful.
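A quick smoke test of the exported IR, guarding against the 0-byte-XML failure mode, could look like this; the directory name is the illustrative one from the export sketch above and the prompt is arbitrary.

```python
from pathlib import Path

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

ir_dir = Path("qwen3-reranker-4b-ov-int8")  # hypothetical output dir from the recipe
xml = ir_dir / "openvino_model.xml"
assert xml.exists() and xml.stat().st_size > 0, "IR XML is missing or empty"

tok = AutoTokenizer.from_pretrained(ir_dir)
model = OVModelForCausalLM.from_pretrained(ir_dir)  # loads the IR, no re-export
inputs = tok("query: is the sky blue? document: the sky is blue.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, seq_len, vocab_size): a forward pass succeeds
```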
Test plan

- `pytest src/tests/test_optimum_emb_unit.py`: all 14 pass
- `pytest src/tests/test_optimum_emb_integration.py`: passes when bge-m3 / Qwen3-Embedding OV IRs are present, skips cleanly otherwise
- `openarc serve` with `qwen3-4b-reranker` and `bge-m3` registered, exercised via `/v1/embeddings` and `/v1/rerank` against a live GPU
- `Qwen3-Embedding-*` deployments: requires a maintainer with that exact model loaded; the default path is unchanged so behavior should be identical.
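For completeness, one way the `/v1/embeddings` leg of this plan can be poked manually, assuming OpenArc exposes an OpenAI-compatible embeddings schema on localhost:8000; host, port, and the exact request shape are assumptions, and `/v1/rerank` is not shown because its schema is OpenArc-specific.

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",  # placeholder host/port
    json={"model": "bge-m3", "input": "hello openarc"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]  # assumed OpenAI-style response shape
print(len(embedding))  # expect 1024 for bge-m3
```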