Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models (e.g. bge-m3)#90

Open
unverbraucht wants to merge 2 commits into SearchSavior:main from KIntegrated:feat/embedding-pool-dispatch

Conversation

@unverbraucht

Summary

Two related changes for supporting more embedding/reranker models on OpenArc:

Both are independent and low-risk; bundled in one PR per the maintainer's OK.

Embedding pool dispatch

Optimum_EMB.generate_embeddings previously hardcoded last_token_pool. That is correct for Qwen3-Embedding-* (a decoder model pooled at the last token) but silently wrong for the much larger family of sentence-transformers encoders:

| Model family | Correct pool | Old behavior |
| --- | --- | --- |
| Qwen3-Embedding-* | last | ✅ correct |
| BAAI/bge-* | cls | ❌ silently wrong |
| sentence-transformers/*, intfloat/multilingual-e5-* | mean | ❌ silently wrong |

The fix matches sentence-transformers' own precedence:

  1. runtime_config.pool_mode override ("cls" | "mean" | "last"). Unknown values raise at load time so typos don't fall back silently.
  2. <model_path>/1_Pooling/config.json auto-detect (pooling_mode_cls_token → cls, pooling_mode_mean_tokens → mean).
  3. Default "last" — preserves Qwen3-Embedding behavior. No-op for existing users.
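The precedence above can be sketched roughly as follows. All names here (`cls_pool`, `mean_pool`, `last_token_pool`, `resolve_pool_mode`) are illustrative, not the actual identifiers in Optimum_EMB; the point is the dispatch order and the loud rejection of unknown modes:

```python
import json
import pathlib

import numpy as np

def cls_pool(hidden, mask):
    # CLS pooling: take the first token's hidden state.
    return hidden[:, 0]

def mean_pool(hidden, mask):
    # Mean pooling: average hidden states over non-padding tokens only.
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1).clip(min=1e-9)

def last_token_pool(hidden, mask):
    # Last-token pooling: pick each sequence's final non-padding token
    # (scanning the mask from the right handles left and right padding).
    last_idx = mask.shape[1] - 1 - np.argmax(mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]

POOLS = {"cls": cls_pool, "mean": mean_pool, "last": last_token_pool}

def resolve_pool_mode(runtime_config, model_path):
    # 1. Explicit runtime_config override; typos raise instead of
    #    silently falling back.
    mode = (runtime_config or {}).get("pool_mode")
    if mode is not None:
        if mode not in POOLS:
            raise ValueError(f"unknown pool_mode {mode!r}")
        return mode
    # 2. sentence-transformers 1_Pooling/config.json auto-detect.
    cfg = pathlib.Path(model_path) / "1_Pooling" / "config.json"
    if cfg.is_file():
        st = json.loads(cfg.read_text())
        if st.get("pooling_mode_cls_token"):
            return "cls"
        if st.get("pooling_mode_mean_tokens"):
            return "mean"
    # 3. Default: preserve Qwen3-Embedding behavior.
    return "last"
```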

Verification

  • 14 unit tests covering each pool, the autodetect fallback chain, the runtime override, and unknown-mode rejection.
  • Integration test loads the converted bge-m3 OV IR, asserts pool_mode == "cls" and a 1024-dim unit-normed vector.
  • End-to-end against the PyTorch reference: cos(ov, pt) > 0.999 for bge-m3.
  • Live serving on GPU via /v1/embeddings returning correct vectors for bge-m3 and /v1/rerank for qwen3-4b-reranker.
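The cos(ov, pt) check in the last two bullets is plain cosine similarity between the OpenVINO and PyTorch embeddings. A minimal sketch with made-up toy vectors (the real bge-m3 vectors are 1024-dim and unit-normed, so for those cosine reduces to a dot product):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the OV and PyTorch embeddings of the same input.
ov = np.array([0.6, 0.8, 0.0])
pt = np.array([0.6, 0.8, 1e-4])
assert cosine(ov, pt) > 0.999
```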

Qwen3-Reranker conversion docs

docs/openvino_qwen3.md walks through converting Qwen/Qwen3-Reranker-{0.6B,4B} to OpenVINO IR with INT8 weight compression via the OVModelForCausalLM Python API, which is the only path that produced a usable model end-to-end. The CLI path (optimum-cli export openvino) currently exits 0 while writing a 0-byte openvino_model.xml and a stub .bin; happy to file that separately upstream if useful.
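A condensed sketch of that Python-API path (see docs/openvino_qwen3.md for the full walkthrough; the output directory name is illustrative, and running this requires optimum[openvino] plus a multi-GB model download, so it is not meant to run as-is):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-Reranker-0.6B"    # or Qwen/Qwen3-Reranker-4B
out_dir = "qwen3-reranker-0.6b-ov-int8"  # illustrative output path

# export=True converts the checkpoint to OpenVINO IR on load;
# the quantization config applies INT8 weight compression.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```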

Test plan

  • pytest src/tests/test_optimum_emb_unit.py — all 14 pass
  • pytest src/tests/test_optimum_emb_integration.py — passes when bge-m3 / Qwen3-Embedding OV IRs are present, skips cleanly otherwise
  • openarc serve with qwen3-4b-reranker and bge-m3 registered, exercised via /v1/embeddings and /v1/rerank against a live GPU
  • Cosine match vs PyTorch reference on bge-m3 (>0.999)
  • No-op confirmation on existing Qwen3-Embedding-* deployments — requires a maintainer with that exact model loaded; default path is unchanged so behavior should be identical.

unverbraucht and others added 2 commits April 24, 2026 21:40
Covers the working Python-API path (optimum.intel.OVModelForCausalLM)
after hitting a silent-truncate bug in optimum-cli export openvino
for the Qwen3-Reranker-4B + int8 path. Documents prerequisites,
step-by-step conversion, verification, and how to wire the resulting
IR into openarc_config.json as a rerank model under the optimum engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optimum_EMB previously always applied last_token_pool to the encoder
output, which is correct for Qwen3-Embedding but wrong for encoder-style
sentence-transformers models (CLS pooling) or mean-pooled ones. Load now
inspects 1_Pooling/config.json and picks cls / mean / last accordingly,
defaulting to last when the file is absent so existing Qwen3-Embedding
deployments keep their behavior.

runtime_config may set "pool_mode" to pin the choice explicitly and
protect against upgrade regressions on models whose shipped ST config
would otherwise change pooling:

  "runtime_config": {"pool_mode": "last"}

Unknown values raise ValueError on load rather than silently falling
through to last-token.

Tests: 14 unit tests cover each pool fn, the auto-detect ladder, the
runtime-config override and its validation. Two integration tests (added
to the existing bge-m3-local-path pattern) load a real bge-m3 IR and
verify the CLS auto-detect + override behavior end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
