issue/388 [BugFix](basic_llm_processor): prevent duplicate BOS token in Llama-3/3.1 chat by rubik-hua · Pull Request #389 · InfiniTensor/InfiniLM

rubik-hua · 2026-05-20T09:56:02Z

When using model.chat() with Llama-3/3.1 models, the framework inadvertently puts two <|begin_of_text|> (BOS, token ID 128000) tokens at the start of prompt_token_ids. The extra token shifts the RoPE position of every following token by 1, causing the greedy decoding output to diverge significantly from Hugging Face.

Root cause:
The Llama-3/3.1 chat template explicitly includes <|begin_of_text|> at the start of the rendered string. Later, BasicLLMProcessor.__call__ passes this string to self.tokenizer(prompt), which defaults to add_special_tokens=True. Since LlamaTokenizerFast initializes with add_bos_token=True by default, the tokenizer automatically prepends a second BOS token via its Rust backend PostProcessor.
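The mechanism can be reproduced without model weights. The sketch below uses a hypothetical `ToyTokenizer`, a stand-in rather than the real Hugging Face tokenizer, that mimics only the two behaviours described above: a literal `<|begin_of_text|>` in the text maps to ID 128000, and `add_special_tokens=True` prepends another BOS. All other token IDs are arbitrary placeholders.

```python
BOS_TEXT = "<|begin_of_text|>"
BOS_ID = 128000  # BOS token ID stated in this PR


class ToyTokenizer:
    """Hypothetical stand-in that mimics only the BOS handling described above."""

    def __call__(self, text, add_special_tokens=True):
        ids = []
        # A literal BOS marker at the start of the text is mapped to the BOS ID.
        if text.startswith(BOS_TEXT):
            ids.append(BOS_ID)
            text = text[len(BOS_TEXT):]
        # Placeholder IDs for the remaining text: one arbitrary ID per word.
        ids.extend(range(1, len(text.split()) + 1))
        # Post-processing step: prepend BOS when special tokens are requested.
        if add_special_tokens:
            ids.insert(0, BOS_ID)
        return {"input_ids": ids}


# The chat template has already rendered BOS into the prompt string.
rendered_prompt = BOS_TEXT + "user says hello"
tok = ToyTokenizer()

default_ids = tok(rendered_prompt)["input_ids"]
fixed_ids = tok(rendered_prompt, add_special_tokens=False)["input_ids"]

print(default_ids)  # [128000, 128000, 1, 2, 3]
print(fixed_ids)    # [128000, 1, 2, 3]
```

With the default call, every real token sits one position later than intended, which is the RoPE shift the description refers to.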

Fix:
Explicitly pass add_special_tokens=False to the tokenizer calls in BasicLLMProcessor.__call__. Since the chat template is already responsible for adding necessary special tokens, the tokenizer should only perform pure text-to-ID mapping.
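A minimal sketch of the intended call pattern, assuming a Hugging Face style tokenizer whose `__call__` accepts `add_special_tokens` and returns a mapping with `input_ids`. The helper names `encode_rendered_prompt` and `count_leading_bos` are hypothetical and are not part of InfiniLM; the actual change lives inside `BasicLLMProcessor.__call__`.

```python
def count_leading_bos(token_ids, bos_token_id):
    """Return how many consecutive BOS tokens start the sequence."""
    count = 0
    for token_id in token_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


def encode_rendered_prompt(tokenizer, rendered_prompt, bos_token_id):
    """Tokenize a chat-template-rendered prompt without adding special tokens again."""
    # The template already inserted BOS, so ask for plain text-to-ID mapping.
    token_ids = tokenizer(rendered_prompt, add_special_tokens=False)["input_ids"]
    # Guard against a regression: more than one leading BOS is the bug from this PR.
    if count_leading_bos(token_ids, bos_token_id) > 1:
        raise ValueError("prompt starts with more than one BOS token")
    return token_ids
```

The guard is optional. It is included here only to show how a duplicate BOS could be caught early instead of surfacing later as a mismatch in decoding output.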

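Before the fix, input_ids contained two BOS tokens, which did not appear to affect inference in practice. The problem shows up when using greedy decoding to validate RotaryEmbedding: the usual approach is to compare the output tokens and logprobs against native HF output, and the double BOS makes the outputs inherently inconsistent. That is why this hidden bug was worth fixing.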
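The token-level comparison against HF output can be sketched as below. `first_divergence` is a hypothetical helper, not an InfiniLM or Hugging Face function, and the token IDs in the example are illustrative placeholders rather than real model output.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if the sequences match."""
    for index, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return index
    # One sequence may be a strict prefix of the other.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None


# Illustrative token IDs only, not real model output.
hf_output = [9906, 11, 1268, 649]
framework_output = [9906, 11, 1268, 649]
print(first_divergence(hf_output, framework_output))  # None
```

A return value of `None` means the two greedy outputs agree token for token; any integer points at the first position worth inspecting.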

After the fix, the double BOS is gone.

Re-running inference on the existing models is enough to verify that the original inference logic is unaffected.


rubik-hua requested a review from a team May 20, 2026 09:56

rubik-hua force-pushed the double_bos branch from 04545d9 to e0110e1 Compare May 20, 2026 10:11

rubik-hua wants to merge 1 commit into InfiniTensor:main from rubik-hua:double_bos
