[AMD] improve dsr1 fp4 disagg perf on mi355x #1236
billishyahao wants to merge 73 commits into main from
Conversation
…transformers v5 Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
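The client-side tokenizer fix described above can be sketched roughly as follows. The function name `apply_server_tokenizer_fixes`, the attribute layout, and the stand-in objects are illustrative assumptions, not the PR's actual code; the real fix mirrors whatever the sglang server does at startup:

```python
from types import SimpleNamespace

def apply_server_tokenizer_fixes(tokenizer, original_backend, add_bos_token=True):
    """Mirror the server-side workaround for transformers v5: restore the
    pre_tokenizer/decoder components that v5 rebuilds incorrectly for
    LlamaTokenizerFast wrappers around non-Llama tokenizer architectures,
    and recover add_bos_token. On transformers v4 the components are
    already correct, so this is effectively a no-op there."""
    tokenizer.backend_tokenizer.pre_tokenizer = original_backend.pre_tokenizer
    tokenizer.backend_tokenizer.decoder = original_backend.decoder
    tokenizer.add_bos_token = add_bos_token
    return tokenizer

# Usage with stand-in objects (a real benchmark client would pass the HF
# tokenizer and the components preserved from the original tokenizer.json):
tok = SimpleNamespace(
    backend_tokenizer=SimpleNamespace(pre_tokenizer="rebuilt", decoder="rebuilt"),
    add_bos_token=False,
)
orig = SimpleNamespace(pre_tokenizer="byte_level", decoder="byte_level")
apply_server_tokenizer_fixes(tok, orig)
```

With the rebuilt components replaced, client and server tokenize identically, which removes the ~5x token-count inflation in the benchmarks.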
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25267403349

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268431600
Can we get a review for this patch? @functionstackx @Oseltamivir @cquil11 Sweep: 19 of 20 passed, 1 was canceled by user https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25241387090 Evals all passed https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25268431600/
functionstackx left a comment
added a comment related to your current code of "if evals: set xyz"
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25269775978

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25273191587

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25282687262

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284166545

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284187965
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
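The review is discussing gating these lines on an evals-only branch (the "if evals: set xyz" pattern mentioned above). A minimal sketch of that gating, where the `EVALS` flag name and the demo values are assumptions while the sed/unset lines come from the diff itself:

```shell
# Demo values (assumptions for illustration only):
EVALS=1
DECODE_SERVER_CONFIG="--tp 8 --ep-dispatch-algorithm fake"
export SGLANG_MORI_FP8_COMB=1

# Evals-only branch: strip the fake dispatch algorithm and the
# MORI / fp8-combine tuning that hurt accuracy, keep them for perf runs.
if [ "${EVALS:-0}" = "1" ]; then
  DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
  unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
  unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
  unset SGLANG_MORI_FP8_COMB
fi
echo "config: $DECODE_SERVER_CONFIG"
```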
@billishyahao I don't understand why we are unsetting fp8 combine for evals only but keeping it for the performance benchmarks.
It seems like the only evals-specific change should be the context length, to fit the shots, not unsetting fp8 combine.
Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
may need more debugging from @Oseltamivir
BTW, since this PR is based on the old March PR plus a switch to the upstream sglang PR, and eval is a new feature that needs more time to address: can we merge this first and then address the eval issue in a follow-up PR?
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for DeepSeek V3/R1; we are seeing 95-96% for DeepSeek V3/R1 on grade-school math.
We can potentially merge this and fix it in a follow-up PR, but I would like a couple of days of work between you @billishyahao and @Oseltamivir before we merge this.
Your local sglang bench (not using the inferencex harness) is quite low at 91%.
The current accuracy drop with fp8 combine is expected, as we have not yet introduced a quant factor to retain precision. But the huge drop from 0.915 to 0.485 is a separate issue coming from the harness.
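As an illustrative analogy for the quant-factor point above (not the kernel code): quantizing small-magnitude activations onto a fixed low-precision grid without a per-tensor scale collapses them toward zero, while applying a quant factor first preserves them. All numbers and function names here are assumptions chosen for the demo:

```python
import numpy as np

def quantize_fixed_grid(x, max_repr=448.0, levels=256):
    # No quant factor: snap values directly onto a fixed representable grid
    # (max_repr ~ fp8 e4m3 max; levels is a stand-in for the code count).
    step = 2 * max_repr / (levels - 1)
    return np.round(x / step) * step

def quantize_scaled(x, levels=256):
    # Per-tensor quant factor: rescale the data range onto the grid first,
    # quantize, then rescale back.
    scale = np.abs(x).max() / (levels // 2 - 1)
    return np.round(x / scale) * scale

# Small-magnitude data, as MoE combine inputs often are after softmax weighting.
x = np.random.default_rng(0).normal(scale=0.01, size=1000)
err_fixed = np.abs(x - quantize_fixed_grid(x)).mean()   # grid step >> |x|: everything rounds to 0
err_scaled = np.abs(x - quantize_scaled(x)).mean()      # error bounded by half a scaled step
```

Without the scale, the mean error equals the mean magnitude of the data itself; with it, the error drops by orders of magnitude, which is the precision the review says the fp8 combine path has not yet recovered.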
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355064300

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355166460

Replacement of #983.
The new patch adds the following optimization: