[AMD] improve dsr1 fp4 disagg perf on mi355x #1236
billishyahao wants to merge 73 commits into main from
Conversation
…transformers v5 Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
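The client-side tokenizer fix described above can be sketched roughly as follows. The function name `apply_server_tokenizer_fixes`, the attribute layout, and the stand-in objects are illustrative assumptions, not the PR's actual code; the real fix mirrors whatever the sglang server does at startup:

```python
from types import SimpleNamespace

def apply_server_tokenizer_fixes(tokenizer, original_backend, add_bos_token=True):
    """Mirror the server-side workaround for transformers v5: restore the
    pre_tokenizer/decoder components that v5 rebuilds incorrectly for
    LlamaTokenizerFast wrappers around non-Llama tokenizer architectures,
    and recover add_bos_token. On transformers v4 the components are
    already correct, so this is effectively a no-op there."""
    tokenizer.backend_tokenizer.pre_tokenizer = original_backend.pre_tokenizer
    tokenizer.backend_tokenizer.decoder = original_backend.decoder
    tokenizer.add_bos_token = add_bos_token
    return tokenizer

# Usage with stand-in objects (a real benchmark client would pass the HF
# tokenizer and the components preserved from the original tokenizer.json):
tok = SimpleNamespace(
    backend_tokenizer=SimpleNamespace(pre_tokenizer="rebuilt", decoder="rebuilt"),
    add_bos_token=False,
)
orig = SimpleNamespace(pre_tokenizer="byte_level", decoder="byte_level")
apply_server_tokenizer_fixes(tok, orig)
```

With the rebuilt components replaced, client and server tokenize identically, which removes the ~5x token-count inflation in the benchmarks.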
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25267403349

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268431600
Can we get a review for this patch? @functionstackx @Oseltamivir @cquil11 Sweep: 19 of 20 passed, 1 was canceled by user https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25241387090 Evals all passed https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25268431600/
functionstackx left a comment
added a comment related to your current code of "if evals: set xyz"
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25269775978

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25273191587

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25282687262

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284166545

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284187965
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
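The review is discussing gating these lines on an evals-only branch (the "if evals: set xyz" pattern mentioned above). A minimal sketch of that gating, where the `EVALS` flag name and the demo values are assumptions while the sed/unset lines come from the diff itself:

```shell
# Demo values (assumptions for illustration only):
EVALS=1
DECODE_SERVER_CONFIG="--tp 8 --ep-dispatch-algorithm fake"
export SGLANG_MORI_FP8_COMB=1

# Evals-only branch: strip the fake dispatch algorithm and the
# MORI / fp8-combine tuning that hurt accuracy, keep them for perf runs.
if [ "${EVALS:-0}" = "1" ]; then
  DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
  unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
  unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
  unset SGLANG_MORI_FP8_COMB
fi
echo "config: $DECODE_SERVER_CONFIG"
```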
@billishyahao I don't understand why we are unsetting fp8 combine for evals only but keeping it for the performance benchmarks.
It seems like the only evals-specific change should be the context length, to fit the shots, not unsetting fp8 combine.
Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
may need more debugging from @Oseltamivir
BTW, since this PR is based on the old March PR plus a switch to the upstream sglang PR, and eval is a new feature that needs more time to address: can we merge this first and then address the eval issue in a follow-up PR?
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for DeepSeek V3/R1; we are seeing 95-96% for DeepSeek V3/R1 on grade-school math.
We can potentially merge this and fix it in a follow-up PR, but I would like a couple of days of work between you @billishyahao and @Oseltamivir before we merge this.
Your local sglang bench (not using the inferencex harness) is quite low at 91%.
The current accuracy drop with fp8 combine is expected, as we have not yet introduced a quant factor to retain precision. But the huge drop from 0.915 to 0.485 is a separate issue coming from the harness.
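As an illustrative analogy for the quant-factor point above (not the kernel code): quantizing small-magnitude activations onto a fixed low-precision grid without a per-tensor scale collapses them toward zero, while applying a quant factor first preserves them. All numbers and function names here are assumptions chosen for the demo:

```python
import numpy as np

def quantize_fixed_grid(x, max_repr=448.0, levels=256):
    # No quant factor: snap values directly onto a fixed representable grid
    # (max_repr ~ fp8 e4m3 max; levels is a stand-in for the code count).
    step = 2 * max_repr / (levels - 1)
    return np.round(x / step) * step

def quantize_scaled(x, levels=256):
    # Per-tensor quant factor: rescale the data range onto the grid first,
    # quantize, then rescale back.
    scale = np.abs(x).max() / (levels // 2 - 1)
    return np.round(x / scale) * scale

# Small-magnitude data, as MoE combine inputs often are after softmax weighting.
x = np.random.default_rng(0).normal(scale=0.01, size=1000)
err_fixed = np.abs(x - quantize_fixed_grid(x)).mean()   # grid step >> |x|: everything rounds to 0
err_scaled = np.abs(x - quantize_scaled(x)).mean()      # error bounded by half a scaled step
```

Without the scale, the mean error equals the mean magnitude of the data itself; with it, the error drops by orders of magnitude, which is the precision the review says the fp8 combine path has not yet recovered.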
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355064300

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355166460

Replacement of #983.
The new patch adds the following optimization: