
[AMD] improve dsr1 fp4 disagg perf on mi355x#1236

Open
billishyahao wants to merge 73 commits into main from amd/mi355x-dsfp4-april14

Conversation

@billishyahao
Collaborator

@billishyahao billishyahao commented Apr 30, 2026

replacement of #983

This new patch adds the following optimizations:

- "Bump SGL mori image to lmsysorg/sglang-rocm"
- "Add more high tput / low latency sweep configs"
- "Enable v2 mxfp4 DSR1 0528 model"
- "Enable fp4 disp / fp8 combine feature on mori"
- "Enable Mori SDMA + two batch overlapping feature"

billishyahao and others added 30 commits March 16, 2026 08:36
Commit message (title truncated): …transformers v5

Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for
models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama
tokenizer architecture. The sglang server fixes this at startup, but the
benchmark client loads the tokenizer without these fixes, causing a ~5x
token count inflation (e.g. 7000 tokens -> 35000 tokens) and false
performance regressions in TTFT and throughput benchmarks.

Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and
add_bos_token recovery) that sglang server applies, so client and server
tokenize identically. No-op on transformers v4.
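The failure mode described above can be sketched with a toy example. This is illustrative only: the two functions below are hypothetical stand-ins, not the real Hugging Face tokenizers, and the actual fix (restoring the tokenizer's pre_tokenizer/decoder components) is not reproduced here.

```python
def server_tokenize(text):
    # Stand-in for the server's correct subword tokenization
    # (proper pre_tokenizer/decoder in place).
    return text.split()

def broken_client_tokenize(text):
    # Stand-in for a client whose pre_tokenizer was rebuilt incorrectly
    # and degenerates to per-character pieces.
    return [c for c in text if not c.isspace()]

prompt = "the quick brown fox jumps over the lazy dog"
# The broken client reports several times more tokens for the same text,
# which is the kind of inflation that skews TTFT/throughput numbers.
assert len(broken_client_tokenize(prompt)) >= 3 * len(server_tokenize(prompt))
```

When client and server tokenize identically, token counts (and thus derived throughput metrics) agree; when they diverge, the benchmark reports a regression that the server never saw.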

Made-with: Cursor
@github-actions
Contributor

github-actions Bot commented May 3, 2026


@billishyahao
Collaborator Author

Can we get a review for this patch? @functionstackx @Oseltamivir @cquil11

Sweep: 19 of 20 passed; 1 was canceled by the user.

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25241387090

Eval all passed

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25268431600/

Comment thread on benchmarks/multi_node/amd_utils/server.sh (Outdated)
Contributor

@functionstackx functionstackx left a comment


Added a comment on your current "if evals: set xyz" code.


unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
Contributor


@billishyahao same thing here


DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
Contributor


@billishyahao I don't understand why we are unsetting fp8 combine for evals only while keeping it for the performance benchmarks.

It seems like the only eval-specific change we should make is the context length (to fit the shots), not unsetting fp8 combine.

Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.

Collaborator Author


FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

may need more debugging from @Oseltamivir

BTW, since this PR is based on the old March PR plus a switch to the upstream sglang PR, and eval is a new feature that needs more time to address: can we merge this first and address the eval issue in a follow-up PR?

Collaborator Author


(image attachment)

Contributor

@functionstackx functionstackx May 5, 2026


FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for deepseekv3 R1; we are seeing 95-96% for deepseekv3 R1 on grade-school math.

Contributor


We can potentially merge this and fix it in a follow-up PR, but I would like a couple of days of work between you @billishyahao and @Oseltamivir before we merge this.

Your local sglang bench (not using the InferenceX harness) is still quite low at 91%.

Collaborator Author


The current accuracy drop with fp8 combine is expected, as we have not yet introduced a quantization factor to retain precision. But the huge drop from 0.915 to 0.485 is a separate issue in the harness.
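The point about a missing quantization factor can be illustrated with a toy uniform quantizer (this is a generic sketch, not the mori fp8-combine code path; the 7-level grid only loosely mimics a low-bit format):

```python
import numpy as np

def quantize(x, levels, scale=1.0):
    # Round x onto `levels` uniform steps covering [-scale, scale].
    step = scale / levels
    return np.clip(np.round(x / step), -levels, levels) * step

rng = np.random.default_rng(0)
x = rng.normal(scale=0.05, size=1000)  # small-magnitude activations

# Without a per-tensor scale, most values fall below the first quantization
# step and collapse to zero; with a scale matched to the data range, the
# same grid resolves them and the mean absolute error drops sharply.
err_unscaled = np.abs(quantize(x, levels=7) - x).mean()
err_scaled = np.abs(quantize(x, levels=7, scale=np.abs(x).max()) - x).mean()
assert err_scaled < err_unscaled
```

This is the usual motivation for carrying a per-tensor (or per-block) scale alongside low-precision values: the quantization grid is placed over the actual dynamic range instead of a fixed one.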

@github-actions
Contributor

github-actions Bot commented May 5, 2026

