[AMD] Add DSv4 FP4 MI355X vLLM benchmark#1282
Conversation
> # triton_unfused MoE backend consumes the FP4 expert weights from the
> # DeepSeek-V4-Pro FP4+FP8 mixed checkpoint; KV cache stays fp8.
> dsv4-fp4-mi355x-vllm:
>   image: rocm/vllm-dev:deepseek-v4-latest
@Oseltamivir can you add the specific SHA like we do for all the other images?
c898aeb#diff-1bd56add5a769a3c46eaffc11bbbccc4d8546cf7c1a0877b735dc46581ac8405R2637
> # triton_unfused MoE backend consumes the FP4 expert weights from the
> # DeepSeek-V4-Pro FP4+FP8 mixed checkpoint; KV cache stays fp8.
> dsv4-fp4-mi355x-vllm:
>   image: rocm/vllm-dev:deepseek-v4-latest
🔴 The new dsv4-fp4-mi355x-vllm entry uses image: rocm/vllm-dev:deepseek-v4-latest — the only -latest-style tag in any master config. Every other vLLM image is version-pinned (v0.16.0–v0.19.1), and the sibling dsv4-fp4-mi355x-sglang entry right above is SHA+date-pinned (rocm/sgl-dev:rocm720-mi35x-a8410de-20260502-DSv4). A mutable tag breaks the perf-changelog attribution model — upstream rebuilds will silently shift benchmark results without a corresponding config commit. Suggest pinning to a dated/SHA tag (e.g., rocm/vllm-dev:deepseek-v4-<sha-or-date>) mirroring the sibling sglang convention.
Extended reasoning...
What the bug is. The new dsv4-fp4-mi355x-vllm block at amd-master.yaml:1645 sets image: rocm/vllm-dev:deepseek-v4-latest. The trailing -latest makes this a mutable tag — the registry can re-publish the same tag pointing at a new image SHA at any time without any change to this repo.

Why this is out of pattern. Grepping every other vLLM image in amd-master.yaml shows pinned tags exclusively: vllm/vllm-openai-rocm:v0.16.0, v0.17.0, v0.18.0, v0.19.0, v0.19.1, plus rocm/atom:…atom0.1.2.post for dsv4-fp8-mi355x-vllm. More tellingly, the directly adjacent dsv4-fp4-mi355x-sglang entry uses rocm/sgl-dev:rocm720-mi35x-a8410de-20260502-DSv4, and its YAML comment explicitly says: "the SHA is encoded in the image tag, so bumping sglang is just an image tag bump here." That is the convention this PR is breaking.

Why existing code does not prevent it. Nothing in the harness validates tag immutability — the YAML is fed to the sweep runner as-is. The perf-changelog.yaml mechanism (this PR adds an entry at lines 2210–2220) attributes performance changes to config-key bumps, i.e. to git commits that change the image tag. With :deepseek-v4-latest there is no future commit when the upstream image is rebuilt; the changelog records numbers against a tag whose contents can drift.

Impact. The YAML comment claims "the image already ships the validated build" and the PR description plans a gsm8k parity check (~0.954 flexible-extract). Both claims are anchored to whatever image happens to live behind :deepseek-v4-latest at the moment of validation. After that, any upstream re-push silently invalidates both the recorded gsm8k number and the per-sweep latency numbers, with no commit/PR to attribute the regression (or improvement) to. Mid-sweep regressions become unattributable, which is precisely the failure mode the perf-changelog workflow was designed to prevent.

Step-by-step proof.
1. PR #1282 lands; :deepseek-v4-latest resolves to image SHA A. Sweep runs, perf-changelog records numbers N_A.
2. Upstream rebuilds rocm/vllm-dev:deepseek-v4-latest to fix an unrelated CVE. Tag now resolves to SHA B. No commit lands in this repo.
3. Next nightly sweep pulls SHA B, produces numbers N_B ≠ N_A. Bisecting attribution against git log finds no corresponding config-key change — the perf delta is invisible to the changelog model.
4. Contrast with the sibling dsv4-fp4-mi355x-sglang entry: bumping its image requires changing the a8410de-20260502 portion of the tag, which is a commit in this repo and a perf-changelog entry.

How to fix. Replace rocm/vllm-dev:deepseek-v4-latest with a SHA- or date-pinned tag that the recipe authors publish (e.g., rocm/vllm-dev:deepseek-v4-<sha-or-date>), mirroring the sibling sglang convention. Future image bumps then arrive as explicit config-key edits, preserving perf-changelog attribution. Severity is normal — reproducibility/attribution is a stated goal of this directory (per AGENTS.md), and there is precedent immediately above for the right pattern.
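A minimal sketch of the suggested fix, assuming the recipe authors publish a date-pinned tag (the exact tag name below is hypothetical, shown only to illustrate the sibling sglang convention):

```yaml
# Hypothetical pinned entry mirroring the adjacent dsv4-fp4-mi355x-sglang
# convention; the real tag must come from whatever the recipe authors publish.
dsv4-fp4-mi355x-vllm:
  image: rocm/vllm-dev:deepseek-v4-20260502   # date/SHA in the tag => bumps are repo commits
```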
> - "FP4 path: same DeepSeek-V4-Pro FP4+FP8 mixed checkpoint; --moe-backend triton_unfused consumes the FP4 expert weights directly. KV cache stays fp8"
> - "Recipe flags from the docs: VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_LINEAR=1, --gpu-memory-utilization 0.6, --max-num-seqs 128 (auto-grows past CONC), --max-num-batched-tokens 8192, --distributed-executor-backend mp, --async-scheduling, --enforce-eager, --tokenizer-mode/reasoning-parser/tool-call-parser deepseek_v4"
> - "Search space: TP=8, CONC 1-128 for both 1k1k and 8k1k"
> pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
🟡 The new perf-changelog.yaml entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD (line 2219), which will resolve to a 404 once merged. Every other entry in this file uses a real numeric PR id; please replace TBD with 1282 (this PR's number) before merging.
Extended reasoning...
The bug
In perf-changelog.yaml line 2219, the new entry for dsv4-fp4-mi355x-vllm is given a placeholder pr-link:

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD`

TBD is not a valid PR number, so this URL produces a 404 in any tooling or UI that follows it. Every other entry in the file (for example the immediately preceding entries on lines 2189, 2199, 2209) uses a real numeric PR id like 1256, 1065, or 1222.
Why existing checks don't catch it
The PR description itself flags this as a known TODO ("pr-link is TBD — fill in when this number is known"), so the author plans to fix it manually before merge. However, there is no CI gate that enforces this: the changelog schema in utils/matrix_logic/validation.py only validates that pr-link is a string, not that it is a valid URL or that it resolves to an existing PR. AGENTS.md documents using XXX/TBD as draft placeholders, but with no automated check the placeholder can silently slip through and become permanent in the merged changelog.
Impact
Low blast radius — the changelog is metadata, so no benchmark execution is affected. The cost is purely a broken hyperlink baked into git history, which any tool that walks the changelog (release notes generation, UI dashboards, manual auditors clicking through entries) will hit as a 404.
Step-by-step proof
- Open perf-changelog.yaml at line 2219 — the value is the literal string https://github.com/SemiAnalysisAI/InferenceX/pull/TBD.
- Compare to the most recent prior entry at line 2209: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1222 — a valid numeric id.
- Resolving /pull/TBD against any GitHub repo returns HTTP 404 because GitHub PR routes accept only integers.
- Grep utils/matrix_logic/validation.py for the pr-link schema entry — the type check is str, with no regex/URL validator. So yaml.safe_load succeeds and validate succeeds even with TBD.
- Hence merging the PR as-is permanently encodes the broken link.
Fix
Replace TBD with 1282 (this PR's number) on line 2219:

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1282`

Optionally, a follow-up could harden validation.py to require pr-link to match `https://github\.com/.+/pull/\d+$` so future placeholders fail CI.
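The suggested hardening can be sketched in a few lines. This is a hypothetical validator, not the current code in utils/matrix_logic/validation.py (which, per the review above, only checks that pr-link is a string):

```python
import re

# Anchored regex from the review suggestion: owner/repo followed by a
# numeric pull-request id. "TBD" and other placeholders fail \d+.
PR_LINK_RE = re.compile(r"^https://github\.com/.+/pull/\d+$")

def validate_pr_link(value: str) -> bool:
    """Return True only for numeric GitHub pull-request URLs."""
    return bool(PR_LINK_RE.match(value))
```

Wiring this into the changelog schema would make a lingering TBD placeholder a CI failure instead of a silent 404 in git history.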
New benchmarks/single_node/dsv4_fp4_mi355x_vllm.sh follows the recipes.vllm.ai DeepSeek-V4-Pro AMD ROCm page (vllm-project/vllm#40871 base + #41217 MLA Indexer). Uses rocm/vllm-dev:deepseek-v4-latest directly, so no runtime PR clone or pip rebuild is required, unlike the existing dsv4-fp8-mi355x-vllm overlay path.

FP4 path: same DeepSeek-V4-Pro FP4+FP8 mixed checkpoint; --moe-backend triton_unfused consumes FP4 expert weights; KV cache fp8.

Recipe flags: VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_LINEAR=1, --gpu-memory-utilization 0.6, --max-num-seqs 128 (auto-grows past CONC), --max-num-batched-tokens 8192, --distributed-executor-backend mp, --async-scheduling, --enforce-eager, deepseek_v4 tokenizer / reasoning / tool-call parsers.

Adds:
- benchmarks/single_node/dsv4_fp4_mi355x_vllm.sh
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-vllm (TP=8, conc 1-128 for 1k1k and 8k1k)
- perf-changelog.yaml entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rocm/vllm-dev:deepseek-v4-latest is a moving tag; pin to the digest we intend to validate against so reruns are reproducible. Matches the @sha256: format already used by the nvidia-master.yaml DSv4 entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
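One way to resolve the moving tag to the digest being pinned, sketched with standard Docker CLI commands (this is an illustration, not part of the repo's tooling):

```shell
# Pull the mutable tag once, then read back the content-addressed digest
# that the tag currently resolves to.
docker pull rocm/vllm-dev:deepseek-v4-latest
docker inspect --format '{{index .RepoDigests 0}}' rocm/vllm-dev:deepseek-v4-latest
# prints rocm/vllm-dev@sha256:<digest> — the immutable reference to pin in amd-master.yaml
```

Pinning the `@sha256:` digest guarantees byte-identical image contents on every rerun, even if upstream later re-publishes the tag.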
Force-pushed df168f4 to a13d6a6
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25359905080
triton_unfused is not in the allowed list for FP4/FP8 MoE (ValueError: moe_backend='triton_unfused' is not supported). Switch to --moe-backend aiter which handles MXFP4 expert weights natively on ROCm gfx950 via the AITER stack already enabled by VLLM_ROCM_USE_AITER=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25363980691

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25363984774
Summary
- Adds benchmarks/single_node/dsv4_fp4_mi355x_vllm.sh following the recipes.vllm.ai DeepSeek-V4-Pro AMD ROCm page ([New Model][ROCm] Add AMD support for DeepSeek V4, vllm-project/vllm#40871 base + #41217 MLA Indexer optimization).
- Uses rocm/vllm-dev:deepseek-v4-latest directly — no runtime PR clone / patch / pip rebuild (unlike the existing dsv4-fp8-mi355x-vllm overlay path, which is built off PR #40889 stacked on #40871).
- FP4 path: same deepseek-ai/DeepSeek-V4-Pro FP4+FP8 mixed checkpoint; --moe-backend triton_unfused consumes FP4 expert weights directly. KV cache stays fp8.
- Recipe flags (from the docs, MI355X TP=8 validated):
  - VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_LINEAR=1
  - --gpu-memory-utilization 0.6
  - --max-num-seqs 128 (auto-grows past CONC), --max-num-batched-tokens 8192
  - --distributed-executor-backend mp, --async-scheduling, --enforce-eager
  - --tokenizer-mode deepseek_v4, --tool-call-parser deepseek_v4 --enable-auto-tool-choice, --reasoning-parser deepseek_v4
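The flags above imply a server launch roughly like the following sketch. The model path and exact invocation are assumptions; the actual dsv4_fp4_mi355x_vllm.sh wrapper may wire these differently, and a later commit in this PR switches --moe-backend from triton_unfused to aiter:

```shell
# Sketch only — flags taken verbatim from the recipe list above.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --moe-backend triton_unfused \
  --gpu-memory-utilization 0.6 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --distributed-executor-backend mp \
  --async-scheduling \
  --enforce-eager \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice
```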
dsv4-fp4-mi355x-vllmin.github/configs/amd-master.yamlwith TP=8, CONC 1-128 for both 1k1k and 8k1k. Adds aperf-changelog.yamlentry (pr-link isTBD— fill in when this number is known).Test plan
- python -c "import yaml; yaml.safe_load(open('.github/configs/amd-master.yaml'))" and the same on perf-changelog.yaml (already verified locally).
- bash -n benchmarks/single_node/dsv4_fp4_mi355x_vllm.sh (already verified locally).
- wait_for_server_ready succeeds against rocm/vllm-dev:deepseek-v4-latest.
- Run with RUN_EVAL=true and confirm exact-match parity with the docs (~0.954 flexible-extract).

🤖 Generated with Claude Code