[WIP][NV] Qwen3.5 fp4 b200 trt #1280
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
2 similar comments
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349801570

@claude add PR description

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349802122
Claude finished @hshrivastava-droid's task in 45s (Adding PR Description)
PR description has been added with a summary of the changes, model/image/hardware details, benchmark search space tables for both 1k/1k and 8k/1k scenarios, and a per-file changelog.
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" | ||
| fi |
🔴 EVAL_ONLY only updates `MAX_MODEL_LEN`, not `MAX_NUM_TOKENS` — it diverges from every sibling TRT script (dsr1/dsv4/gptoss/h200 variants), which all set both to `EVAL_MAX_MODEL_LEN` inside the EVAL_ONLY block. As a secondary issue, even if `MAX_NUM_TOKENS` were updated here, the YAML on lines 80-105 has already been written with the original value before the EVAL_ONLY block at line 127, so the on-disk config would still hold the stale `max_num_tokens`. Suggested fix: add `MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN"` inside the EVAL_ONLY block AND move that block above the `EXTRA_CONFIG_FILE` write (or rewrite the YAML afterwards).
Extended reasoning...
What is wrong
In `benchmarks/single_node/qwen3.5_fp4_b200_trt.sh` lines 127-130:

```bash
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi
```

Only `MAX_MODEL_LEN` is overridden. Every comparable TRT-LLM script in the repo overrides both:
| Script | MAX_MODEL_LEN | MAX_NUM_TOKENS |
|---|---|---|
| dsr1_fp4_b200_trt.sh (80-84) | yes | yes |
| dsv4_fp4_b300_trt.sh (99-101) | yes | yes |
| gptoss_fp4_b200_trt.sh (81-85) | yes | yes |
| gptoss_fp4_h200_trt.sh (51-53) | yes | yes |
| dsr1_fp8_h200_trt.sh (68-70) | yes | yes |
| qwen3.5_fp4_b200_trt.sh (127-130) | yes | no |
There is also an ordering bug unique to this script. The YAML at lines 82-105 already embeds `max_num_tokens: $MAX_NUM_TOKENS` before the EVAL_ONLY block runs. The sibling scripts that share this pattern (`dsr1_fp4_b200_trt.sh`, `gptoss_fp4_b200_trt.sh`) deliberately do not put `max_num_tokens` in the YAML — they only pass it via the `--max_num_tokens` CLI flag, so a late variable update is sufficient. Qwen does both, so even a one-line fix to update the variable would leave the YAML on disk holding the stale value.
Step-by-step proof — EVAL_ONLY=true, ISL=1024, OSL=1024
- Lines 57-61: `MAX_NUM_TOKENS=16384`.
- Line 84: YAML written with `max_num_tokens: 16384`.
- Line 128: `setup_eval_context` computes `EVAL_MAX_MODEL_LEN = 1024+1024+256 = 2304` (see `benchmark_lib.sh:674-676`).
- Line 129: `MAX_MODEL_LEN=2304`; `MAX_NUM_TOKENS` stays at `16384`.
- Lines 165-166: server launched with `--max_seq_len=2304 --max_num_tokens=16384`.
For ISL=8192 it is even more pronounced: `max_seq_len=9472`, `max_num_tokens=33792`. Compare against `gptoss_fp4_b200_trt.sh`, where the analogous flow ends with `--max_seq_len=2304 --max_num_tokens=2304` because line 84 there also clamps `MAX_NUM_TOKENS`.
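As a quick check of those figures, the eval-context arithmetic cited from `benchmark_lib.sh:674-676` works out as follows (a standalone sketch; the `ISL`/`OSL` variable names here are illustrative, not the script's actual identifiers):

```bash
# Re-derive the eval max_seq_len values quoted above.
# The +256 margin is the one attributed to benchmark_lib.sh:674-676.
ISL=1024; OSL=1024
echo $((ISL + OSL + 256))   # 2304 -> eval max_seq_len for the 1k/1k scenario
ISL=8192
echo $((ISL + OSL + 256))   # 9472 -> eval max_seq_len for the 8k/1k scenario
```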
Addressing the refutation
The refuter argues that max_num_tokens is batch-level (not per-sequence) so a value larger than max_seq_len is a valid TRT-LLM config and not wasteful. That is technically true for steady-state correctness — the server will start and lm_eval will run. The refutation is right that this is not a crash. However:
- The cuda-graph capture set is sized off this value. Lines 100-102 declare `cuda_graph_config.batch_sizes` up to 512, and TRT-LLM caps cuda-graph capture by `max_num_tokens`. Leaving `max_num_tokens=33792` while `max_seq_len=9472` causes EVAL_ONLY startup to capture graphs sized for the full benchmark scenario instead of the eval scenario, which is exactly what every other sibling script is trying to avoid.
- `gptoss_fp4_b200_trt.sh` is a direct precedent, not just `dsr1`. The refuter dismisses `dsr1` because its `MAX_NUM_TOKENS` is computed dynamically from CONC+ISL and so "makes sense" to override. But `gptoss_fp4_b200_trt.sh:79` uses a hand-tuned static value (`MAX_NUM_TOKENS=20000`) just like qwen3.5, and still clamps it down in the EVAL_ONLY block (`gptoss_fp4_b200_trt.sh:84`). That removes the refuter's main escape hatch: the pattern applies regardless of how the original value was derived.
- "YAML/CLI consistency is preserved" is technically true (both stay 16384/33792), but it misses the goal of EVAL_ONLY mode: shrink the server to the eval workload. Both values are wrong relative to that goal, so the fact that they are wrong in the same way is not a defense.
Severity
Marking normal rather than nit because (a) the pattern is unambiguous across 5+ sibling TRT scripts, (b) the YAML-vs-CLI ordering issue is a separate, qwen3.5-specific bug that any naive one-liner fix would still leave behind, and (c) the wasted cuda-graph capture meaningfully extends EVAL_ONLY startup. It does not corrupt results, however, so it is on the lower end of normal.
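A minimal sketch of the suggested fix follows. It is hedged: the variable and helper names (`EXTRA_CONFIG_FILE`, `EVAL_ONLY`, `EVAL_MAX_MODEL_LEN`, `setup_eval_context`) are taken from the discussion above, and the YAML fragment is illustrative rather than the script's exact extra-config contents.

```bash
# Clamp BOTH limits for eval, and do it BEFORE the extra-config YAML is written,
# so the on-disk max_num_tokens matches the --max_num_tokens CLI flag.
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
    MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN"   # the override missing from the current script
fi

# Only now generate the extra-config file that embeds max_num_tokens
# (fragment shown for illustration; the real YAML carries more keys).
cat > "$EXTRA_CONFIG_FILE" <<EOF
max_num_tokens: ${MAX_NUM_TOKENS}
EOF
```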
| - "Model: nvidia/Qwen3.5-397B-A17B-NVFP4" | ||
| - "1k1k: TP-only (tp4 conc 4-16, tp8 conc 4), TEP (tp4ep4 conc 16-256), DEP (tp4ep4 dp-attn conc 1024, tp8ep8 dp-attn conc 512-1024)" | ||
| - "8k1k: TP-only (tp2 conc 4-32, tp4 conc 4-8, tp8 conc 4), TEP (tp2ep2 conc 64, tp4ep4 conc 16), DEP (tp4ep4 dp-attn conc 256-1024, tp8ep8 dp-attn conc 512-1024)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX |
🟡 `perf-changelog.yaml` line 2219 has `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX`, a placeholder that should be replaced with the actual PR number (1280) before un-WIPing this PR. All other recent entries in the file (e.g. 1222, 1065, 1256) use real PR numbers, and `process_changelog.py` enforces append-only on this file, so a stray `XXX` would persist permanently as a broken link in the changelog.
Extended reasoning...
What the bug is
The new entry appended to `perf-changelog.yaml` (lines 2210-2219 in the diff) ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

The literal string `XXX` is a placeholder rather than the real PR number. Every other recently-added entry in the file uses a real number — the immediately preceding entries point at pull/1222, pull/1065, and pull/1256, matching the merge commits visible in git log (c898aeb, a68d253, 0f630e1). For this PR the correct value is 1280.
Why this matters even though the PR is marked [WIP]
The PR title carries [WIP] which suggests the author intends to clean this up before merge. However, perf-changelog.yaml is governed by process_changelog.py, which enforces append-only semantics: existing entries cannot be edited or removed once committed. That means if the PR is merged with pull/XXX, the broken link is locked into the changelog history — there is no clean way to fix it after the fact without bending the very rule the file exists to enforce.
Impact
The link is purely metadata, so this does not affect benchmark execution, scheduling, or correctness of the new qwen3.5-fp4-b200-trt config. The impact is limited to documentation/auditability: anyone clicking through the changelog to find the PR that introduced this benchmark gets a 404 (/pull/XXX is not a valid GitHub PR URL), and the broken entry persists indefinitely because of the append-only rule.
Step-by-step proof
- Open `perf-changelog.yaml` and look at lines 2210-2219 in the diff: the new entry's last line reads `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX`.
- Compare with the three preceding entries in the file (visible in the surrounding context): they end with `pull/1222`, `pull/1065`, `pull/1256` respectively — real, dereferenceable PR numbers.
- From the PR metadata, this PR's number is 1280.
- Visiting `https://github.com/SemiAnalysisAI/InferenceX/pull/XXX` resolves to a 404 since `XXX` is not a numeric id.
- `process_changelog.py` rejects deletions/modifications to `perf-changelog.yaml` entries, so once merged this line cannot be cleanly amended.
Suggested fix
Before un-WIPing, change the last line of the new entry to:
```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1280
```

That small edit is exactly the kind of last-mile cleanup the [WIP] tag implies, but it's worth flagging explicitly so it doesn't slip through review.
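For completeness, a one-liner that would make that swap (an illustration only; it assumes `pull/XXX` occurs nowhere else in the file and that the command is run from the repo root):

```bash
# Replace the placeholder PR number in the new changelog entry (GNU sed in-place edit).
sed -i 's#pull/XXX#pull/1280#' perf-changelog.yaml
```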
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349854643
Summary
Add TensorRT-LLM benchmark configuration for Qwen3.5-397B-A17B (FP4) on NVIDIA B200 GPUs.
Details
- Model: `nvidia/Qwen3.5-397B-A17B-NVFP4`
- Image: `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12`
- Served via TensorRT-LLM (`trtllm-serve`)

Benchmark Search Space
1k/1k (ISL=1024, OSL=1024):

| Mode | Parallelism | Concurrency |
|---|---|---|
| TP-only | tp4 | 4-16 |
| TP-only | tp8 | 4 |
| TEP | tp4ep4 | 16-256 |
| DEP (dp-attn) | tp4ep4 | 1024 |
| DEP (dp-attn) | tp8ep8 | 512-1024 |

8k/1k (ISL=8192, OSL=1024):

| Mode | Parallelism | Concurrency |
|---|---|---|
| TP-only | tp2 | 4-32 |
| TP-only | tp4 | 4-8 |
| TP-only | tp8 | 4 |
| TEP | tp2ep2 | 64 |
| TEP | tp4ep4 | 16 |
| DEP (dp-attn) | tp4ep4 | 256-1024 |
| DEP (dp-attn) | tp8ep8 | 512-1024 |
Changes
- `benchmarks/single_node/qwen3.5_fp4_b200_trt.sh` — New benchmark script using `trtllm-serve` with the PyTorch backend. Supports TP-only, TEP, and DEP (dp-attn) modes with dynamic batch sizing, CUDA graph configuration, and MoE backend selection (CUTEDSL for dp-attn, TRTLLM otherwise). KV cache uses FP8 dtype.
- `.github/configs/nvidia-master.yaml` — Add `qwen3.5-fp4-b200-trt` config entry with the full search space.
- `perf-changelog.yaml` — Add changelog entry for the new benchmark config.
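To make the mode-dependent MoE backend selection mentioned above concrete, here is a hedged sketch; the variable names (`DP_ATTN`, `MOE_BACKEND`) are placeholders and not necessarily the identifiers used in the actual script.

```bash
# Illustrative only: choose the MoE backend based on whether dp-attention (DEP) is enabled,
# per the changelog description (CUTEDSL for dp-attn, TRTLLM otherwise).
if [ "${DP_ATTN:-false}" = "true" ]; then
    MOE_BACKEND="CUTEDSL"   # DEP / dp-attn configurations
else
    MOE_BACKEND="TRTLLM"    # TP-only and TEP configurations
fi
echo "Using MoE backend: ${MOE_BACKEND}"
```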