
[WIP][NV] Qwen3.5 fp4 b200 trt #1280

Open

hshrivastava-droid wants to merge 2 commits into main from nv/qwen3.5-fp4-b200-trt_v2

Conversation


hshrivastava-droid (Collaborator) commented May 4, 2026

Summary

Add TensorRT-LLM benchmark configuration for Qwen3.5-397B-A17B (FP4) on NVIDIA B200 GPUs.

Details

  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12
  • Framework: TensorRT-LLM (PyTorch backend via trtllm-serve; an illustrative launch sketch follows this list)
  • Precision: FP4 (NVFP4)
  • Hardware: B200 (single-node)
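For orientation, a hedged sketch of a trtllm-serve launch in this shape — only --max_seq_len and --max_num_tokens appear verbatim in the review thread below; the other flags, their values, and the extra-options filename are assumptions:

```bash
# Illustrative launch, not the PR's actual script. TP/EP sizes and the
# extra-options filename are assumptions based on the search space below.
trtllm-serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --backend pytorch \
  --tp_size 4 --ep_size 4 \
  --max_seq_len 2048 --max_num_tokens 16384 \
  --extra_llm_api_options extra_config.yaml
```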

Benchmark Search Space

1k/1k (ISL=1024, OSL=1024):

| Mode          | Configuration | Concurrency |
| ------------- | ------------- | ----------- |
| TP-only       | TP=4, EP=1    | 4–16        |
| TP-only       | TP=8, EP=1    | 4           |
| TEP           | TP=4, EP=4    | 16–256      |
| DEP (dp-attn) | TP=4, EP=4    | 1024        |
| DEP (dp-attn) | TP=8, EP=8    | 512–1024    |

8k/1k (ISL=8192, OSL=1024):

| Mode          | Configuration | Concurrency |
| ------------- | ------------- | ----------- |
| TP-only       | TP=2, EP=1    | 4–32        |
| TP-only       | TP=4, EP=1    | 4–8         |
| TP-only       | TP=8, EP=1    | 4           |
| TEP           | TP=2, EP=2    | 64          |
| TEP           | TP=4, EP=4    | 16          |
| DEP (dp-attn) | TP=4, EP=4    | 256–1024    |
| DEP (dp-attn) | TP=8, EP=8    | 512–1024    |

Changes

  • benchmarks/single_node/qwen3.5_fp4_b200_trt.sh — New benchmark script using trtllm-serve with the PyTorch backend. Supports TP-only, TEP, and DEP (dp-attn) modes with dynamic batch sizing, CUDA graph configuration, and MoE backend selection (CUTEDSL for dp-attn, TRTLLM otherwise; a sketch of this selection follows the list). KV cache uses FP8 dtype.
  • .github/configs/nvidia-master.yaml — Add qwen3.5-fp4-b200-trt config entry with full search space.
  • perf-changelog.yaml — Add changelog entry for the new benchmark config.
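A minimal sketch of the MoE backend selection named in the first bullet — the variable names here are hypothetical, and only the CUTEDSL-for-dp-attn / TRTLLM-otherwise rule comes from the description above:

```bash
# Hypothetical restatement of the selection rule described above; DP_ATTN and
# MOE_BACKEND are illustrative names, not necessarily the script's own.
if [ "${DP_ATTN}" = "true" ]; then
  MOE_BACKEND="CUTEDSL"   # DEP (dp-attn) runs use the CUTEDSL MoE backend
else
  MOE_BACKEND="TRTLLM"    # TP-only and TEP runs use the TRTLLM MoE backend
fi
```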


github-actions Bot commented May 4, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments

hshrivastava-droid (Collaborator, Author) commented:

@claude add PR description


Klaud-Cold (Collaborator) commented May 4, 2026

Claude finished @hshrivastava-droid's task in 45s — View job


Adding PR Description

  • Read changed files to understand the PR
  • Write and update PR description

PR description has been added with a summary of the changes, model/image/hardware details, benchmark search space tables for both 1k/1k and 8k/1k scenarios, and a per-file changelog.

Comment on benchmarks/single_node/qwen3.5_fp4_b200_trt.sh, lines +127 to +130:
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

🔴 EVAL_ONLY only updates MAX_MODEL_LEN, not MAX_NUM_TOKENS — diverges from every sibling TRT script (dsr1/dsv4/gptoss/h200 variants), which all set both to EVAL_MAX_MODEL_LEN inside the EVAL_ONLY block. As a secondary issue, even if MAX_NUM_TOKENS were updated here, the YAML on lines 80-105 has already been written with the original value before the EVAL_ONLY block at line 127, so the on-disk config would still hold the stale max_num_tokens. Suggested fix: add MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN" inside the EVAL_ONLY block AND move that block above the EXTRA_CONFIG_FILE write (or rewrite the YAML afterwards).

Extended reasoning...

What is wrong

In benchmarks/single_node/qwen3.5_fp4_b200_trt.sh lines 127-130:

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi
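A minimal sketch of the fix this comment proposes — the heredoc below is a stand-in for the script's actual YAML write at lines 82-105, which is not shown in this thread:

```bash
# Sketch of the proposed fix, not the script's actual code: clamp both
# variables for eval, and only then write the YAML so it sees the final values.
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
    MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN"   # the missing override
fi

# EXTRA_CONFIG_FILE write moved below the EVAL_ONLY block (stand-in heredoc):
cat > "$EXTRA_CONFIG_FILE" <<EOF
max_num_tokens: ${MAX_NUM_TOKENS}
EOF
```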

Only MAX_MODEL_LEN is overridden. Every comparable TRT-LLM script in the repo overrides both:

| Script                            | MAX_MODEL_LEN | MAX_NUM_TOKENS |
| --------------------------------- | ------------- | -------------- |
| dsr1_fp4_b200_trt.sh (80-84)      | yes           | yes            |
| dsv4_fp4_b300_trt.sh (99-101)     | yes           | yes            |
| gptoss_fp4_b200_trt.sh (81-85)    | yes           | yes            |
| gptoss_fp4_h200_trt.sh (51-53)    | yes           | yes            |
| dsr1_fp8_h200_trt.sh (68-70)      | yes           | yes            |
| qwen3.5_fp4_b200_trt.sh (127-130) | yes           | no             |

There is also an ordering bug unique to this script. The YAML at lines 82-105 already embeds max_num_tokens: $MAX_NUM_TOKENS before the EVAL_ONLY block runs. The sibling scripts that share this pattern (dsr1_fp4_b200_trt.sh, gptoss_fp4_b200_trt.sh) deliberately do not put max_num_tokens in the YAML — they only pass it via the --max_num_tokens CLI flag, so a late variable update is sufficient. Qwen does both, so even a one-line fix to update the variable would leave the YAML on disk holding the stale value.

Step-by-step proof — EVAL_ONLY=true, ISL=1024, OSL=1024

  1. Lines 57-61: MAX_NUM_TOKENS=16384.
  2. Line 84: YAML written with max_num_tokens: 16384.
  3. Line 128: setup_eval_context computes EVAL_MAX_MODEL_LEN = 1024+1024+256 = 2304 (see benchmark_lib.sh:674-676).
  4. Line 129: MAX_MODEL_LEN=2304. MAX_NUM_TOKENS stays at 16384.
  5. Line 165-166: server launched with --max_seq_len=2304 --max_num_tokens=16384.

For ISL=8192 it is even more pronounced: max_seq_len=9472, max_num_tokens=33792. Compare against gptoss_fp4_b200_trt.sh where the analogous flow ends with --max_seq_len=2304 --max_num_tokens=2304 because line 84 there also clamps MAX_NUM_TOKENS.
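Restated as runnable shell — the formula comes from this comment's citation of benchmark_lib.sh:674-676; the variable names here are illustrative, not the library's actual code:

```bash
# Hypothetical restatement of setup_eval_context's sizing formula
# (per benchmark_lib.sh:674-676 as cited above), not the real implementation.
ISL=1024; OSL=1024
EVAL_MAX_MODEL_LEN=$(( ISL + OSL + 256 ))   # = 2304, matching step 3 above
echo "$EVAL_MAX_MODEL_LEN"
```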

Addressing the refutation

The refuter argues that max_num_tokens is batch-level (not per-sequence) so a value larger than max_seq_len is a valid TRT-LLM config and not wasteful. That is technically true for steady-state correctness — the server will start and lm_eval will run. The refutation is right that this is not a crash. However:

  • The cuda-graph capture set is sized off this value. Lines 100-102 declare cuda_graph_config.batch_sizes up to 512, and TRT-LLM caps cuda-graph capture by max_num_tokens. Leaving max_num_tokens=33792 while max_seq_len=9472 causes EVAL_ONLY startup to capture graphs sized for the full benchmark scenario instead of the eval scenario, which is exactly what every other sibling script is trying to avoid. (A sketch of the declaration in question follows this list.)
  • gptoss_fp4_b200_trt.sh is a direct precedent, not just dsr1. The refuter dismisses dsr1 because its MAX_NUM_TOKENS is computed dynamically from CONC+ISL and so "makes sense" to override. But gptoss_fp4_b200_trt.sh:79 uses a hand-tuned static value (MAX_NUM_TOKENS=20000) just like qwen3.5, and still clamps it down in the EVAL_ONLY block (gptoss_fp4_b200_trt.sh:84). That removes the refuter's main escape hatch — the pattern applies regardless of how the original value was derived.
  • "YAML/CLI consistency is preserved" is technically true (both stay 16384/33792) but misses the goal of EVAL_ONLY mode: shrink the server to the eval workload. Both are wrong relative to that goal, so the fact that they are wrong in the same way is not a defense.

Severity

Marking normal rather than nit because (a) the pattern is unambiguous across 5+ sibling TRT scripts, (b) the YAML-vs-CLI ordering issue is a separate, qwen3.5-specific bug that any naive one-liner fix would still leave behind, and (c) the wasted cuda-graph capture meaningfully extends EVAL_ONLY startup. It does not corrupt results, however, so it is on the lower end of normal.

Comment on perf-changelog.yaml (thread marked Outdated):
- "Model: nvidia/Qwen3.5-397B-A17B-NVFP4"
- "1k1k: TP-only (tp4 conc 4-16, tp8 conc 4), TEP (tp4ep4 conc 16-256), DEP (tp4ep4 dp-attn conc 1024, tp8ep8 dp-attn conc 512-1024)"
- "8k1k: TP-only (tp2 conc 4-32, tp4 conc 4-8, tp8 conc 4), TEP (tp2ep2 conc 64, tp4ep4 conc 16), DEP (tp4ep4 dp-attn conc 256-1024, tp8ep8 dp-attn conc 512-1024)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

🟡 perf-changelog.yaml line 2219 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, a placeholder that should be replaced with the actual PR number (1280) before un-WIPing this PR. All other recent entries in the file (e.g. 1222, 1065, 1256) use real PR numbers, and process_changelog.py enforces append-only on this file, so a stray XXX would persist permanently as a broken link in the changelog.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml (lines 2210-2219 in the diff) ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

The literal string XXX is a placeholder rather than the real PR number. Every other recently-added entry in the file uses a real number — the immediately preceding entries point at pull/1222, pull/1065, and pull/1256, matching the merge commits visible in git log (c898aeb, a68d253, 0f630e1). For this PR the correct value is 1280.

Why this matters even though the PR is marked [WIP]

The PR title carries [WIP] which suggests the author intends to clean this up before merge. However, perf-changelog.yaml is governed by process_changelog.py, which enforces append-only semantics: existing entries cannot be edited or removed once committed. That means if the PR is merged with pull/XXX, the broken link is locked into the changelog history — there is no clean way to fix it after the fact without bending the very rule the file exists to enforce.
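process_changelog.py itself is not shown in this thread; a minimal sketch of the kind of append-only guard it is described as enforcing might look like:

```bash
# Hypothetical append-only guard, in the spirit of what the comment attributes
# to process_changelog.py (not its actual implementation): fail if any existing
# line of perf-changelog.yaml was removed or modified relative to main.
if git diff origin/main -- perf-changelog.yaml | grep -q '^-[^-]'; then
  echo "error: perf-changelog.yaml is append-only; existing entries changed" >&2
  exit 1
fi
```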

Impact

The link is purely metadata, so this does not affect benchmark execution, scheduling, or correctness of the new qwen3.5-fp4-b200-trt config. The impact is limited to documentation/auditability: anyone clicking through the changelog to find the PR that introduced this benchmark gets a 404 (/pull/XXX is not a valid GitHub PR URL), and the broken entry persists indefinitely because of the append-only rule.

Step-by-step proof

  1. Open perf-changelog.yaml and look at lines 2210-2219 in the diff — the new entry's last line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  2. Compare with the three preceding entries in the file (visible in the surrounding context): they end with pull/1222, pull/1065, pull/1256 respectively — real, dereferenceable PR numbers.
  3. From the PR metadata, this PR's number is 1280.
  4. Visiting https://github.com/SemiAnalysisAI/InferenceX/pull/XXX resolves to a 404 since XXX is not a numeric id.
  5. process_changelog.py rejects deletions/modifications to perf-changelog.yaml entries, so once merged this line cannot be cleanly amended.

Suggested fix

Before un-WIPing, change the last line of the new entry to:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1280
```

That single-line edit is exactly the kind of last-mile cleanup the [WIP] tag implies, but it's worth flagging explicitly so it doesn't slip through review.
