
[WIP][NV] Qwen3.5 fp4 b200 trt #1280

Open

hshrivastava-droid wants to merge 2 commits into main from nv/qwen3.5-fp4-b200-trt_v2

Conversation


hshrivastava-droid (Collaborator) commented May 4, 2026

Summary

Add TensorRT-LLM benchmark configuration for Qwen3.5-397B-A17B (FP4) on NVIDIA B200 GPUs.

Details

  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12
  • Framework: TensorRT-LLM (PyTorch backend via trtllm-serve; an illustrative launch sketch follows this list)
  • Precision: FP4 (NVFP4)
  • Hardware: B200 (single-node)
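For orientation, a hedged sketch of a trtllm-serve launch in this shape — only --max_seq_len and --max_num_tokens appear verbatim in the review thread below; the other flags, their values, and the extra-options filename are assumptions:

```bash
# Illustrative launch, not the PR's actual script. TP/EP sizes and the
# extra-options filename are assumptions based on the search space below.
trtllm-serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --backend pytorch \
  --tp_size 4 --ep_size 4 \
  --max_seq_len 2048 --max_num_tokens 16384 \
  --extra_llm_api_options extra_config.yaml
```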

Benchmark Search Space

1k/1k (ISL=1024, OSL=1024):

| Mode          | Configuration | Concurrency |
| ------------- | ------------- | ----------- |
| TP-only       | TP=4, EP=1    | 4–16        |
| TP-only       | TP=8, EP=1    | 4           |
| TEP           | TP=4, EP=4    | 16–256      |
| DEP (dp-attn) | TP=4, EP=4    | 1024        |
| DEP (dp-attn) | TP=8, EP=8    | 512–1024    |

8k/1k (ISL=8192, OSL=1024):

| Mode          | Configuration | Concurrency |
| ------------- | ------------- | ----------- |
| TP-only       | TP=2, EP=1    | 4–32        |
| TP-only       | TP=4, EP=1    | 4–8         |
| TP-only       | TP=8, EP=1    | 4           |
| TEP           | TP=2, EP=2    | 64          |
| TEP           | TP=4, EP=4    | 16          |
| DEP (dp-attn) | TP=4, EP=4    | 256–1024    |
| DEP (dp-attn) | TP=8, EP=8    | 512–1024    |

Changes

  • benchmarks/single_node/qwen3.5_fp4_b200_trt.sh — New benchmark script using trtllm-serve with the PyTorch backend. Supports TP-only, TEP, and DEP (dp-attn) modes with dynamic batch sizing, CUDA graph configuration, and MoE backend selection (CUTEDSL for dp-attn, TRTLLM otherwise; a sketch of this selection follows the list). KV cache uses FP8 dtype.
  • .github/configs/nvidia-master.yaml — Add qwen3.5-fp4-b200-trt config entry with full search space.
  • perf-changelog.yaml — Add changelog entry for the new benchmark config.
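A minimal sketch of the MoE backend selection named in the first bullet — the variable names here are hypothetical, and only the CUTEDSL-for-dp-attn / TRTLLM-otherwise rule comes from the description above:

```bash
# Hypothetical restatement of the selection rule described above; DP_ATTN and
# MOE_BACKEND are illustrative names, not necessarily the script's own.
if [ "${DP_ATTN}" = "true" ]; then
  MOE_BACKEND="CUTEDSL"   # DEP (dp-attn) runs use the CUTEDSL MoE backend
else
  MOE_BACKEND="TRTLLM"    # TP-only and TEP runs use the TRTLLM MoE backend
fi
```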


github-actions Bot commented May 4, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments

hshrivastava-droid (Collaborator, Author) commented:

@claude add PR description


Klaud-Cold (Collaborator) commented May 4, 2026

Claude finished @hshrivastava-droid's task in 45s — View job


Adding PR Description

  • Read changed files to understand the PR
  • Write and update PR description

PR description has been added with a summary of the changes, model/image/hardware details, benchmark search space tables for both 1k/1k and 8k/1k scenarios, and a per-file changelog.

Comment on benchmarks/single_node/qwen3.5_fp4_b200_trt.sh, lines +127 to +130:
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

🔴 EVAL_ONLY only updates MAX_MODEL_LEN, not MAX_NUM_TOKENS — diverges from every sibling TRT script (dsr1/dsv4/gptoss/h200 variants), which all set both to EVAL_MAX_MODEL_LEN inside the EVAL_ONLY block. As a secondary issue, even if MAX_NUM_TOKENS were updated here, the YAML on lines 80-105 has already been written with the original value before the EVAL_ONLY block at line 127, so the on-disk config would still hold the stale max_num_tokens. Suggested fix: add MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN" inside the EVAL_ONLY block AND move that block above the EXTRA_CONFIG_FILE write (or rewrite the YAML afterwards).

Extended reasoning...

What is wrong

In benchmarks/single_node/qwen3.5_fp4_b200_trt.sh lines 127-130:

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi
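A minimal sketch of the fix this comment proposes — the heredoc below is a stand-in for the script's actual YAML write at lines 82-105, which is not shown in this thread:

```bash
# Sketch of the proposed fix, not the script's actual code: clamp both
# variables for eval, and only then write the YAML so it sees the final values.
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
    MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN"   # the missing override
fi

# EXTRA_CONFIG_FILE write moved below the EVAL_ONLY block (stand-in heredoc):
cat > "$EXTRA_CONFIG_FILE" <<EOF
max_num_tokens: ${MAX_NUM_TOKENS}
EOF
```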

Only MAX_MODEL_LEN is overridden. Every comparable TRT-LLM script in the repo overrides both:

| Script                            | MAX_MODEL_LEN | MAX_NUM_TOKENS |
| --------------------------------- | ------------- | -------------- |
| dsr1_fp4_b200_trt.sh (80-84)      | yes           | yes            |
| dsv4_fp4_b300_trt.sh (99-101)     | yes           | yes            |
| gptoss_fp4_b200_trt.sh (81-85)    | yes           | yes            |
| gptoss_fp4_h200_trt.sh (51-53)    | yes           | yes            |
| dsr1_fp8_h200_trt.sh (68-70)      | yes           | yes            |
| qwen3.5_fp4_b200_trt.sh (127-130) | yes           | no             |

There is also an ordering bug unique to this script. The YAML at lines 82-105 already embeds max_num_tokens: $MAX_NUM_TOKENS before the EVAL_ONLY block runs. The sibling scripts that share this pattern (dsr1_fp4_b200_trt.sh, gptoss_fp4_b200_trt.sh) deliberately do not put max_num_tokens in the YAML — they only pass it via the --max_num_tokens CLI flag, so a late variable update is sufficient. Qwen does both, so even a one-line fix to update the variable would leave the YAML on disk holding the stale value.

Step-by-step proof — EVAL_ONLY=true, ISL=1024, OSL=1024

  1. Lines 57-61: MAX_NUM_TOKENS=16384.
  2. Line 84: YAML written with max_num_tokens: 16384.
  3. Line 128: setup_eval_context computes EVAL_MAX_MODEL_LEN = 1024+1024+256 = 2304 (see benchmark_lib.sh:674-676).
  4. Line 129: MAX_MODEL_LEN=2304. MAX_NUM_TOKENS stays at 16384.
  5. Line 165-166: server launched with --max_seq_len=2304 --max_num_tokens=16384.

For ISL=8192 it is even more pronounced: max_seq_len=9472, max_num_tokens=33792. Compare against gptoss_fp4_b200_trt.sh where the analogous flow ends with --max_seq_len=2304 --max_num_tokens=2304 because line 84 there also clamps MAX_NUM_TOKENS.
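Restated as runnable shell — the formula comes from this comment's citation of benchmark_lib.sh:674-676; the variable names here are illustrative, not the library's actual code:

```bash
# Hypothetical restatement of setup_eval_context's sizing formula
# (per benchmark_lib.sh:674-676 as cited above), not the real implementation.
ISL=1024; OSL=1024
EVAL_MAX_MODEL_LEN=$(( ISL + OSL + 256 ))   # = 2304, matching step 3 above
echo "$EVAL_MAX_MODEL_LEN"
```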

Addressing the refutation

The refuter argues that max_num_tokens is batch-level (not per-sequence) so a value larger than max_seq_len is a valid TRT-LLM config and not wasteful. That is technically true for steady-state correctness — the server will start and lm_eval will run. The refutation is right that this is not a crash. However:

  • The cuda-graph capture set is sized off this value. Lines 100-102 declare cuda_graph_config.batch_sizes up to 512, and TRT-LLM caps cuda-graph capture by max_num_tokens. Leaving max_num_tokens=33792 while max_seq_len=9472 causes EVAL_ONLY startup to capture graphs sized for the full benchmark scenario instead of the eval scenario, which is exactly what every other sibling script is trying to avoid. (A sketch of the declaration in question follows this list.)
  • gptoss_fp4_b200_trt.sh is a direct precedent, not just dsr1. The refuter dismisses dsr1 because its MAX_NUM_TOKENS is computed dynamically from CONC+ISL and so "makes sense" to override. But gptoss_fp4_b200_trt.sh:79 uses a hand-tuned static value (MAX_NUM_TOKENS=20000) just like qwen3.5, and still clamps it down in the EVAL_ONLY block (gptoss_fp4_b200_trt.sh:84). That removes the refuter's main escape hatch — the pattern applies regardless of how the original value was derived.
  • "YAML/CLI consistency is preserved" is technically true (both stay 16384/33792) but misses the goal of EVAL_ONLY mode: shrink the server to the eval workload. Both are wrong relative to that goal, so the fact that they are wrong in the same way is not a defense.

Severity

Marking normal rather than nit because (a) the pattern is unambiguous across 5+ sibling TRT scripts, (b) the YAML-vs-CLI ordering issue is a separate, qwen3.5-specific bug that any naive one-liner fix would still leave behind, and (c) the wasted cuda-graph capture meaningfully extends EVAL_ONLY startup. It does not corrupt results, however, so it is on the lower end of normal.

Comment on perf-changelog.yaml (thread marked Outdated):
- "Model: nvidia/Qwen3.5-397B-A17B-NVFP4"
- "1k1k: TP-only (tp4 conc 4-16, tp8 conc 4), TEP (tp4ep4 conc 16-256), DEP (tp4ep4 dp-attn conc 1024, tp8ep8 dp-attn conc 512-1024)"
- "8k1k: TP-only (tp2 conc 4-32, tp4 conc 4-8, tp8 conc 4), TEP (tp2ep2 conc 64, tp4ep4 conc 16), DEP (tp4ep4 dp-attn conc 256-1024, tp8ep8 dp-attn conc 512-1024)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

🟡 perf-changelog.yaml line 2219 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, a placeholder that should be replaced with the actual PR number (1280) before un-WIPing this PR. All other recent entries in the file (e.g. 1222, 1065, 1256) use real PR numbers, and process_changelog.py enforces append-only on this file, so a stray XXX would persist permanently as a broken link in the changelog.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml (lines 2210-2219 in the diff) ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

The literal string XXX is a placeholder rather than the real PR number. Every other recently-added entry in the file uses a real number — the immediately preceding entries point at pull/1222, pull/1065, and pull/1256, matching the merge commits visible in git log (c898aeb, a68d253, 0f630e1). For this PR the correct value is 1280.

Why this matters even though the PR is marked [WIP]

The PR title carries [WIP] which suggests the author intends to clean this up before merge. However, perf-changelog.yaml is governed by process_changelog.py, which enforces append-only semantics: existing entries cannot be edited or removed once committed. That means if the PR is merged with pull/XXX, the broken link is locked into the changelog history — there is no clean way to fix it after the fact without bending the very rule the file exists to enforce.
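process_changelog.py itself is not shown in this thread; a minimal sketch of the kind of append-only guard it is described as enforcing might look like:

```bash
# Hypothetical append-only guard, in the spirit of what the comment attributes
# to process_changelog.py (not its actual implementation): fail if any existing
# line of perf-changelog.yaml was removed or modified relative to main.
if git diff origin/main -- perf-changelog.yaml | grep -q '^-[^-]'; then
  echo "error: perf-changelog.yaml is append-only; existing entries changed" >&2
  exit 1
fi
```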

Impact

The link is purely metadata, so this does not affect benchmark execution, scheduling, or correctness of the new qwen3.5-fp4-b200-trt config. The impact is limited to documentation/auditability: anyone clicking through the changelog to find the PR that introduced this benchmark gets a 404 (/pull/XXX is not a valid GitHub PR URL), and the broken entry persists indefinitely because of the append-only rule.

Step-by-step proof

  1. Open perf-changelog.yaml and look at lines 2210-2219 in the diff — the new entry's last line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  2. Compare with the three preceding entries in the file (visible in the surrounding context): they end with pull/1222, pull/1065, pull/1256 respectively — real, dereferenceable PR numbers.
  3. From the PR metadata, this PR's number is 1280.
  4. Visiting https://github.com/SemiAnalysisAI/InferenceX/pull/XXX resolves to a 404 since XXX is not a numeric id.
  5. process_changelog.py rejects deletions/modifications to perf-changelog.yaml entries, so once merged this line cannot be cleanly amended.

Suggested fix

Before un-WIPing, change the last line of the new entry to:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1280
```

That single-line edit is exactly the kind of last-mile cleanup the [WIP] tag implies, but it's worth flagging explicitly so it doesn't slip through review.
