[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM by spectrometerHBH · Pull Request #19785 · apache/tvm

spectrometerHBH · 2026-06-15T23:23:54Z

Summary

Batch of tirx CUDA backend framework updates, on top of latest main:

FA4: env-driven ptxas register level and scheduler num_ctas support.
CLC: clusterlaunchcontrol device intrinsics and a CLC-based tile scheduler.
nvfp4: framework support for nvfp4 tcgen05 GEMM.
Elementwise: scope-level operands for warp/wg/cta register elementwise ops.
LLVM codegen: diagnostic for duplicate PrimFunc global symbols.
CUDA: default device-code compilation to NVRTC.
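
As a rough illustration of the duplicate-symbol diagnostic in the list above, the check amounts to grouping functions by their exported symbol and reporting any symbol claimed more than once. The sketch below is plain Python, not the LLVM codegen change itself; `find_duplicate_global_symbols`, `format_diagnostic`, and the input mapping are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def find_duplicate_global_symbols(funcs):
    """Return {symbol: [function names]} for symbols used by more than one function.

    `funcs` maps a function name to its global symbol. This input shape is an
    assumption, standing in for the PrimFuncs of a module and their symbols.
    """
    by_symbol = defaultdict(list)
    for name, symbol in funcs.items():
        by_symbol[symbol].append(name)
    # Keep only the symbols that more than one function claims.
    return {sym: sorted(names) for sym, names in by_symbol.items() if len(names) > 1}


def format_diagnostic(duplicates):
    """Build one human-readable line per clashing symbol."""
    return [
        f"duplicate global_symbol '{sym}' used by: {', '.join(names)}"
        for sym, names in sorted(duplicates.items())
    ]
```

Reporting every clashing symbol together, rather than stopping at the first, is a design choice of this sketch and may differ from what the PR does.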

Testing

Tests under tests/python/tirx pass locally on sm_100a (B200).
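
Since the suite targets sm_100a, a test gate has to decide from the visible device whether to run at all. The following is a minimal sketch of such a decision, assuming sm_100a corresponds to compute capability 10.0 (B200) and that the capability is already available as a `(major, minor)` tuple; `should_run_tirx_suite` is a hypothetical helper, not the conftest added in this PR.

```python
def should_run_tirx_suite(compute_capability):
    """Decide whether the Blackwell-only tests should run.

    `compute_capability` is a (major, minor) tuple, or None when no CUDA
    device is visible (for example on a CPU-only CI node).
    """
    if compute_capability is None:
        return False
    # Assumption: sm_100a corresponds to compute capability 10.0 (B200).
    return tuple(compute_capability) == (10, 0)
```

In a pytest setup, a result of `False` would typically translate into a skip rather than a failure, so CPU nodes and older GPUs report the suite as skipped.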

gemini-code-assist

Code Review

This pull request introduces a new pre-commit kernel regression benchmark (tir-bench) with automatic GPU selection and ratio-based reporting, adds strict kernel-import checking to the test suite, and implements Blackwell Cluster Launch Control (CLC) work-stealing tile scheduling. It also switches the default CUDA compiler backend to NVRTC with several compatibility fixes, adds a duplicate global_symbol check in LLVM codegen, and supports column-slice loads for wider frags. The code review feedback correctly identifies several critical issues: a potential orphaned process leak in the benchmark monitor, a PTX assembly predicate bug that incorrectly handles non-one truthy values, a TypeError when slicing symbolic extents, and platform-compatibility issues on non-Linux or Windows systems regarding /proc access and symlink creation.

…en05 GEMM

Batch of tirx CUDA backend framework updates:

- FA4: env-driven ptxas register level and scheduler num_ctas support.
- clusterlaunchcontrol (CLC) device intrinsics and a CLC-based tile scheduler.
- Framework support for nvfp4 tcgen05 GEMM.
- Scope-level operands for warp/wg/cta register elementwise ops.
- LLVM codegen diagnostic for duplicate PrimFunc global symbols.
- Default CUDA compilation to NVRTC.

Robustness / tests:

- Cross-CTA mbarrier arrive intrinsics: guard with `setp.ne.s32 p, %2, 0` instead of `setp.eq.u32 p, %2, 1` so any non-zero `int pred` is treated as true (C boolean semantics), matching the `int pred` signature.
- Harden the NVRTC path to always define the vector-deprecation silencing macros, so device-code compilation does not depend on which CUDA header include chain is pulled in.
- Wire tests/python/tirx into the unittest CI task. The suite targets Blackwell (sm_100a); a directory conftest gates it on a real sm_100a device so it skips cleanly on CPU nodes / pre-sm_100 GPUs (where ptxas/NVRTC would otherwise reject tcgen05 / cp.async `.async` / fp8) and runs in full where the hardware is present.
- Add `gpu` markers and CUDA compute-capability skipifs across the tirx tests.

Tests under tests/python/tirx pass locally on sm_100a (B200).

Signed-off-by: spectrometerHBH <bohanhou@andrew.cmu.edu>
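
The predicate fix described in the commit message can be modelled in a few lines. The sketch below is plain Python that mimics the two comparisons rather than executing PTX; the function names are hypothetical and exist only to show which `int pred` values the two guards treat differently.

```python
def pred_eq_one(pred):
    # Models `setp.eq.u32 p, %2, 1`: true only when pred is exactly 1.
    return pred == 1


def pred_ne_zero(pred):
    # Models `setp.ne.s32 p, %2, 0`: true for any non-zero pred,
    # which matches C boolean semantics for an `int` argument.
    return pred != 0


def disagreements(values):
    """Return the inputs on which the old and new guards give different answers."""
    return [v for v in values if pred_eq_one(v) != pred_ne_zero(v)]
```

For 0 and 1 the two guards agree, so callers that only ever pass those values see no change; the difference shows up for other truthy values such as 2 or -1, which the old guard treated as false.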

gemini-code-assist Bot reviewed Jun 15, 2026

spectrometerHBH force-pushed the apache-upstream-jun15 branch from de9b0a2 to e1acbcb Compare June 15, 2026 23:28

spectrometerHBH changed the title from "Forward-port tirx framework updates (FA4, CLC, nvfp4 GEMM, codegen diagnostics)" to "[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM" Jun 15, 2026

spectrometerHBH force-pushed the apache-upstream-jun15 branch from e1acbcb to c036863 Compare June 15, 2026 23:31

tqchen approved these changes Jun 16, 2026

spectrometerHBH force-pushed the apache-upstream-jun15 branch 4 times, most recently from 80fa3f9 to 4cb04ef Compare June 16, 2026 05:38

spectrometerHBH force-pushed the apache-upstream-jun15 branch from 4cb04ef to 8b34b14 Compare June 16, 2026 06:47

tqchen merged commit 16d0a7e into apache:main Jun 16, 2026
8 checks passed

tqchen merged 1 commit into apache:main from spectrometerHBH:apache-upstream-jun15
