Skip to content

[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM#19785

Merged
tqchen merged 1 commit into
apache:mainfrom
spectrometerHBH:apache-upstream-jun15
Jun 16, 2026
Merged

[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM#19785
tqchen merged 1 commit into
apache:mainfrom
spectrometerHBH:apache-upstream-jun15

Conversation

@spectrometerHBH

@spectrometerHBH spectrometerHBH commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Batch of tirx CUDA backend framework updates, on top of latest main:

  • FA4: env-driven ptxas register level and scheduler num_ctas support.
  • CLC: clusterlaunchcontrol device intrinsics and a CLC-based tile scheduler.
  • nvfp4: framework support for nvfp4 tcgen05 GEMM.
  • Elementwise: scope-level operands for warp/wg/cta register elementwise ops.
  • LLVM codegen: diagnostic for duplicate PrimFunc global symbols.
  • CUDA: default device-code compilation to NVRTC.

Testing

Tests under tests/python/tirx pass locally on sm_100a (B200).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new pre-commit kernel regression benchmark (tir-bench) with automatic GPU selection and ratio-based reporting, adds strict kernel-import checking to the test suite, and implements Blackwell Cluster Launch Control (CLC) work-stealing tile scheduling. It also switches the default CUDA compiler backend to NVRTC with several compatibility fixes, adds a duplicate global_symbol check in LLVM codegen, and supports column-slice loads for wider frags. The code review feedback correctly identifies several critical issues: a potential orphaned process leak in the benchmark monitor, a PTX assembly predicate bug that incorrectly handles non-one truthy values, a TypeError when slicing symbolic extents, and platform-compatibility issues on non-Linux or Windows systems regarding /proc access and symlink creation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread .claude/commands/tir-bench/run.py Outdated
Comment thread python/tvm/backend/cuda/operator/intrinsics/sync.py
Comment thread .claude/commands/tir-bench/run.py Outdated
Comment thread .claude/commands/tir-bench/run.py Outdated
@spectrometerHBH spectrometerHBH force-pushed the apache-upstream-jun15 branch from de9b0a2 to e1acbcb Compare June 15, 2026 23:28
@spectrometerHBH spectrometerHBH changed the title Forward-port tirx framework updates (FA4, CLC, nvfp4 GEMM, codegen diagnostics) [TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM Jun 15, 2026
@spectrometerHBH spectrometerHBH force-pushed the apache-upstream-jun15 branch from e1acbcb to c036863 Compare June 15, 2026 23:31
@spectrometerHBH spectrometerHBH force-pushed the apache-upstream-jun15 branch 4 times, most recently from 80fa3f9 to 4cb04ef Compare June 16, 2026 05:38
…en05 GEMM

Batch of tirx CUDA backend framework updates:

- FA4: env-driven ptxas register level and scheduler num_ctas support.
- clusterlaunchcontrol (CLC) device intrinsics and a CLC-based tile scheduler.
- Framework support for nvfp4 tcgen05 GEMM.
- Scope-level operands for warp/wg/cta register elementwise ops.
- LLVM codegen diagnostic for duplicate PrimFunc global symbols.
- Default CUDA compilation to NVRTC.

Robustness / tests:

- Cross-CTA mbarrier arrive intrinsics: guard with `setp.ne.s32 p, %2, 0`
  instead of `setp.eq.u32 p, %2, 1` so any non-zero `int pred` is treated as
  true (C boolean semantics), matching the `int pred` signature.
- Harden the NVRTC path to always define the vector-deprecation silencing
  macros, so device-code compilation does not depend on which CUDA header
  include chain is pulled in.
- Wire tests/python/tirx into the unittest CI task. The suite targets
  Blackwell (sm_100a); a directory conftest gates it on a real sm_100a device
  so it skips cleanly on CPU nodes / pre-sm_100 GPUs (where ptxas/NVRTC would
  otherwise reject tcgen05 / cp.async `.async` / fp8) and runs in full where
  the hardware is present.
- Add `gpu` markers and CUDA compute-capability skipifs across the tirx tests.

Tests under tests/python/tirx pass locally on sm_100a (B200).

Signed-off-by: spectrometerHBH <bohanhou@andrew.cmu.edu>
@spectrometerHBH spectrometerHBH force-pushed the apache-upstream-jun15 branch from 4cb04ef to 8b34b14 Compare June 16, 2026 06:47
@tqchen tqchen merged commit 16d0a7e into apache:main Jun 16, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants