[TIRX][CUDA] Framework support for FA4, CLC intrinsics, and nvfp4 tcgen05 GEMM #19785
Code Review
This pull request introduces a new pre-commit kernel regression benchmark (tir-bench) with automatic GPU selection and ratio-based reporting, adds strict kernel-import checking to the test suite, and implements Blackwell Cluster Launch Control (CLC) work-stealing tile scheduling. It also switches the default CUDA compiler backend to NVRTC with several compatibility fixes, adds a duplicate global_symbol check in LLVM codegen, and supports column-slice loads for wider frags. The code review feedback identifies several critical issues: a potential orphaned-process leak in the benchmark monitor, a PTX assembly predicate bug that mishandles truthy values other than 1, a TypeError when slicing symbolic extents, and platform-compatibility problems with /proc access on non-Linux systems and with symlink creation on Windows.
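As a rough illustration of what a duplicate global_symbol check does, here is a minimal Python sketch. It is not the LLVM codegen implementation from this PR: the function names and the list of `(function_name, global_symbol)` pairs are hypothetical, chosen only to show the idea of reporting every symbol claimed by more than one function.

```python
from collections import defaultdict


def find_duplicate_global_symbols(funcs):
    """Return {symbol: [function names]} for symbols used by more than one function.

    `funcs` is a hypothetical list of (function_name, global_symbol) pairs.
    """
    owners = defaultdict(list)
    for name, symbol in funcs:
        owners[symbol].append(name)
    return {sym: names for sym, names in owners.items() if len(names) > 1}


def check_unique_global_symbols(funcs):
    """Raise ValueError naming every clashing symbol; return None otherwise."""
    dups = find_duplicate_global_symbols(funcs)
    if dups:
        details = "; ".join(
            f"{sym!r} used by {names}" for sym, names in sorted(dups.items())
        )
        raise ValueError(f"duplicate global_symbol: {details}")
```

Reporting all clashes in one error, rather than stopping at the first, is a design choice of this sketch; it saves a rebuild cycle when several functions collide.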
…en05 GEMM

Batch of tirx CUDA backend framework updates:
- FA4: env-driven ptxas register level and scheduler num_ctas support.
- clusterlaunchcontrol (CLC) device intrinsics and a CLC-based tile scheduler.
- Framework support for nvfp4 tcgen05 GEMM.
- Scope-level operands for warp/wg/cta register elementwise ops.
- LLVM codegen diagnostic for duplicate PrimFunc global symbols.
- Default CUDA compilation to NVRTC.

Robustness / tests:
- Cross-CTA mbarrier arrive intrinsics: guard with `setp.ne.s32 p, %2, 0` instead of `setp.eq.u32 p, %2, 1` so any non-zero `int pred` is treated as true (C boolean semantics), matching the `int pred` signature.
- Harden the NVRTC path to always define the vector-deprecation silencing macros, so device-code compilation does not depend on which CUDA header include chain is pulled in.
- Wire tests/python/tirx into the unittest CI task. The suite targets Blackwell (sm_100a); a directory conftest gates it on a real sm_100a device so it skips cleanly on CPU nodes / pre-sm_100 GPUs (where ptxas/NVRTC would otherwise reject tcgen05 / cp.async `.async` / fp8) and runs in full where the hardware is present.
- Add `gpu` markers and CUDA compute-capability skipifs across the tirx tests.

Tests under tests/python/tirx pass locally on sm_100a (B200).

Signed-off-by: spectrometerHBH <bohanhou@andrew.cmu.edu>
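To make the predicate fix concrete, the following Python sketch models the two guards on a 32-bit integer. It only simulates the comparison semantics; the function names are hypothetical and nothing here is taken from the actual intrinsic code.

```python
def pred_eq_one(pred: int) -> bool:
    """Old guard: models `setp.eq.u32 p, %2, 1` (true only when pred == 1)."""
    # Mask to 32 bits so negative Python ints behave like an unsigned register.
    return (pred & 0xFFFFFFFF) == 1


def pred_ne_zero(pred: int) -> bool:
    """New guard: models `setp.ne.s32 p, %2, 0` (true for any non-zero pred)."""
    return pred != 0


# Compare both guards on a few representative `int pred` values.
for value in (0, 1, 2, -1):
    print(value, pred_eq_one(value), pred_ne_zero(value))
```

The two guards agree for 0 and 1 and disagree for 2 and -1: the old form treats those values as false, while the new form treats them as true, which matches how C evaluates a non-zero `int` in a boolean context.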
Summary
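Batch of tirx CUDA backend framework updates, on top of latest main:
- FA4: env-driven ptxas register level and scheduler `num_ctas` support.
- clusterlaunchcontrol (CLC) device intrinsics and a CLC-based tile scheduler.
- Framework support for nvfp4 tcgen05 GEMM.
- Scope-level operands for warp/wg/cta register elementwise ops.
- LLVM codegen diagnostic for duplicate `PrimFunc` global symbols.
- Default CUDA compilation to NVRTC.

Testing
Tests under `tests/python/tirx` pass locally on sm_100a (B200).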
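The hardware gating for this test directory can be pictured with a small helper like the one below. This is a sketch under stated assumptions, not the conftest from the PR: `skip_reason` and `REQUIRED_CC` are hypothetical names, the compute capability is passed in as a `(major, minor)` tuple rather than queried from a device, and matching only compute capability 10.0 for sm_100a is an assumption of the sketch.

```python
from typing import Optional, Tuple

# Assumption: sm_100a corresponds to compute capability 10.0 (e.g. B200).
REQUIRED_CC = (10, 0)


def skip_reason(compute_capability: Optional[Tuple[int, int]]) -> Optional[str]:
    """Return a skip message, or None when the tests should run.

    `compute_capability` is None on a node without a CUDA device.
    """
    if compute_capability is None:
        return "no CUDA device available"
    if compute_capability != REQUIRED_CC:
        major, minor = compute_capability
        return f"requires sm_100a, found sm_{major}{minor}"
    return None
```

In a conftest, a non-None result could be passed as the `reason` of `pytest.mark.skip`, so that CPU nodes and older GPUs report a skip instead of a compilation failure.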