fix(cuda): increase argmax topk kernel limit from 32 to 64 #25

Merged
Anbeeld merged 1 commit into Anbeeld:main from chimpera:fix-argmax-topk-64
May 22, 2026

chimpera commented May 22, 2026

Summary

  • Increases the custom topk_f32 CUDA kernel limit from K ≤ 32 to K ≤ 64 by expanding the fixed-size register arrays
  • Raises the CUB TopK auto-fallback threshold from K > 32 to K > 64 so the custom path covers the full top_k=64 range

Problem

On CUDA 12.x / pre-CCCL 3.2 Linux builds, the CUB TopK fallback (GGML_CUDA_DFLASH_CUB_TOP_K_AVAILABLE) is unavailable because it requires CCCL ≥ 3.2. The custom topk_f32 kernel is the only path, and it has a hardcoded K ≤ 32 limit.

The Gemma 4 GGUF bakes general.sampling.top_k = 64 into its metadata. The DFlash reduced verifier picks this up and requests K=64 when selecting from the target model's verification logits, which trips the assertion:

argmax.cu:557: GGML_ASSERT(K <= 32) failed

This crashes the server on the first inference request with any Gemma 4 model + DFlash on CUDA 12.x builds.

Fix

  • heap_val[32] / heap_idx[32] → heap_val[64] / heap_idx[64] (register arrays, +256 bytes per thread)
  • GGML_ASSERT(K <= 32) → GGML_ASSERT(K <= 64)
  • CUB auto-threshold: K > 32 → K > 64
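
For reviewers who want to see how the three changes interact, here is a minimal host-side C++ sketch, not the actual kernel. The names `kMaxCustomK`, `choose_path` and `topk_fixed` are hypothetical, the dispatch rule is an assumption based on the description above, and a sorted insertion is used in place of whatever heap scheme `topk_f32` really implements.

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical constant standing in for the hardcoded limit (32 before this PR).
constexpr int kMaxCustomK = 64;

enum class TopKPath { Custom, Cub };

// Assumed dispatch rule: CUB is chosen only when it is compiled in AND K
// exceeds the custom kernel's limit. Without CUB, the custom path is the only one.
inline TopKPath choose_path(int k, bool cub_available) {
    return (cub_available && k > kMaxCustomK) ? TopKPath::Cub : TopKPath::Custom;
}

// Host-side stand-in for one thread's selection: keep the K largest values
// (and their indices) in fixed-size arrays, sorted in descending order.
inline std::vector<std::pair<float, int>> topk_fixed(const std::vector<float>& x, int k) {
    // Plays the role of GGML_ASSERT(K <= 64) in this sketch.
    if (k < 1 || k > kMaxCustomK) throw std::invalid_argument("K out of range");

    float heap_val[kMaxCustomK];
    int   heap_idx[kMaxCustomK];
    int   n = 0;  // number of filled slots

    for (int i = 0; i < static_cast<int>(x.size()); ++i) {
        int pos;
        if (n < k) {
            pos = n++;                       // still filling: append, then sift up
        } else if (x[i] > heap_val[k - 1]) {
            pos = k - 1;                     // beats the current minimum: replace it
        } else {
            continue;                        // not in the top K
        }
        // Shift smaller entries down to keep the arrays sorted.
        while (pos > 0 && heap_val[pos - 1] < x[i]) {
            heap_val[pos] = heap_val[pos - 1];
            heap_idx[pos] = heap_idx[pos - 1];
            --pos;
        }
        heap_val[pos] = x[i];
        heap_idx[pos] = i;
    }

    std::vector<std::pair<float, int>> out;
    for (int j = 0; j < n; ++j) out.emplace_back(heap_val[j], heap_idx[j]);
    return out;
}
```

Under these assumptions, K=64 stays on the custom path whether or not CUB is available, which is the case the Gemma 4 metadata exercises.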

Shared memory at K=64, 32 warps: ~16 KB — well within the 48 KB default limit. Register pressure increase is modest and causes no issues on tested hardware (RTX 5090, RTX 3090).
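
The figures above can be checked with a few lines of compile-time arithmetic. This sketch assumes each warp stages K candidates as a 4-byte float value plus a 4-byte integer index, which is consistent with the quoted ~16 KB but is not taken from the kernel source; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes per candidate: one float value plus one 32-bit index (assumed layout).
constexpr std::size_t kCandidateBytes = sizeof(float) + sizeof(std::int32_t);

// Shared memory needed if every warp stages K candidates for the final merge.
constexpr std::size_t topk_smem_bytes(std::size_t k, std::size_t warps) {
    return warps * k * kCandidateBytes;
}

// Extra per-thread bytes from growing heap_val[] and heap_idx[] together.
constexpr std::size_t heap_growth_bytes(std::size_t old_k, std::size_t new_k) {
    return (new_k - old_k) * kCandidateBytes;
}

static_assert(topk_smem_bytes(64, 32) == 16 * 1024, "16 KB at K=64, 32 warps");
static_assert(topk_smem_bytes(64, 32) <= 48 * 1024, "fits the 48 KB default limit");
static_assert(heap_growth_bytes(32, 64) == 256, "+256 bytes per thread");
```

The same helper shows that the previous K=32 limit needed 8 KB under the same layout, so the change doubles shared memory use while leaving two thirds of the default budget free.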

Test plan

  • Build succeeds on CUDA 12.x (GCC, Linux) with GGML_CUDA_FA=ON, GGML_CUDA_FA_ALL_QUANTS=ON
  • Launch Gemma 4 31B + DFlash draft model with default sampling (top_k=64 from metadata) and verify first inference completes without GGML_ASSERT
  • Run Qwen 3.6 27B + DFlash (top_k=20) to verify no regression on K ≤ 32 path

🤖 Generated with Claude Code

The custom topk_f32 CUDA kernel had a hardcoded K <= 32 limit from
fixed-size heap arrays. The Gemma 4 GGUF bakes top_k=64 into its
metadata (general.sampling.top_k), which the DFlash reduced verifier
passes through to the target model's verification logits as K=64.

On CUDA 12.x / pre-CCCL 3.2 builds, the CUB TopK fallback is
unavailable, so K=64 hits the custom path and crashes:

  argmax.cu:557: GGML_ASSERT(K <= 32) failed

Increase the register arrays and assertion to K <= 64. Shared memory
usage stays well within limits (16 KB at K=64, 32 warps). Bump the
CUB auto-threshold to K > 64 so the custom path covers the full
top_k=64 range.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Anbeeld (Owner) commented May 22, 2026

Seems legit. Thanks!

Anbeeld merged commit 07ac3ce into Anbeeld:main on May 22, 2026
8 of 47 checks passed