fix(cuda): increase argmax topk kernel limit from 32 to 64 by chimpera · Pull Request #25 · Anbeeld/beellama.cpp

chimpera · 2026-05-22T19:19:14Z

Summary

Increases the custom topk_f32 CUDA kernel limit from K ≤ 32 to K ≤ 64 by expanding the fixed-size register arrays
Raises the CUB TopK auto-fallback threshold from K > 32 to K > 64 so the custom path covers the full top_k=64 range

Problem

On CUDA 12.x / pre-CCCL 3.2 Linux builds, the CUB TopK fallback (GGML_CUDA_DFLASH_CUB_TOP_K_AVAILABLE) is unavailable because it requires CCCL ≥ 3.2. The custom topk_f32 kernel is the only path, and it has a hardcoded K ≤ 32 limit.

The Gemma 4 GGUF bakes general.sampling.top_k = 64 into its metadata. The DFlash reduced verifier picks this up and runs top-k with K=64 over the target model's verification logits, which hits:

argmax.cu:557: GGML_ASSERT(K <= 32) failed

This crashes the server on the first inference request with any Gemma 4 model + DFlash on CUDA 12.x builds.
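
The failure mode described above can be sketched as a small host-side dispatch function. This is only an illustration of the reasoning, not the code in argmax.cu: the names `TopKPath` and `select_topk_path` are hypothetical, the `cub_available` flag stands in for the compile-time GGML_CUDA_DFLASH_CUB_TOP_K_AVAILABLE switch, and it assumes the CUB threshold equals the custom kernel limit (32 before the fix, 64 after).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical names for illustration; this is not the code in argmax.cu.
enum class TopKPath { Custom, Cub };

// Choose a top-k path for a given K.
//   cub_available: stand-in for GGML_CUDA_DFLASH_CUB_TOP_K_AVAILABLE (needs CCCL >= 3.2)
//   custom_limit:  largest K the custom kernel accepts (32 before the fix, 64 after)
inline TopKPath select_topk_path(int K, bool cub_available, int custom_limit) {
    if (cub_available && K > custom_limit) {
        return TopKPath::Cub;  // auto-fallback for large K
    }
    if (K > custom_limit) {
        // Stand-in for GGML_ASSERT(K <= custom_limit), which aborts the process.
        throw std::runtime_error("K exceeds custom topk kernel limit");
    }
    return TopKPath::Custom;
}
```

Under these assumptions, `select_topk_path(64, false, 32)` throws (the crash case on CUDA 12.x builds), while `select_topk_path(64, false, 64)` returns `TopKPath::Custom`.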

Fix

heap_val[32] / heap_idx[32] → heap_val[64] / heap_idx[64] (register arrays, +256 bytes per thread)
GGML_ASSERT(K <= 32) → GGML_ASSERT(K <= 64)
CUB auto-threshold: K > 32 → K > 64
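
To show why the array size caps K, here is a minimal host-side sketch of top-k selection with a min-heap held in fixed-size arrays. It is an assumption about the general technique, not a copy of the topk_f32 kernel: `TOPK_MAX` and `topk_indices` are hypothetical names, and the real kernel runs per thread on the GPU and merges results across warps.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

constexpr int TOPK_MAX = 64;  // capacity of the fixed-size arrays (32 before the fix)

// Return the indices of the K largest values in x, sorted ascending by index.
// A min-heap of size K lives in fixed-size arrays, so K can never exceed TOPK_MAX.
inline std::vector<int> topk_indices(const std::vector<float>& x, int K) {
    assert(K > 0 && K <= TOPK_MAX);  // mirrors GGML_ASSERT(K <= 64)
    float heap_val[TOPK_MAX];
    int   heap_idx[TOPK_MAX];
    int   n = 0;  // current heap size

    for (int i = 0; i < static_cast<int>(x.size()); ++i) {
        if (n < K) {
            // Fill phase: append the element, then sift it up.
            int c = n++;
            heap_val[c] = x[i];
            heap_idx[c] = i;
            while (c > 0) {
                int p = (c - 1) / 2;
                if (heap_val[p] <= heap_val[c]) break;
                std::swap(heap_val[p], heap_val[c]);
                std::swap(heap_idx[p], heap_idx[c]);
                c = p;
            }
        } else if (x[i] > heap_val[0]) {
            // Replace the smallest kept value, then sift it down.
            heap_val[0] = x[i];
            heap_idx[0] = i;
            int c = 0;
            for (;;) {
                int l = 2 * c + 1, r = l + 1, m = c;
                if (l < n && heap_val[l] < heap_val[m]) m = l;
                if (r < n && heap_val[r] < heap_val[m]) m = r;
                if (m == c) break;
                std::swap(heap_val[c], heap_val[m]);
                std::swap(heap_idx[c], heap_idx[m]);
                c = m;
            }
        }
    }
    std::vector<int> out(heap_idx, heap_idx + n);
    std::sort(out.begin(), out.end());
    return out;
}
```

With `TOPK_MAX` at 32, a call with K=64 would trip the assertion; raising the capacity to 64 is the only change needed for the selection logic itself to handle it.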

Shared memory at K=64, 32 warps: ~16 KB — well within the 48 KB default limit. Register pressure increase is modest and causes no issues on tested hardware (RTX 5090, RTX 3090).
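
The memory figures above can be checked with a few lines of compile-time arithmetic. The sketch assumes 4-byte float values and 4-byte integer indices, and that shared memory holds one K-entry (value, index) list per warp; these layout details are assumptions chosen to match the numbers quoted in the PR, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed entry layout: one float value plus one 32-bit index.
constexpr std::size_t kEntryBytes = sizeof(float) + sizeof(std::int32_t);

// Bytes used by heap_val[K] + heap_idx[K] in a single thread.
constexpr std::size_t heap_bytes_per_thread(std::size_t K) {
    return K * kEntryBytes;
}

// Bytes of shared memory if each warp stores K (value, index) entries.
constexpr std::size_t shared_bytes(std::size_t K, std::size_t warps) {
    return warps * K * kEntryBytes;
}

// Growing the arrays from 32 to 64 entries adds 256 bytes per thread.
static_assert(heap_bytes_per_thread(64) - heap_bytes_per_thread(32) == 256, "per-thread delta");
// K=64 with 32 warps uses 16 KB of shared memory.
static_assert(shared_bytes(64, 32) == 16 * 1024, "shared memory at K=64");
// That stays below the 48 KB default limit mentioned in the PR.
static_assert(shared_bytes(64, 32) <= 48 * 1024, "fits default limit");
```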

Test plan

Build succeeds on CUDA 12.x (GCC, Linux) with GGML_CUDA_FA=ON, GGML_CUDA_FA_ALL_QUANTS=ON
Launch Gemma 4 31B + DFlash draft model with default sampling (top_k=64 from metadata) and verify first inference completes without GGML_ASSERT
Run Qwen 3.6 27B + DFlash (top_k=20) to verify no regression on K ≤ 32 path

🤖 Generated with Claude Code

The custom topk_f32 CUDA kernel had a hardcoded K <= 32 limit from fixed-size heap arrays. The Gemma 4 GGUF bakes top_k=64 into its metadata (general.sampling.top_k), which the DFlash reduced verifier passes through to the target model's verification logits as K=64. On CUDA 12.x / pre-CCCL 3.2 builds, the CUB TopK fallback is unavailable, so K=64 hits the custom path and crashes:

argmax.cu:557: GGML_ASSERT(K <= 32) failed

Increase the register arrays and assertion to K <= 64. Shared memory usage stays well within limits (16 KB at K=64, 32 warps). Bump the CUB auto-threshold to K > 64 so the custom path covers the full top_k=64 range.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Anbeeld · 2026-05-22T20:15:19Z

Seems legit. Thanks!

github-actions bot added the Nvidia GPU and ggml labels May 22, 2026

Anbeeld merged commit 07ac3ce into Anbeeld:main May 22, 2026
8 of 47 checks passed

Hrsh-Venket mentioned this pull request May 23, 2026

Compile bug: Metal library fails to compile — block_turbo4_0 missing signs / rnorm fields referenced by quantize_turbo4_0 #30

Open

Anbeeld merged 1 commit into Anbeeld:main from chimpera:fix-argmax-topk-64
