Name and Version
$ ./llama-server --version
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 23822 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 11910 MiB
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 11911 MiB
version: 9459 (07ac3cec6)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:04:00.0 Off | N/A |
| 71% 52C P2 34W / 170W | 11482MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3080 Off | 00000000:07:00.0 On | N/A |
| 68% 57C P2 118W / 350W | 11067MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1165 G /usr/lib/Xorg 4MiB |
| 0 N/A N/A 1955 C+G /usr/bin/walker 92MiB |
| 0 N/A N/A 10440 C+G ...rack-uuid=3190708988185955192 4MiB |
| 0 N/A N/A 30847 C ...ma.cpp/build/bin/llama-server 11332MiB |
| 1 N/A N/A 1165 G /usr/lib/Xorg 36MiB |
| 1 N/A N/A 1747 G Hyprland 233MiB |
| 1 N/A N/A 1889 G Xwayland 4MiB |
| 1 N/A N/A 2316 G alacritty 54MiB |
| 1 N/A N/A 10440 G ...rack-uuid=3190708988185955192 118MiB |
| 1 N/A N/A 30847 C ...ma.cpp/build/bin/llama-server 10396MiB |
+-----------------------------------------------------------------------------------------+
Models
unsloth Qwen3.6-27B
Problem description & steps to reproduce
Ping the model
# llama-swap config
bee_qwen3.6-27b:
ttl: 1800 # Auto-unload after 5 minutes of inactivity
cmd: >
/home/fred/workspaces/ai/beellama.cpp/build/bin/llama-server
--port ${PORT}
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_S
--spec-draft-model "/home/fred/.cache/huggingface/hub/Qwen3.6-27B-DFlash-Q4_K_M.gguf"
--spec-type dflash
--spec-draft-ngl all
--jinja
--flash-attn on
--no-mmproj
--no-mmap
--mlock
--fit-target 512
--cache-type-k turbo4 --cache-type-v turbo3_tcq
--parallel 1
--kv-unified
--ctx-size 32000
filters:
stripParams: "temperature, top_p, top_k, min_p, presence_penalty, repetition_penalty"
setParamsByID:
"${MODEL_ID}:thinking":
chat_template_kwargs:
enable_thinking: true
preserve_thinking: true
reasoning_budget: 4096
temperature: 1.0
top_p: 0.95
top_k: 20
min_p: 0.05
presence_penalty: 1.5
repetition_penalty: 1.0
"${MODEL_ID}:thinking-coding":
chat_template_kwargs:
enable_thinking: true
preserve_thinking: true
reasoning_budget: 4096
temperature: 0.6
top_p: 0.95
top_k: 20
min_p: 0.0
presence_penalty: 0.0
repetition_penalty: 1.0
"${MODEL_ID}:instruct":
chat_template_kwargs:
enable_thinking: false
preserve_thinking: false
reasoning_budget: 4096
temperature: 0.7
top_p: 0.8
top_k: 20
min_p: 0.0
presence_penalty: 1.5
repetition_penalty: 1.0
"${MODEL_ID}:instruct-reasoning":
chat_template_kwargs:
enable_thinking: false
preserve_thinking: false
reasoning_budget: 4096
temperature: 1.0
top_p: 0.95
top_k: 20
min_p: 0.0
presence_penalty: 1.5
repetition_penalty: 1.0
First Bad Commit
No response
Relevant log output
Logs
- the tokens for sequence 0 in the input batch have a starting position of Y = 1360
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 259
- the tokens for sequence 0 in the input batch have a starting position of Y = 1361
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 259
- the tokens for sequence 0 in the input batch have a starting position of Y = 1362
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
Name and Version
$ ./llama-server --version ggml_cuda_init: found 2 CUDA devices (Total VRAM: 23822 MiB): Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 11910 MiB Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 11911 MiB version: 9459 (07ac3cec6) built with GNU 15.2.1 for Linux x86_64Operating systems
Linux
GGML backends
CUDA
Hardware
Models
unsloth Qwen3.6-27B
Problem description & steps to reproduce
Ping the model
First Bad Commit
No response
Relevant log output
Logs