Compiled the 2.0 release with CUDA compilation tools, release 13.2, V13.2.78 (build cuda_13.2.r13.2/compiler.37668154_0).
Followed the docs exactly (since I'm using a single RTX 3090 24GB) on Ubuntu revolute, with these models:
- Qwen3.6-27B-Q5_K_S.gguf
- Qwen3.6-27B-DFlash-Q4_K_M.gguf
- mmproj-BF16.gguf (but this one does not matter much)
Full command line (already with some tweaks; initially I tried the suggested temp):
beellama.cpp/build/bin/llama-server \
-m "/home/models/llm/beelama/Qwen3.6-27B-Q5_K_S.gguf" \
--mmproj "/home/models/llm/beelama/mmproj-BF16.gguf" \
--no-mmproj-offload \
--spec-draft-model "/home/models/llm/beelama/Qwen3.6-27B-DFlash-Q4_K_M.gguf" \
--spec-type dflash \
--spec-dflash-cross-ctx 1024 \
--port 8082 \
--host 0.0.0.0 \
-np 1 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 -ub 512 \
--ctx-size 134000 \
--cache-type-k q5_0 --cache-type-v q4_1 \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-mmap --mlock \
--no-host \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.4 --top-k 20 --top-p 1.0 --min-p 0.0
What I get is a very low acceptance rate (if I'm reading the statistics correctly): below 20% in all cases.
Then I tried reducing temp to zero and increasing top-k, but the rate is still low:
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task -1 | spec dm controller: adaptive=1 controller=profit probe_fraction=0.2500 explore_interval=12 profit_min=0.0500 raise=0.0500 lower=0.0500 ewma=0.1500 min_samples=3 warmup=0 baseline_interval=1024 p_min=0.0000 draft_p_min=0.0000
slot launch_slot_: id 0 | task -1 | request sampling: n_predict=-1 ignore_eos=0 stop=0 reasoning_budget=-1 temp=0.000 dynatemp_range=0.000 dynatemp_exponent=1.000 top_k=1000 top_p=1.000 min_p=0.000 typ_p=1.000 top_n_sigma=-1.000 mirostat=0 samplers=penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature reasoning_loop_guard=1 loop_min=1024 loop_window=2048 loop_period=512 loop_coverage=768 loop_interval=32 loop_interventions=1 repeat_penalty=1.000 presence_penalty=0.000 frequency_penalty=0.000 dry_multiplier=0.000 dry_allowed_length=2 adaptive_n_max=14
<...>
slot operator(): id 0 | task 9724 | adaptive dm profit: cur=0 recommended=14 score=24.9 action=apply
reasoning-budget: deactivated (natural end)
slot operator(): id 0 | task 9724 | adaptive dm profit: cur=14 recommended=12 score=44.6 action=apply
slot print_timing: id 0 | task 9724 |
prompt eval time = 1449.04 ms / 910 tokens ( 1.59 ms per token, 628.00 tokens per second)
eval time = 79974.83 ms / 3146 tokens ( 25.42 ms per token, 39.34 tokens per second)
total time = 81423.87 ms / 4056 tokens
draft acceptance rate = 0.14337 ( 2088 accepted / 14564 generated)
adaptive dm: fringe=0.00 n_max=12
statistics dflash: #calls(b,g,a) = 12 10703 8043, #gen drafts = 10703, #acc drafts = 8043, #gen tokens = 128319, #acc tokens = 22314, dur(b,g,a) = 0.014, 93212.216, 0.471 ms
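To double-check that I'm reading the statistics correctly, here is a minimal Python sketch that recomputes the ratios from the two log lines above. The function names are my own, and the regexes only assume the line formats exactly as pasted here. I'm also not sure whether the `statistics dflash` counters cover the same scope as the per-request `draft acceptance rate` line, so the two ratios are computed separately.

```python
import re

# The two relevant lines, copied verbatim from the server log above.
LOG = """\
draft acceptance rate = 0.14337 ( 2088 accepted / 14564 generated)
statistics dflash: #calls(b,g,a) = 12 10703 8043, #gen drafts = 10703, #acc drafts = 8043, #gen tokens = 128319, #acc tokens = 22314, dur(b,g,a) = 0.014, 93212.216, 0.471 ms
"""

def request_rate(log: str) -> float:
    """Accepted / generated draft tokens from the 'draft acceptance rate' line."""
    m = re.search(r"\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)", log)
    if m is None:
        raise ValueError("no 'draft acceptance rate' line found")
    return int(m.group(1)) / int(m.group(2))

def dflash_rate(log: str) -> float:
    """Accepted / generated tokens from the 'statistics dflash' line."""
    gen = re.search(r"#gen tokens = (\d+)", log)
    acc = re.search(r"#acc tokens = (\d+)", log)
    if gen is None or acc is None:
        raise ValueError("no 'statistics dflash' counters found")
    return int(acc.group(1)) / int(gen.group(1))

if __name__ == "__main__":
    print(f"per-request acceptance: {request_rate(LOG):.2%}")
    print(f"dflash token acceptance: {dflash_rate(LOG):.2%}")
```

Both ratios come out below 20%, which matches what the server reports.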
Speed is not bad (I have a 250W power cap), but I expected a better acceptance rate.
I tried different prompts, from "daily stuff" to "fix this yaml" or "write script...". The acceptance rate varies slightly, but it is usually between 12% and 18%.
To do (still to test): increase top-k to 10K or so, and increase --spec-dflash-cross-ctx. But it would be better to know whether settings need to change between versions (and whether the provided defaults are no longer correct).