Acceptance rate on 2.0 is very low #26

@ABLomas

I compiled the 2.0 release with CUDA (Cuda compilation tools, release 13.2, V13.2.78, Build cuda_13.2.r13.2/compiler.37668154_0).
I followed the docs exactly (I am using a single RTX 3090 24GB) on Ubuntu revolute, with these files:

  • Qwen3.6-27B-Q5_K_S.gguf
  • Qwen3.6-27B-DFlash-Q4_K_M.gguf
  • mmproj-BF16.gguf (but this one does not matter much)

Full command line (already with some tweaks; I initially tried the suggested temp):

beellama.cpp/build/bin/llama-server \
  -m "/home/models/llm/beelama/Qwen3.6-27B-Q5_K_S.gguf" \
  --mmproj "/home/models/llm/beelama/mmproj-BF16.gguf" \
  --no-mmproj-offload \
  --spec-draft-model "/home/models/llm/beelama/Qwen3.6-27B-DFlash-Q4_K_M.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 8082 \
  --host 0.0.0.0 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 134000 \
  --cache-type-k q5_0 --cache-type-v q4_1 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.4 --top-k 20 --top-p 1.0 --min-p 0.0

What I get is a very low acceptance rate (if I understand the statistics correctly): lower than 20% in all cases.
I then reduced temp to zero and increased top-k, but the result is still the same:

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task -1 | spec dm controller: adaptive=1 controller=profit probe_fraction=0.2500 explore_interval=12 profit_min=0.0500 raise=0.0500 lower=0.0500 ewma=0.1500 min_samples=3 warmup=0 baseline_interval=1024 p_min=0.0000 draft_p_min=0.0000
slot launch_slot_: id  0 | task -1 | request sampling: n_predict=-1 ignore_eos=0 stop=0 reasoning_budget=-1 temp=0.000 dynatemp_range=0.000 dynatemp_exponent=1.000 top_k=1000 top_p=1.000 min_p=0.000 typ_p=1.000 top_n_sigma=-1.000 mirostat=0 samplers=penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature reasoning_loop_guard=1 loop_min=1024 loop_window=2048 loop_period=512 loop_coverage=768 loop_interval=32 loop_interventions=1 repeat_penalty=1.000 presence_penalty=0.000 frequency_penalty=0.000 dry_multiplier=0.000 dry_allowed_length=2 adaptive_n_max=14
<...>
slot   operator(): id  0 | task 9724 | adaptive dm profit: cur=0 recommended=14 score=24.9 action=apply
reasoning-budget: deactivated (natural end)
slot   operator(): id  0 | task 9724 | adaptive dm profit: cur=14 recommended=12 score=44.6 action=apply
slot print_timing: id  0 | task 9724 |
prompt eval time =    1449.04 ms /   910 tokens (    1.59 ms per token,   628.00 tokens per second)
       eval time =   79974.83 ms /  3146 tokens (   25.42 ms per token,    39.34 tokens per second)
      total time =   81423.87 ms /  4056 tokens
draft acceptance rate = 0.14337 ( 2088 accepted / 14564 generated)
adaptive dm: fringe=0.00 n_max=12
statistics dflash: #calls(b,g,a) = 12 10703 8043, #gen drafts = 10703, #acc drafts = 8043, #gen tokens = 128319, #acc tokens = 22314, dur(b,g,a) = 0.014, 93212.216, 0.471 ms

Speed is not bad (I have a 250W power cap), but I expected a better acceptance rate.
I tried different prompts, from "daily stuff" to "fix this yaml" or "write script...". The acceptance rate varies slightly, but it is usually between 12 and 18%.
To do (still to test): increase top-k to 10K or so and increase --spec-dflash-cross-ctx. It would be better to know first, though, whether settings need to change between versions (that is, whether the provided defaults are no longer correct).
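For reference, here is a minimal Python sketch of how I am reading the acceptance numbers out of the two log lines above. The per-request rate is simply accepted / generated from the `draft acceptance rate` line. For the `statistics dflash` line I am assuming that `#gen tokens` and `#acc tokens` are cumulative drafted and accepted token counts, and that `#gen drafts` is the number of draft calls; that interpretation is my guess from the field names, not something confirmed by the docs. The function names are my own.

```python
import re

# Log lines copied verbatim from the server output above.
LOG = """\
draft acceptance rate = 0.14337 ( 2088 accepted / 14564 generated)
statistics dflash: #calls(b,g,a) = 12 10703 8043, #gen drafts = 10703, #acc drafts = 8043, #gen tokens = 128319, #acc tokens = 22314, dur(b,g,a) = 0.014, 93212.216, 0.471 ms
"""


def request_acceptance_rate(log: str) -> float:
    """Recompute the per-request rate from the 'accepted / generated' counts."""
    m = re.search(r"\(\s*(\d+) accepted /\s*(\d+) generated\)", log)
    if m is None:
        raise ValueError("no 'draft acceptance rate' line found")
    accepted, generated = map(int, m.groups())
    return accepted / generated


def cumulative_stats(log: str) -> dict:
    """Derive ratios from the 'statistics dflash' counters.

    Assumption: the counters are cumulative totals since server start.
    """
    pairs = re.findall(r"#(gen drafts|acc drafts|gen tokens|acc tokens) = (\d+)", log)
    f = {name: int(value) for name, value in pairs}
    return {
        # Share of drafted tokens that the target model accepted.
        "token_acceptance": f["acc tokens"] / f["gen tokens"],
        # Average number of tokens proposed per draft call.
        "drafted_per_call": f["gen tokens"] / f["gen drafts"],
        # Average number of accepted tokens per draft call.
        "accepted_per_call": f["acc tokens"] / f["gen drafts"],
    }


if __name__ == "__main__":
    print(f"per-request acceptance: {request_acceptance_rate(LOG):.4f}")
    for key, value in cumulative_stats(LOG).items():
        print(f"{key}: {value:.4f}")
```

Under that reading, the cumulative token acceptance comes out at roughly 17%, and only about 2 of the roughly 12 tokens drafted per call are accepted, which matches the 12 to 18% range I see per request.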
