
Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple #12968


Open
hjc4869 opened this issue Apr 16, 2025 · 3 comments

@hjc4869 (Contributor) commented Apr 16, 2025

Name and Version

$ llama-cli --version
version: 5142 (80f19b4)
built with AMD clang version 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24) for x86_64-unknown-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-speculative-simple -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -p "Repeat this sentence for 50 times: This is a test\n" --color -s 123 -n 260 -t 1 -fa -ctv q8_0 -ctk q8_0

llama-server -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -t 1 -fa -ctv q8_0 -ctk q8_0

curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 300, "messages": [{"role": "user", "content": "Repeat this sentence for 50 times: This is a test"}]}'

Problem description & steps to reproduce

When testing the best-case scenario for speculative decoding (repeating the same content), llama-server generates fewer drafted tokens and performs worse than llama-speculative-simple. Launching both programs with the same arguments and generating 261 tokens, llama-server reports only 154/156 accepted draft tokens compared to 195 for llama-speculative-simple, at a similar acceptance rate (almost 100%). As a result, llama-server's throughput is around 10%-15% lower than llama-speculative-simple's.
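
As a rough back-of-envelope check (my own arithmetic, assuming a fixed draft of 3 tokens and ~100% acceptance): in llama-speculative-simple each speculative step emits 4 tokens (3 accepted drafts plus 1 token sampled by the target), so 261 tokens take about 65 steps and 65 x 3 = 195 drafted tokens, which matches exactly. llama-server's ~156 drafted tokens over the same 261 emitted works out to roughly 3 drafted per 5 tokens, as if one extra non-drafted token were generated per speculative cycle.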

ROCR_VISIBLE_DEVICES=1 ./build/bin/llama-server -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -t 1 -fa -ctv q8_0 -ctk q8_0

curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 300, "messages": [{"role": "user", "content": "Repeat this sentence for 50 times: This is a test"}]}'

Responses (2 examples):
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test."}}],"created":1744767822,"model":"gpt-3.5-turbo","system_fingerprint":"b5142-80f19b41","object":"chat.completion","usage":{"completion_tokens":261,"prompt_tokens":21,"total_tokens":282},"id":"chatcmpl-tojEY2jOFzBWbF2V0cHcPWMuryt4DOEI","timings":{"prompt_n":21,"prompt_ms":221.093,"prompt_per_token_ms":10.528238095238095,"prompt_per_second":94.9826543581208,"predicted_n":261,"predicted_ms":8471.726,"predicted_per_token_ms":32.45872030651341,"predicted_per_second":30.808361837953683,"draft_n":156,"draft_n_accepted":156}}

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test."}}],"created":1744767842,"model":"gpt-3.5-turbo","system_fingerprint":"b5142-80f19b41","object":"chat.completion","usage":{"completion_tokens":261,"prompt_tokens":21,"total_tokens":282},"id":"chatcmpl-uVHH0019M7QOf9byfNr1STS2C1JceH2w","timings":{"prompt_n":1,"prompt_ms":70.998,"prompt_per_token_ms":70.998,"prompt_per_second":14.084903800107044,"predicted_n":261,"predicted_ms":8011.701,"predicted_per_token_ms":30.696172413793104,"predicted_per_second":32.57735155118744,"draft_n":159,"draft_n_accepted":154}}

ROCR_VISIBLE_DEVICES=1 ./build/bin/llama-speculative-simple -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -p "Repeat this sentence for 50 times: This is a test\n" --color -s 123 -n 260 -t 1 -fa -ctv q8_0 -ctk q8_0

n_draft   = 3
n_predict = 261
n_drafted = 195
n_accept  = 195
accept    = 100.000%

First Bad Commit

No response

Relevant log output

Detailed logs of llama-server

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 5142 (80f19b41) with AMD clang version 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24) for x86_64-unknown-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 64

system_info: n_threads = 1 (n_threads_batch = 1) / 64 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 63
main: loading model
srv    load_model: loading model '/home/david/models/qwen2.5-72b-iq4xs.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon PRO W7900 Dual Slot ) - 49086 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 963 tensors from /home/david/models/qwen2.5-72b-iq4xs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 72B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 72B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 72B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-72B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 80
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 8192
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 29568
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 64
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 30
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_1:   10 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   70 tensors
llama_model_loader: - type iq4_xs:  401 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 37.40 GiB (4.42 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 29568
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 72.71 B
print_info: general.name     = Qwen2.5 72B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        ROCm0 model buffer size = 37665.82 MiB
load_tensors:   CPU_Mapped model buffer size =   631.12 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 512, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 80, can_shift = 1
init:      ROCm0 KV buffer size =    85.00 MiB
llama_context: KV self size  =   85.00 MiB, K (q8_0):   42.50 MiB, V (q8_0):   42.50 MiB
llama_context:      ROCm0 compute buffer size =   313.00 MiB
llama_context:  ROCm_Host compute buffer size =    17.01 MiB
llama_context: graph nodes  = 2647
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: loading draft model '/home/david/models/qwen2.5-0.5b-iq4xs.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon PRO W7900 Dual Slot ) - 10848 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 290 tensors from /home/david/models/qwen2.5-0.5b-iq4xs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 0.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-0.5B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 24
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 30
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_1:   24 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q5_K:    3 tensors
llama_model_loader: - type iq4_nl:  120 tensors
llama_model_loader: - type iq4_xs:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 329.49 MiB (5.59 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 494.03 M
print_info: general.name     = Qwen2.5 0.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:        ROCm0 model buffer size =   329.52 MiB
load_tensors:   CPU_Mapped model buffer size =   137.94 MiB
...........................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init:      ROCm0 KV buffer size =     6.00 MiB
llama_context: KV self size  =    6.00 MiB, K (f16):    3.00 MiB, V (f16):    3.00 MiB
llama_context:      ROCm0 compute buffer size =   298.50 MiB
llama_context:  ROCm_Host compute buffer size =     2.76 MiB
llama_context: graph nodes  = 799
llama_context: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init:      ROCm0 KV buffer size =     6.00 MiB
llama_context: KV self size  =    6.00 MiB, K (f16):    3.00 MiB, V (f16):    3.00 MiB
llama_context:      ROCm0 compute buffer size =   298.50 MiB
llama_context:  ROCm_Host compute buffer size =     2.76 MiB
llama_context: graph nodes  = 799
llama_context: graph splits = 2
slot         init: id  0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 21
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 21, n_tokens = 21, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 21, n_tokens = 21
slot      release: id  0 | task 0 | stop processing: n_past = 281, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     221.09 ms /    21 tokens (   10.53 ms per token,    94.98 tokens per second)
       eval time =    8471.73 ms /   261 tokens (   32.46 ms per token,    30.81 tokens per second)
      total time =    8692.82 ms /   282 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 1.00000 (  156 accepted /   156 generated)
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 54 | processing task
slot update_slots: id  0 | task 54 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 21
slot update_slots: id  0 | task 54 | need to evaluate at least 1 token to generate logits, n_past = 21, n_prompt_tokens = 21
slot update_slots: id  0 | task 54 | kv cache rm [20, end)
slot update_slots: id  0 | task 54 | prompt processing progress, n_past = 21, n_tokens = 1, progress = 0.047619
slot update_slots: id  0 | task 54 | prompt done, n_past = 21, n_tokens = 1
slot      release: id  0 | task 54 | stop processing: n_past = 281, truncated = 0
slot print_timing: id  0 | task 54 | 
prompt eval time =      71.00 ms /     1 tokens (   71.00 ms per token,    14.08 tokens per second)
       eval time =    8011.70 ms /   261 tokens (   30.70 ms per token,    32.58 tokens per second)
      total time =    8082.70 ms /   262 tokens
slot print_timing: id  0 | task 54 | 
draft acceptance rate = 0.96855 (  154 accepted /   159 generated)
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Detailed logs of llama-speculative-simple

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon PRO W7900 Dual Slot , gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 5142 (80f19b41) with AMD clang version 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24) for x86_64-unknown-linux-gnu
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon PRO W7900 Dual Slot ) - 49086 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 963 tensors from /home/david/models/qwen2.5-72b-iq4xs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 72B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 72B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 72B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-72B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 80
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 8192
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 29568
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 64
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 30
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_1:   10 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   70 tensors
llama_model_loader: - type iq4_xs:  401 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 37.40 GiB (4.42 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 29568
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 72.71 B
print_info: general.name     = Qwen2.5 72B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        ROCm0 model buffer size = 37665.82 MiB
load_tensors:   CPU_Mapped model buffer size =   631.12 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 512, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 80, can_shift = 1
init:      ROCm0 KV buffer size =    85.00 MiB
llama_context: KV self size  =   85.00 MiB, K (q8_0):   42.50 MiB, V (q8_0):   42.50 MiB
llama_context:      ROCm0 compute buffer size =   313.00 MiB
llama_context:  ROCm_Host compute buffer size =    17.01 MiB
llama_context: graph nodes  = 2647
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon PRO W7900 Dual Slot ) - 10848 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 290 tensors from /home/david/models/qwen2.5-0.5b-iq4xs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 0.5B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-0.5B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 24
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 30
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_1:   24 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q5_K:    3 tensors
llama_model_loader: - type iq4_nl:  120 tensors
llama_model_loader: - type iq4_xs:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 329.49 MiB (5.59 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 494.03 M
print_info: general.name     = Qwen2.5 0.5B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:        ROCm0 model buffer size =   329.52 MiB
load_tensors:   CPU_Mapped model buffer size =   137.94 MiB
...........................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 512, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 24, can_shift = 1
init:      ROCm0 KV buffer size =     3.19 MiB
llama_context: KV self size  =    3.19 MiB, K (q8_0):    1.59 MiB, V (q8_0):    1.59 MiB
llama_context:      ROCm0 compute buffer size =   300.25 MiB
llama_context:  ROCm_Host compute buffer size =     4.51 MiB
llama_context: graph nodes  = 799
llama_context: graph splits = 50
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)


Repeat this sentence for 50 times: This is a test
This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This

encoded   14 tokens in    0.012 seconds, speed: 1163.370 t/s
decoded  261 tokens in    7.355 seconds, speed:   35.484 t/s

n_draft   = 3
n_predict = 261
n_drafted = 195
n_accept  = 195
accept    = 100.000%

draft:

llama_perf_context_print:        load time =     161.07 ms
llama_perf_context_print: prompt eval time =    1006.70 ms /   142 tokens (    7.09 ms per token,   141.05 tokens per second)
llama_perf_context_print:        eval time =     943.49 ms /   131 runs   (    7.20 ms per token,   138.85 tokens per second)
llama_perf_context_print:       total time =    7369.47 ms /   273 tokens

target:

llama_perf_sampler_print:    sampling time =      12.24 ms /   261 runs   (    0.05 ms per token, 21327.01 tokens per second)
llama_perf_context_print:        load time =    6076.40 ms
llama_perf_context_print: prompt eval time =    5327.03 ms /   273 tokens (   19.51 ms per token,    51.25 tokens per second)
llama_perf_context_print:        eval time =      64.45 ms /     1 runs   (   64.45 ms per token,    15.52 tokens per second)
llama_perf_context_print:       total time =    7530.74 ms /   274 tokens
@ggerganov (Member) commented:
If you enable verbose output with -lv 1, can you figure out what is causing the discrepancy?

@hjc4869 (Contributor, Author) commented Apr 16, 2025

I collected the verbose logs from both programs. It seems that llama-speculative-simple keeps invoking common_speculative_gen_draft in a loop, while llama-server always decodes a batch with n_tokens = 1 first to generate a single token without speculative decoding, then calls common_speculative_gen_draft to generate another 4, then goes back to generating a single token again.

I'm not sure whether this is the expected behavior of llama-server, as its code is a lot more complex than llama-speculative-simple's, but the difference in behavior matches what I read in server.cpp and speculative-simple.cpp.
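
To illustrate the difference, here is a hypothetical back-of-envelope model (my own sketch, not llama.cpp code) of the two scheduling patterns, assuming a fixed draft of 3 tokens and 100% acceptance:

#include <cstdio>

int main() {
    const int n_predict = 261; // tokens generated in both tests
    const int n_draft   = 3;   // --draft-max / --draft-min

    // llama-speculative-simple: every target decode also verifies a draft,
    // so each call emits n_draft accepted tokens plus 1 sampled token.
    int drafted = 0, emitted = 0;
    while (emitted < n_predict) {
        drafted += n_draft;
        emitted += n_draft + 1;
    }
    printf("simple-like loop: %d drafted / %d emitted\n", drafted, emitted);

    // llama-server (pattern observed in the verbose logs): each cycle first
    // runs a plain n_tokens = 1 decode without a draft, then a speculative
    // decode that verifies the draft.
    drafted = 0; emitted = 0;
    while (emitted < n_predict) {
        emitted += 1;           // single-token decode, no draft attached
        drafted += n_draft;     // draft generated for the next decode
        emitted += n_draft + 1; // speculative decode: 3 drafts + 1 token
    }
    printf("server-like loop: %d drafted / %d emitted\n", drafted, emitted);
    return 0;
}

This prints 198 drafted for the simple-like loop and 159 for the server-like loop, close to the observed 195 and 156/159 (the final step is truncated in a real run). The server-like loop also spends two target decodes per 5 emitted tokens instead of one per 4, which helps explain the throughput gap above.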

Attached log files:
llama-server.log
llama-speculative-simple.log

@ggerganov (Member) commented:
Right, the server implementation can be improved to do this as efficiently as the simple example. For now, it was easier to do it like this.
