Name and Version
$ llama-cli --version
version: 5142 (80f19b4)
built with AMD clang version 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24) for x86_64-unknown-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-speculative-simple -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -p "Repeat this sentence for 50 times: This is a test\n" --color -s 123 -n 260 -t 1 -fa -ctv q8_0 -ctk q8_0
llama-server -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -t 1 -fa -ctv q8_0 -ctk q8_0
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"max_tokens": 300, "messages": [{"role": "user", "content": "Repeat this sentence for 50 times: This is a test"}]}'
Problem description & steps to reproduce
When testing the best-case scenario for speculative decoding (repeating the same content), llama-server drafts fewer tokens and performs worse than llama-speculative-simple. Launching both programs with the same arguments and generating 261 tokens, llama-server drafts only 154-156 tokens (~59% of the output, across two runs) compared to 195 (~75%) for llama-speculative-simple, with a similar acceptance rate (almost 100%). As a result, llama-server's generation speed is around 10-15% lower than llama-speculative-simple's.
llama-server run:
ROCR_VISIBLE_DEVICES=1 ./build/bin/llama-server -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -t 1 -fa -ctv q8_0 -ctk q8_0
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 300, "messages": [{"role": "user", "content": "Repeat this sentence for 50 times: This is a test"}]}'
Responses (2 examples):
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test."}}],"created":1744767822,"model":"gpt-3.5-turbo","system_fingerprint":"b5142-80f19b41","object":"chat.completion","usage":{"completion_tokens":261,"prompt_tokens":21,"total_tokens":282},"id":"chatcmpl-tojEY2jOFzBWbF2V0cHcPWMuryt4DOEI","timings":{"prompt_n":21,"prompt_ms":221.093,"prompt_per_token_ms":10.528238095238095,"prompt_per_second":94.9826543581208,"predicted_n":261,"predicted_ms":8471.726,"predicted_per_token_ms":32.45872030651341,"predicted_per_second":30.808361837953683,"draft_n":156,"draft_n_accepted":156}}
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test."}}],"created":1744767842,"model":"gpt-3.5-turbo","system_fingerprint":"b5142-80f19b41","object":"chat.completion","usage":{"completion_tokens":261,"prompt_tokens":21,"total_tokens":282},"id":"chatcmpl-uVHH0019M7QOf9byfNr1STS2C1JceH2w","timings":{"prompt_n":1,"prompt_ms":70.998,"prompt_per_token_ms":70.998,"prompt_per_second":14.084903800107044,"predicted_n":261,"predicted_ms":8011.701,"predicted_per_token_ms":30.696172413793104,"predicted_per_second":32.57735155118744,"draft_n":159,"draft_n_accepted":154}}
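To make the long responses above easier to compare, the relevant counters can be pulled out of the timings object with jq (assuming jq is installed; the field names are exactly those shown in the responses):

curl -s localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 300, "messages": [{"role": "user", "content": "Repeat this sentence for 50 times: This is a test"}]}' \
  | jq '.timings | {predicted_n, predicted_per_second, draft_n, draft_n_accepted}'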
llama-speculative-simple run:
ROCR_VISIBLE_DEVICES=1 ./build/bin/llama-speculative-simple -c 512 -cd 512 -m ~/models/qwen2.5-72b-iq4xs.gguf -md ~/models/qwen2.5-0.5b-iq4xs.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -p "Repeat this sentence for 50 times: This is a test\n" --color -s 123 -n 260 -t 1 -fa -ctv q8_0 -ctk q8_0

n_draft   = 3
n_predict = 261
n_drafted = 195
n_accept  = 195
accept    = 100.000%
I collected verbose logs from both programs. llama-speculative-simple keeps invoking common_speculative_gen_draft in a loop, while llama-server always runs a plain decode first (decoding batch, n_tokens = 1) to generate a single token without speculative decoding, then calls common_speculative_gen_draft so that the next decode processes another 4 tokens, then goes back to the single-token decode again.
I am not sure whether this is the expected behavior of llama-server, since its code is a lot more complex than llama-speculative-simple's, but the differing behavior matches what I read in server.cpp and speculative-simple.cpp; see the counting sketch below.
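For illustration, here is a toy counting model of the two schedules described above. This is not llama.cpp code, just a sketch assuming every draft token is accepted (as in this best-case test) and a fixed draft size of 3:

#include <cstdio>

int main() {
    const int n_total = 261; // tokens generated in the runs above
    const int n_draft = 3;   // --draft-max 3 --draft-min 3

    // llama-speculative-simple: every target decode verifies n_draft drafted
    // tokens and yields one bonus token, i.e. n_draft + 1 tokens per call.
    int produced = 0, calls = 0, drafted = 0;
    while (produced < n_total) {
        produced += n_draft + 1;
        drafted  += n_draft;
        calls    += 1;
    }
    std::printf("speculative-simple: %d target decodes, %d drafted tokens\n", calls, drafted);

    // llama-server as observed in the logs: a single-token decode, then a
    // draft+verify decode, alternating.
    produced = 0; calls = 0; drafted = 0;
    while (produced < n_total) {
        produced += 1; calls += 1;            // plain decode, n_tokens = 1
        if (produced >= n_total) break;
        produced += n_draft + 1; calls += 1;  // verify decode (n_draft + 1 tokens)
        drafted  += n_draft;
    }
    std::printf("llama-server:       %d target decodes, %d drafted tokens\n", calls, drafted);
    return 0;
}

Its drafted-token totals (198 vs 156) roughly match the observed 195 vs 154-156, and the extra single-token decodes (about 105 target passes instead of 66) would account for llama-server coming out slower.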
First Bad Commit
No response
Relevant log output
Detailed logs of llama-server
Detailed logs of llama-speculative-simple