
Slow token generation speed of Gemma 3 QAT Models #13048

Open

shahizat opened this issue Apr 21, 2025 · 1 comment

Comments

@shahizat

Hello,

Could you please advise on how to correctly run llama.cpp with tensor parallelism set to 4 and full GPU support? I have four GPUs with 48 GB of VRAM each.

I am using the following command:

build/bin/llama-server \
  --model ~/.cache/huggingface/hub/models--google--gemma-3-27b-it-qat-q4_0-gguf/snapshots/17cf0f6ad611f1a57a1640daa57eb427d6e67ed6/gemma-3-27b-it-q4_0.gguf \
  --threads 16 \
  --prio 2 \
  --temp 0.6 \
  --ctx-size 4096 \
  --seed 3407 \
  --n-gpu-layers 256 \
  --port 8080
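
From the llama.cpp docs, there does not seem to be a literal "tensor parallel" flag; the closest options appear to be --split-mode (layer or row) and --tensor-split, which control how the model is distributed across the visible GPUs. Is something like the following sketch the right way to spread the load evenly over all four GPUs? (The short model path is just a placeholder for the full snapshot path above.)

build/bin/llama-server \
  --model gemma-3-27b-it-q4_0.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  --ctx-size 4096 \
  --port 8080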

For speculative decoding, I am using the following command:

build/bin/llama-server \
  -m ~/.cache/huggingface/hub/models--google--gemma-3-27b-it-qat-q4_0-gguf/snapshots/17cf0f6ad611f1a57a1640daa57eb427d6e67ed6/gemma-3-27b-it-q4_0.gguf \
  -md /home/admin2/.cache/huggingface/hub/models--google--gemma-3-1b-it-qat-q4_0-gguf/snapshots/d1be121d36172a4b0b964657e2ee859d61138593/gemma-3-1b-it-q4_0.gguf \
  -c 4096 \
  -cd 4096 \
  -ngl 99 \
  -ngld 99 \
  --draft-max 8 \
  --draft-min 4 \
  --draft-p-min 0.9 \
  --host 0.0.0.0 \
  --port 8080
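
To quantify the slowdown, I measure generation speed through the server's /completion endpoint; the response includes a timings object with a predicted_per_second field (field names as of recent llama.cpp builds; they may differ on other versions):

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 128}' | jq .timings
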
@betweenus

Hi. This may be related to #12968. Also, the 1B QAT models may be broken, so it is recommended to try the GGUFs from bartowski instead.
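
For example (the repo and file name below are a guess; check bartowski's Hugging Face page for the exact Gemma 3 GGUF repositories and quant filenames):

huggingface-cli download bartowski/google_gemma-3-27b-it-GGUF \
  google_gemma-3-27b-it-Q4_0.gguf --local-dir ./models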
