
Slow token generation speed of Gemma 3 QAT Models #13048

Open

shahizat opened this issue Apr 21, 2025 · 1 comment

Comments

@shahizat

Hello,

Could you please advise on how to correctly run llama.cpp with tensor parallelism set to 4 and full GPU support? I have four GPUs with 48 GB of VRAM each.

I am using the following command:

build/bin/llama-server \
  --model ~/.cache/huggingface/hub/models--google--gemma-3-27b-it-qat-q4_0-gguf/snapshots/17cf0f6ad611f1a57a1640daa57eb427d6e67ed6/gemma-3-27b-it-q4_0.gguf \
  --threads 16 \
  --prio 2 \
  --temp 0.6 \
  --ctx-size 4096 \
  --seed 3407 \
  --n-gpu-layers 256 \
  --port 8080
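
From the llama.cpp docs, there does not seem to be a literal "tensor parallel" flag; the closest options appear to be --split-mode (layer or row) and --tensor-split, which control how the model is distributed across the visible GPUs. Is something like the following sketch the right way to spread the load evenly over all four GPUs? (The short model path is just a placeholder for the full snapshot path above.)

build/bin/llama-server \
  --model gemma-3-27b-it-q4_0.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  --ctx-size 4096 \
  --port 8080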

For speculative decoding, I am using the following command:

build/bin/llama-server \
  -m ~/.cache/huggingface/hub/models--google--gemma-3-27b-it-qat-q4_0-gguf/snapshots/17cf0f6ad611f1a57a1640daa57eb427d6e67ed6/gemma-3-27b-it-q4_0.gguf \
  -md /home/admin2/.cache/huggingface/hub/models--google--gemma-3-1b-it-qat-q4_0-gguf/snapshots/d1be121d36172a4b0b964657e2ee859d61138593/gemma-3-1b-it-q4_0.gguf \
  -c 4096 \
  -cd 4096 \
  -ngl 99 \
  -ngld 99 \
  --draft-max 8 \
  --draft-min 4 \
  --draft-p-min 0.9 \
  --host 0.0.0.0 \
  --port 8080
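
To quantify the slowdown, I measure generation speed through the server's /completion endpoint; the response includes a timings object with a predicted_per_second field (field names as of recent llama.cpp builds; they may differ on other versions):

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 128}' | jq .timings
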
@betweenus

Hi. This may be related to #12968. Also, the 1B QAT models may be broken, so it is recommended to try the GGUFs from bartowski instead.
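
For example (the repo and file name below are a guess; check bartowski's Hugging Face page for the exact Gemma 3 GGUF repositories and quant filenames):

huggingface-cli download bartowski/google_gemma-3-27b-it-GGUF \
  google_gemma-3-27b-it-Q4_0.gguf --local-dir ./models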
