Loading HQQ quantized models is broken since #35926 #37263

Open
mobicham opened this issue Apr 3, 2025 · 5 comments

@mobicham
Contributor

mobicham commented Apr 3, 2025

System Info

  • transformers version: 4.51.0.dev0
  • Platform: Linux-5.4.0-208-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.30.1
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Loading HQQ models has been broken since #35926.
Not sure what changed; probably something in modeling_utils.
cc @SunMarc @ArthurZucker

Reproduction

import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

compute_dtype = torch.bfloat16
model_id = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load model
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    device_map="cuda",
)

This fails with:

AttributeError: `language_model.model.layers.46.self_attn.o_proj.quant_scale` is neither a parameter nor a buffer.

Expected behavior

HQQ quantized models were loading fine before #35926

@mobicham mobicham added the bug label Apr 3, 2025
@Rocketknight1
Member

Rocketknight1 commented Apr 4, 2025

cc @MekkCyber @SunMarc

@Cyrilvallez
Member

The fix is here: #37347. It was indeed introduced when adding Deepseek!
Thanks a lot for the report! 🤗

@mobicham
Contributor Author

mobicham commented Apr 7, 2025

Thank you @Cyrilvallez ! Seems to work!
I'm not sure if it's related, but when I use Gemma3ForCausalLM instead of Gemma3ForConditionalGeneration, it just hangs indefinitely:

import torch
from transformers import Gemma3ForCausalLM
model = Gemma3ForCausalLM.from_pretrained(
    'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf',
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda",
)

@Cyrilvallez
Member

It's because it's not a Gemma3ForCausalLM, it's a Gemma3ForConditionalGeneration! Though the hanging is surprising, I agree, and it happens at a weird place.
You can still use AutoModelForCausalLM, which will resolve the Gemma3ForConditionalGeneration correctly 😉
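
For reference, a minimal sketch of that workaround, assuming the same model id and settings as the snippet above:

import torch
from transformers import AutoModelForCausalLM

# AutoModelForCausalLM resolves the right architecture (Gemma3ForConditionalGeneration)
# from the checkpoint's config instead of forcing Gemma3ForCausalLM.
model = AutoModelForCausalLM.from_pretrained(
    'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf',
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda",
)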

@mobicham
Copy link
Contributor Author

mobicham commented Apr 9, 2025

Sorry, but there's still a problem loading HQQ quantized models. I noticed that the ones that have a bias no longer load:

https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797

It is related to this old commit: 4b5cf54

The test would have failed if the test file had used a model with a bias, such as facebook/opt-125m.
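
For context, a minimal sketch of how the bias case could be exercised; the quantization settings and the save/reload flow here are assumptions for illustration, not taken from the gist:

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# facebook/opt-125m has bias terms in its linear layers, which is the case
# reported above. nbits/group_size are illustrative values.
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Saving and then reloading the quantized checkpoint goes through the loading
# path that reportedly breaks for layers with a bias.
model.save_pretrained("opt-125m-hqq")
reloaded = AutoModelForCausalLM.from_pretrained("opt-125m-hqq", device_map="cuda")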
