Loading HQQ quantized models is broken since #35926 #37263

Open
mobicham opened this issue Apr 3, 2025 · 5 comments

@mobicham
Contributor

mobicham commented Apr 3, 2025

System Info

  • transformers version: 4.51.0.dev0
  • Platform: Linux-5.4.0-208-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.30.1
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Loading HQQ models has been broken since #35926.
Not sure what changed; probably something in modeling_utils.
cc @SunMarc @ArthurZucker

Reproduction

import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

compute_dtype = torch.bfloat16
model_id = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load model
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    device_map="cuda",
)

This fails with:

AttributeError: `language_model.model.layers.46.self_attn.o_proj.quant_scale` is neither a parameter nor a buffer.

Expected behavior

HQQ quantized models were loading fine before #35926

@mobicham mobicham added the bug label Apr 3, 2025
@Rocketknight1
Member

Rocketknight1 commented Apr 4, 2025

cc @MekkCyber @SunMarc

@Cyrilvallez
Member

The fix is here: #37347. It was indeed introduced when adding Deepseek!
Thanks a lot for the report! 🤗

@mobicham
Contributor Author

mobicham commented Apr 7, 2025

Thank you @Cyrilvallez ! Seems to work!
I'm not sure if it's related, but when I use Gemma3ForCausalLM instead of Gemma3ForConditionalGeneration, it just hangs indefinitely:

import torch
from transformers import Gemma3ForCausalLM
model = Gemma3ForCausalLM.from_pretrained(
    'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf',
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda",
)

@Cyrilvallez
Member

It's because it's not a Gemma3ForCausalLM, it's a Gemma3ForConditionalGeneration! Though the hanging is surprising, I agree, and it happens at a weird place.
You can still use AutoModelForCausalLM, which will resolve the Gemma3ForConditionalGeneration correctly 😉
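
For reference, a minimal sketch of that workaround, assuming the same model id and settings as the snippet above:

import torch
from transformers import AutoModelForCausalLM

# AutoModelForCausalLM resolves the right architecture (Gemma3ForConditionalGeneration)
# from the checkpoint's config instead of forcing Gemma3ForCausalLM.
model = AutoModelForCausalLM.from_pretrained(
    'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf',
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda",
)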

@mobicham
Copy link
Contributor Author

mobicham commented Apr 9, 2025

Sorry, but there's still a problem loading HQQ quantized models. I noticed that the ones that have a bias no longer load:

https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797

It is related to this old commit: 4b5cf54

The test would have failed if the test file had used a model with a bias, such as facebook/opt-125m.
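
For context, a minimal sketch of how the bias case could be exercised; the quantization settings and the save/reload flow here are assumptions for illustration, not taken from the gist:

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# facebook/opt-125m has bias terms in its linear layers, which is the case
# reported above. nbits/group_size are illustrative values.
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Saving and then reloading the quantized checkpoint goes through the loading
# path that reportedly breaks for layers with a bias.
model.save_pretrained("opt-125m-hqq")
reloaded = AutoModelForCausalLM.from_pretrained("opt-125m-hqq", device_map="cuda")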
