Description
Originally reported in #1464
The Observer class and compressed-tensors do not currently have an implementation for BLOCK-wise quantization (see code starting here).
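For context, block-wise quantization keeps one scale (and zero point, when asymmetric) per fixed-size tile of the weight matrix, e.g. 128x128, rather than per tensor, channel, or group. The sketch below is purely illustrative of the per-block statistics an observer would need to produce; the helper name and shapes are not part of either library:

```python
import torch

# Illustrative sketch only: compute per-128x128-block min/max and derive an
# asymmetric 4-bit scale/zero-point for each block of a 2-D weight matrix.
def blockwise_qparams(weight: torch.Tensor, block: int = 128, num_bits: int = 4):
    qmin, qmax = 0, 2**num_bits - 1
    rows, cols = weight.shape
    n_rb = (rows + block - 1) // block  # number of row blocks
    n_cb = (cols + block - 1) // block  # number of column blocks
    scales = torch.empty(n_rb, n_cb, dtype=torch.float32)
    zero_points = torch.empty(n_rb, n_cb, dtype=torch.int32)
    for i in range(n_rb):
        for j in range(n_cb):
            tile = weight[i * block:(i + 1) * block, j * block:(j + 1) * block]
            t_min, t_max = tile.min(), tile.max()
            scale = (t_max - t_min).clamp(min=1e-8) / (qmax - qmin)
            scales[i, j] = scale
            zero_points[i, j] = int(torch.round(qmin - t_min / scale))
    return scales, zero_points
```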
Here is a minimal reproducible example:
```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# define a llmcompressor recipe for W4A16 (INT4 weight) block-wise quantization
# since gate layers are sensitive to quantization, we add them to the ignore
# list (along with lm_head) so they remain at full precision
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.BLOCK,
                    # group_size=128,
                    block_structure="128x128",
                ),
            )
        },
    )
]

SAVE_DIR = MODEL_ID + "-W4A16-BLOCK128"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="bfloat16", trust_remote_code=True
)

oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    output_dir=SAVE_DIR,
)
```
We will probably want to support block-wise quantization moving forward, but this is likely low priority. As far as I know, vLLM doesn't have optimized kernels for running block-wise quantized models the way it does for group-wise quantization. I will create a PR that references this issue and raises an error when the block-wise strategy is used, suggesting group-wise instead.
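As a workaround in the meantime, a group-wise recipe along the lines below (a sketch, not validated against this exact model) should run end to end, since the GROUP strategy is already supported and only the weight args change relative to the repro above:

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.modifiers.quantization import QuantizationModifier

# Same W4A16 setup as above, but using the supported GROUP strategy with
# group_size=128 instead of BLOCK / block_structure="128x128".
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            )
        },
    )
]
```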