Description
Originally reported in #1464
The Observer class and compressed-tensors do not currently have an implementation for BLOCK-wise quantization (see code starting here).
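For context, block-wise quantization keeps one scale (and zero point, when asymmetric) per fixed-size tile of the weight matrix, e.g. 128x128, rather than per tensor, channel, or group. The sketch below is purely illustrative of the per-block statistics an observer would need to produce; the helper name and shapes are not part of either library:

```python
import torch

# Illustrative sketch only: compute per-128x128-block min/max and derive an
# asymmetric 4-bit scale/zero-point for each block of a 2-D weight matrix.
def blockwise_qparams(weight: torch.Tensor, block: int = 128, num_bits: int = 4):
    qmin, qmax = 0, 2**num_bits - 1
    rows, cols = weight.shape
    n_rb = (rows + block - 1) // block  # number of row blocks
    n_cb = (cols + block - 1) // block  # number of column blocks
    scales = torch.empty(n_rb, n_cb, dtype=torch.float32)
    zero_points = torch.empty(n_rb, n_cb, dtype=torch.int32)
    for i in range(n_rb):
        for j in range(n_cb):
            tile = weight[i * block:(i + 1) * block, j * block:(j + 1) * block]
            t_min, t_max = tile.min(), tile.max()
            scale = (t_max - t_min).clamp(min=1e-8) / (qmax - qmin)
            scales[i, j] = scale
            zero_points[i, j] = int(torch.round(qmin - t_min / scale))
    return scales, zero_points
```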
Here is a minimal reproducible example:
```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# define a llmcompressor recipe for W4A16 (INT4 weight) block-wise quantization
# since gate layers are sensitive to quantization, we add them to the ignore
# list (along with lm_head) so they remain at full precision
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.BLOCK,
                    # group_size=128,
                    block_structure="128x128",
                ),
            )
        },
    )
]

SAVE_DIR = MODEL_ID + "-W4A16-BLOCK128"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="bfloat16", trust_remote_code=True
)

oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    output_dir=SAVE_DIR,
)
```
We will probably want to support block-wise quantization moving forward, but this is likely low priority. As far as I know, vLLM doesn't have optimized kernels for running block-wise quantized models the way it does for group-wise quantization. I will create a PR that references this issue and raises an error when the block-wise strategy is used, suggesting group-wise instead.
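As a workaround in the meantime, a group-wise recipe along the lines below (a sketch, not validated against this exact model) should run end to end, since the GROUP strategy is already supported and only the weight args change relative to the repro above:

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.modifiers.quantization import QuantizationModifier

# Same W4A16 setup as above, but using the supported GROUP strategy with
# group_size=128 instead of BLOCK / block_structure="128x128".
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            )
        },
    )
]
```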