Block-wise Quantization Not supported #1475

Open
@brian-dellabetta

Description

Originally reported in #1464

The Observer class and compressed-tensors do not currently implement BLOCK-wise quantization (see code starting here)
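
For context, BLOCK-wise quantization assigns one scale per two-dimensional tile of the weight matrix (e.g. 128x128), rather than per tensor, channel, or contiguous group. A minimal PyTorch sketch of the per-block scale computation, assuming symmetric int4 and weights that tile evenly (illustrative only, not the Observer/compressed-tensors API):

import torch

def blockwise_scales(weight: torch.Tensor, block: int = 128) -> torch.Tensor:
    # Illustrative sketch: one absmax scale per (block x block) tile,
    # assuming symmetric int4. Not the compressed-tensors implementation.
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "weight must tile evenly"
    # View as (row_blocks, block, col_blocks, block) and reduce over each tile
    tiles = weight.reshape(rows // block, block, cols // block, block)
    absmax = tiles.abs().amax(dim=(1, 3))
    qmax = 7  # symmetric int4 range is [-8, 7]
    return absmax / qmax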

Here is a minimal reproducible example:

from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# define a llmcompressor recipe for INT4 weight-only (W4A16) block-wise
# quantization; the lm_head and any gate layers, which are sensitive to
# quantization, are added to the ignore list so they remain at full precision
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.BLOCK,
                    # group_size=128,
                    block_structure="128x128",
                ),
            )
        },
    )
]

SAVE_DIR = MODEL_ID + "-W4A16-BLOCK128"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="bfloat16", trust_remote_code=True
)


oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    output_dir=SAVE_DIR,
)
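
Note that the same recipe runs today with group-wise quantization, the workaround suggested below, by swapping the strategy and supplying a group size (a sketch, otherwise identical to the QuantizationArgs above):

weights=QuantizationArgs(
    num_bits=4,
    type=QuantizationType.INT,
    dynamic=False,
    symmetric=False,
    strategy=QuantizationStrategy.GROUP,
    group_size=128,
),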

We will probably want to support block-wise quantization moving forward, but this is low priority: as far as I know, vLLM does not have optimized kernels for running block-wise quantized models, unlike group-wise quantization. I will open a PR that references this issue and raises an error when the block-wise strategy is used, suggesting group-wise quantization instead.
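
The guard could be as simple as validating the strategy up front; a hypothetical sketch (the actual check location and error message will be decided in the PR):

from compressed_tensors.quantization import QuantizationArgs, QuantizationStrategy

def validate_strategy(args: QuantizationArgs) -> None:
    # Hypothetical guard; placement and wording are up to the PR.
    if args.strategy == QuantizationStrategy.BLOCK:
        raise NotImplementedError(
            "BLOCK-wise quantization is not yet supported; use "
            "strategy=QuantizationStrategy.GROUP with group_size instead."
        )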


Labels: bug (Something isn't working), compressed-tensors (Relates to compressed-tensors), good follow-up issue (A good issue for users with some familiarity of the codebase)
