
[Research] Llama4 AutoWrapper + Onloading #1438


Draft
wants to merge 61 commits into base: main

Conversation

kylesayrs
Collaborator

No description provided.

kylesayrs and others added 30 commits February 24, 2025 14:47
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
SUMMARY:
Consolidate all build configuration into `setup.py`. The current split
between `pyproject.toml` and `setup.py` appears to cause race-condition-like,
unpredictable behavior in the build tooling with respect to whether it
honors the version functions defined in `setup.py`.
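For illustration, a minimal sketch of the consolidated layout described
above (the package name, directory layout, and version logic are
placeholders, not the actual llm-compressor `setup.py`):

```python
# Illustrative sketch only: with all build configuration living in setup.py,
# the dynamic version function below is reliably honored by the build tooling.
from setuptools import find_packages, setup


def get_version() -> str:
    # Placeholder for the dynamic version logic referenced above.
    return "0.0.0"


setup(
    name="example-package",  # placeholder name
    version=get_version(),
    packages=find_packages(where="src"),
    package_dir={"": "src"},
)
```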

TEST PLAN:
The relevant changes are identical to those in
neuralmagic/compressed-tensors#304, and a build
was produced internally here:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14732959015

Signed-off-by: Domenic Barbuzzi <[email protected]>
## Background ##
The current KV cache tests are silently failing because Qmod(kv) +
GPTQ(weights) is not a supported recipe: GPTQ was being skipped entirely
because it was preceded by a quantization modifier with no weight
schemes.

If you attempt to fix this by disallowing multi-qconfig recipes, you run
into the issue that model compression with KV+weights is not supported.

## Multi-round quantization ##

Previously, the model had no weight quantization, so there was no
compressor and the model was saved in the frozen status rather than the
compressed status. When the model was loaded, the inferred status was
None, so at load time the status moved from None to frozen, passing
through initialization.

After fixing the recipe to run GPTQ with kv+weight quantization, you run
into another issue.

## KV + Weight Compression ##

Now the model has weight quantization, so there is a compressor and the
model is saved in the compressed status. The model is then loaded with
CompressedLinear modules (which are in the frozen status), causing the
whole model to be inferred as compressed. Because the model is already
supposedly in the compressed status, initialization does not happen, the
kv_cache attention parameters are not initialized, and the model fails
to load the kv_cache qparams from the state dict.

## Ideal Solution ##

Ideally, we should replace modules with CompressedLinear as part of
apply_quantization_status, not before applying the quantization status.
Doing it the other way unintentionally skips all of the lifecycle steps.

As a side note, we should ideally always save final models with the
compressed status, even if the compressor is dense (structure is
initialized, checkpoints are calibration or frozen, and final models are
compressed so long as save_compressed is set, even if nothing was
applied), but this is out of scope for this fix.
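For reference, here is a sketch of the recipe shapes discussed above,
written against llm-compressor's public modifier interfaces (the scheme
values and `kv_cache_scheme` fields are illustrative, not the exact
recipe used by the tests):

```python
# Illustrative sketch of the recipes discussed above; values are examples only.
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

kv_cache_scheme = {
    "num_bits": 8,
    "type": "float",
    "strategy": "tensor",
    "dynamic": False,
    "symmetric": True,
}

# Original (unsupported) form: a kv-cache-only QuantizationModifier followed
# by GPTQ for the weights -- here GPTQ ended up being skipped entirely.
unsupported_recipe = [
    QuantizationModifier(kv_cache_scheme=kv_cache_scheme),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Fixed form: a single GPTQ modifier carrying both kv cache and weight
# quantization, so the full quantization lifecycle runs once.
fixed_recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    kv_cache_scheme=kv_cache_scheme,
)
```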

---------

Signed-off-by: Kyle Sayers <[email protected]>
SUMMARY:
Drop the skip related to requiring `flash_attn` to be installed in the
tests for the `quantizing_moe` examples. Recent CI failures, related to
this package's CUDA compatibility with the recently released PyTorch
2.7.0, led to the finding that it is not required for these tests.

TEST PLAN:
An [internal test run][1] that drops the installation of `flash-attn`
and runs the changes on this branch indicates that the tests will pass
(one test successful so far; the PR will be marked ready once the run
completes and the remaining tests show the expected results).

Specific relevant output (will update with other tests’ results):
```
tests/examples/test_quantizing_moe.py::TestQuantizingMOE::test_deepseek_example_script[deepseek_moe_w8a8_int8.py] PASSED

tests/examples/test_quantizing_moe.py::TestQuantizingMOE::test_deepseek_example_script[deepseek_moe_w8a8_fp8.py] PASSED
```

[1]:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14712618904

Signed-off-by: Domenic Barbuzzi <[email protected]>
…t for gpu from gha (#1264)

## Purpose ##
* Update all tests to use the `requires_gpu` decorator (a typical definition is sketched after this list)
* Add GPU mark skip for `test_compressor_stacking`, which requires a GPU
* Add an explicit GPU test for GHA, so as to unambiguously catch
situations where CUDA is not properly installed on a runner
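A decorator like this is typically just a CUDA-availability `skipif`
marker; a minimal sketch, assuming pytest and torch (the actual helper
in this repo's test utilities may be defined differently):

```python
# Minimal sketch of a GPU-gating decorator; the real `requires_gpu` helper
# may differ in detail.
import pytest
import torch

requires_gpu = pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="requires a CUDA-capable GPU",
)


@requires_gpu
def test_compressor_stacking():
    # Skipped on CPU-only runners instead of failing with a CUDA error.
    assert torch.cuda.device_count() > 0
```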

---------

Signed-off-by: Kyle Sayers <[email protected]>
## Purpose ## 
* Abstract functionality which allows modifiers to act as quantization
configs into a mixin called `QuantizationMixin`
* This gives #1279 an interface to properly infer which pipeline to use
based on the recipe (if a recipe contains modifiers that require
calibration, then use the "basic" or "sequential" pipelines)
* This enables future modifiers to act as quantization modifiers (in the
same way that GPTQ does now)
* Related to #1354 where previous logic would attempt to add a
QuantizedKVCache for dynamic kv_quant

## Changes ##
* Implement `QuantizationMixin`, which implements five public methods
  * Lifecycle methods
    * `initialize_quantization` is used to apply a config and attach observers to a model
      * quantization is disabled so that modules aren't quantized before they're calibrated
    * `start_calibration` is used to initialize calibration hooks and status
      * quantization is enabled, since we currently quantize as we calibrate, although this decision is somewhat arbitrary
    * `end_calibration` is used to remove calibration hooks and apply the frozen status
      * quantization remains enabled, since we want future forward passes to simulate quantization
  * Recipe-related methods
    * `has_config` returns true if a config was specified, used for checking against duplicate configs in the recipe
    * `resolve_quantization_config` returns the quantization config specified by the modifier fields
* `QuantizationModifier` inherits from `QuantizationMixin` (see the usage sketch after this list)
* `GPTQModifier` inherits from `QuantizationMixin`
  * Unlike QMod, GPTQ disables quantization during calibration. As noted before, this is a somewhat arbitrary choice, but one which matches the current implementation

* Calibration utils
  * Replace `set_unset_kv_cache` with `initialize_quantized_kv_cache` and `freeze_module_quantization`
    * Treat the `QuantizedKVCache` as analogous to another observer
  * Pull setting the calibration status out of `update_weight_zp_scale`
    * This better matches the lifecycle detailed in the `QuantizationMixin` description
  * Implement `reset_quantization_status`, which is used to remove any existing quantization configs before the current config is applied by `initialize_quantization`
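With this mixin in place, a quantization-only recipe is just the
modifier itself; a minimal sketch using `QuantizationModifier` as it
appears in the public examples (the scheme and targets shown are
illustrative):

```python
# Minimal sketch: a modifier acting as its own quantization config via
# QuantizationMixin. Scheme/target values are illustrative examples.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# The mixin then drives the lifecycle described above:
#   initialize_quantization -> apply config and attach observers (quantization disabled)
#   start_calibration       -> add calibration hooks (QMod quantizes as it calibrates)
#   end_calibration         -> remove hooks and apply the frozen status
#                              (quantization stays enabled for later forward passes)
```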

## Remove Support ##
* Removing support for recipes with multiple quantization modifiers
active at the same time (a check for this will be added by #1279)
* Remove `num_calibration_steps`, `quantize`,
`disable_quantization_observer_epoch` and `min_tokens_per_module`
* `num_calibration_steps` is already controlled by
https://github.com/vllm-project/llm-compressor/blob/42b62f5283d0234b26623fe1f1bf02a77c6e4019/src/llmcompressor/datasets/utils.py#L106
* `quantize` was implemented as a workaround for GPTQ's modifier
builder. Similar functionality may be required to support SpinQuant +
GPTQ, but such functionality should exist at a higher level
* `disable_quantization_observer_epoch` seems to implement functionality
where a model's observers are removed but quantization remains active.
This functionality is maintained by setting an "end" epoch for qmod
* `min_tokens_per_module` requires that the modifier have references to
the calibration dataset, which is disallowed by #1279. This information
is already printed in GPTQ's logs. If research still wants this tool
specifically for `QuantizationModifier`, then it can be reimplemented to
avoid using references to the calibration dataset
  
## Testing ##
* Updated tests to reflect new mixin
* Ran a set of GPTQ and QuantizationModifier examples to completion
* CI tests pass

---------

Signed-off-by: Kyle Sayers <[email protected]>
This PR updates the main README.md to introduce a "New Features"
section, improving visibility for recent major additions to LLM
Compressor.

This section highlights:

- Axolotl Sparse Finetuning Integration
(https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor)
- AutoAWQ Integration for low-bit weight quantization (#1177)
- Day 0 Llama 4 support and its use by Meta

This helps users quickly understand the latest capabilities of the
library.

---------

Signed-off-by: Rahul Tuli <[email protected]>
SUMMARY:
Add support for tracing of Gemma3:
[issue #1248](#1248).

Steps taken:
1. Create `gemma3.py` from HF and update `__init__.py`.
2. Classes and functions modified:
    2.1 `Gemma3ForConditionalGeneration`: `_update_causal_mask` and `forward`
    2.2 `Gemma3TextModel`: `_update_causal_mask`, `forward`, and
`_prepare_4d_causal_attention_mask_with_cache_position`


TEST PLAN:
Ran:
`llmcompressor.trace --model_id google/gemma-3-4b-it --model_class
TraceableGemma3ForConditionalGeneration --ignore "lm_head"
"re:vision_tower.*" --modality vision`

Output:
![trace_output](https://github.com/user-attachments/assets/8f5c9c7d-32a9-4b12-b4b2-10b6a4352846)

This is my first attempt at solving this issue. It has been a fun
learning experience, so please review it carefully.
Gemma3 can now be traced, but we might need further tests for
quantization as well.
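Once tracing works, quantization would presumably follow the same flow
as the other traceable multimodal models; a hedged sketch of that flow
(the dataset, scheme, and calibration settings are illustrative, and the
exact `oneshot` arguments should follow the existing multimodal
examples):

```python
# Hedged sketch only: mirrors the structure of the existing multimodal
# tracing examples; argument values are illustrative placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers.tracing import TraceableGemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = TraceableGemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# Ignore list matches the trace command above: skip lm_head and the vision tower
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:vision_tower.*"],
)

oneshot(
    model=model,
    dataset="flickr30k",  # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```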

---------

Signed-off-by: Kelvin Cheng <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Domenic Barbuzzi <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Vedant <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Domenic Barbuzzi <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs added 21 commits May 4, 2025 20:19
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs changed the base branch from main to kylesayrs/autowrapper May 16, 2025 02:25
kylesayrs added 2 commits May 15, 2025 23:09
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@quantLm14

Will this patch work with Llama4-scout when applying GPTQ?

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Base automatically changed from kylesayrs/autowrapper to main May 29, 2025 16:44
Labels: None yet
Projects: None yet
5 participants