Use torch.compile to speed up GPTQ algo #1561
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @aladerran, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to enhance the performance of the GPTQ quantization process by integrating `torch.compile`. The changes primarily focus on optimizing the `quantize_weight` function, a critical component of the GPTQ algorithm, to achieve faster execution times.
Highlights
- Performance Optimization: The core change applies `torch.compile` to the `quantize_weight` function within the GPTQ quantization algorithm to significantly speed up its execution.
- TorchDynamo Configuration: The necessary `torch._dynamo.config` import and settings, specifically `torch._dynamo.config.capture_scalar_outputs = True`, have been added to ensure proper compilation and avoid potential issues with scalar outputs.
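As a rough illustration of the pattern described above (the name `quantize_weight` comes from the PR; the body below is a toy stand-in, not llm-compressor's actual GPTQ implementation):

```python
import torch

# The PR enables scalar-output capture so Dynamo can trace .item()-style
# values that appear inside the GPTQ loop.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def quantize_weight(W: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Toy round-to-nearest fake-quantization; the real quantize_weight performs
    # column-wise GPTQ quantization with Hessian-based error compensation.
    return torch.clamp(torch.round(W / scale), -8, 7) * scale

device = "cuda" if torch.cuda.is_available() else "cpu"
W = torch.randn(512, 512, device=device)
scale = W.abs().amax() / 7
W_q = quantize_weight(W, scale)
```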
Code Review
This pull request integrates `torch.compile` into the GPTQ quantization process, which the provided profiling results indicate leads to significant speedups. The changes involve adding the necessary import, applying the `@torch.compile` decorator, and setting a `torch._dynamo.config` flag. The core change of applying the decorator is well placed. The only feedback is a minor suggestion regarding the global configuration setting for `torch.compile`, to improve code clarity and maintainability.
```diff
@@ -16,6 +17,8 @@
 from llmcompressor.observers.base import Observer
 from llmcompressor.pytorch.utils.helpers import tensor_sparsity

+torch._dynamo.config.capture_scalar_outputs = True
```
Setting `torch._dynamo.config.capture_scalar_outputs = True` at the module level applies this configuration globally to any code that imports this module. While this might be necessary for `torch.compile` to function correctly with the `quantize_weight` function, it is a broad setting that could affect other parts of the codebase in unexpected ways. Consider adding a brief comment explaining why this setting is needed for this module/function and acknowledging its global scope.
Suggested change:
```python
# Enable scalar capture for torch.compile, potentially needed for control flow
torch._dynamo.config.capture_scalar_outputs = True
```
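If the global side effect is a concern, a narrower alternative (my suggestion, not something proposed in the review) is to scope the flag with `torch._dynamo.config.patch`, used as a context manager around the call that triggers compilation:

```python
import torch

@torch.compile
def scalar_dependent(x: torch.Tensor) -> torch.Tensor:
    # x.sum().item() is a scalar output; without capture_scalar_outputs this
    # would force a graph break back to eager execution.
    s = x.sum().item()
    return x * s

# Scope the flag to this call instead of setting it globally at import time.
with torch._dynamo.config.patch(capture_scalar_outputs=True):
    out = scalar_dependent(torch.randn(8))
```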
Benchmarking script I used:
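The script itself is not reproduced in this thread. As a rough sketch of what such a comparison might look like (the `fake_quant` kernel and tensor sizes below are placeholders, not the author's actual benchmark):

```python
import time
import torch

def benchmark(fn, *args, warmup=3, iters=10):
    # Warm-up runs exclude torch.compile's one-time compilation cost.
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def fake_quant(W, scale):
    # Placeholder for the GPTQ kernel under test.
    return torch.clamp(torch.round(W / scale), -8, 7) * scale

device = "cuda" if torch.cuda.is_available() else "cpu"
W = torch.randn(4096, 4096, device=device)
scale = W.abs().amax() / 7

eager_ms = benchmark(fake_quant, W, scale) * 1e3
compiled_ms = benchmark(torch.compile(fake_quant), W, scale) * 1e3
print(f"eager: {eager_ms:.2f} ms/iter, compiled: {compiled_ms:.2f} ms/iter")
```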
Hi @aladerran! Thank you for your contribution and thorough profiling data! It seems the new runtime is about 86% of the original, a notable improvement! This change should be good to merge now, but there are a few other small modifications to the gptq_quantize method that have the potential to drastically improve runtime, specifically removing branching logic in the algorithm in order to reduce graph breaks. You can debug graph breaks with TorchDynamo's tooling (one option is sketched below).
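The reviewer's exact tooling suggestion is cut off above; one common way to inspect graph breaks (an assumption on my part) is `torch._dynamo.explain`, or setting `TORCH_LOGS=graph_breaks` when running the workload:

```python
import torch

def branchy(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent Python branching like this is a typical source of
    # graph breaks under torch.compile.
    if x.sum().item() > 0:
        return x * 2
    return x - 1

explanation = torch._dynamo.explain(branchy)(torch.randn(8))
print(explanation.graph_break_count)
print(explanation.break_reasons)
```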
Hi @kylesayrs, Thank you for the feedback! I'll look into further optimizing the runtime.
Signed-off-by: aladerran <[email protected]>
Hi @kylesayrs, I introduced quantize_weight_optimized in a new commit, which isolates the main GPTQ quantization loop into a function that can be accelerated with torch.compile. The core logic should remain functionally equivalent to the original implementation. Without torch.compile, this version already runs in ~70% of the original runtime. With torch.compile enabled, execution time drops further to ~10-20% of the original. I have updated my test script above, and some of the test results are shown here: gptq_baseline_profile.txt
However, there are a few considerations, chiefly the one-time compilation overhead that torch.compile introduces. Given that overhead, should we make torch.compile an optional feature? Any feedback on how best to expose this optimization would be great.
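One way the optional-compile question could be addressed (a sketch under assumed names; the environment variable below is hypothetical, not an existing llm-compressor option) is to gate compilation behind an opt-in switch so small, single-shot runs don't pay the compilation overhead:

```python
import os
import torch

def _quantize_weight_optimized(W: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Placeholder body; in the PR this would hold the isolated GPTQ quantization loop.
    return torch.clamp(torch.round(W / scale), -8, 7) * scale

# Hypothetical opt-in flag: compile only when explicitly requested.
if os.environ.get("LLMCOMPRESSOR_COMPILE_GPTQ", "0") == "1":
    torch._dynamo.config.capture_scalar_outputs = True
    quantize_weight_optimized = torch.compile(_quantize_weight_optimized)
else:
    quantize_weight_optimized = _quantize_weight_optimized
```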
@aladerran Amazing work! Thank you for the contribution! I'll verify this asap so we can start quantizing faster ⚡💪
SUMMARY:
In response to #1496, this PR uses torch.compile to speed up the GPTQ quantization process in gptq_quantize.py and adds simple benchmarking tools.
I tested on a single NVIDIA A100-SXM4-80GB, with:
PyTorch version: 2.7.0+cu126
CUDA version: 12.6
cuDNN version: 90501
gptq_baseline_profile.txt
gptq_tc_profile.txt
TEST PLAN:
First-time contributor here, please let me know if you have any tips!