Add Cutlass integration for MoE FP8 #19843


Open · wants to merge 5 commits into main

Conversation

@JackChuang JackChuang commented Jun 19, 2025

Purpose

Motivation

With this integration, users can optionally enable the Cutlass backend to improve the performance of MoE FP8 workloads when needed.

Modifications

This PR integrates a Cutlass-based kernel for the MoE FP8 execution path in vLLM. Since the current Cutlass kernel does not support per-block scaling, we adapted the integration by converting the per-block scaling format into a per-tensor equivalent, making it compatible with the existing Cutlass kernel interface.

The implementation is modular and backward-compatible. A flag named VLLM_USE_CUTLASS_MOE_FP8 controls whether to activate the Cutlass kernel. By default, this flag is disabled, ensuring the original execution path remains completely untouched.
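As a rough sketch of how such an opt-in flag can gate the backend choice (the helper names below are placeholders, not the actual vLLM entry points; the real integration checks envs.VLLM_USE_CUTLASS_MOE_FP8 inside the FP8 MoE path):

```python
import os

# Illustration only: an opt-in env flag gating the MoE FP8 backend.
def _cutlass_moe_fp8(x):  # stands in for the Cutlass-backed kernel
    return x

def _triton_moe_fp8(x):   # stands in for the default Triton kernel
    return x

def run_moe_fp8(x):
    use_cutlass = os.environ.get("VLLM_USE_CUTLASS_MOE_FP8", "0") == "1"
    # Disabled by default, so the original execution path stays untouched.
    return _cutlass_moe_fp8(x) if use_cutlass else _triton_moe_fp8(x)
```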

Usage:

$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...
...

Test Result

Device: H20 * 8 (TP=8)
Model: DeepSeek-R1
vLLM version: v0.8.5.post1 (reason explained in "Found Issue")
Env variable: VLLM_USE_V1=1

Server's configuration

VLLM_USE_CUTLASS_MOE_FP8=1 VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --disable-log-requests --port 8010 --model /nvme0n1/DeepSeek-R1 --trust-remote-code --max-model-len 5120 --max-num-batched-tokens 5120 --tensor-parallel-size 8 --gpu_memory_utilization 0.98 --quantization fp8

Client's configuration

benchmarks/benchmark_serving.py

declare -a input_lens=(3500)
declare -a output_lens=(1500)

Small test scenario:
	--num-prompts 16 --max-concurrency 4 --random-input-len $input_lens --random-output-len $output_lens
Large test scenario:
	--num-prompts 100 --max-concurrency 10 --random-input-len $input_lens --random-output-len $output_lens

Summarized Results

| Total Token Throughput (tok/s) | Small (low concurrency + few prompts) | Large (high concurrency + many prompts) |
| --- | --- | --- |
| Baseline (Triton) | 429.52 | 425.70 |
| Cutlass | 426.93 | 544.83 |

Conclusion

Under our setup:

  • In the large test scenario, Cutlass significantly improved throughput by 27.62%.
  • In the small test scenario, Cutlass performed similarly to the baseline, with no significant difference.

Found Issue ⚠️

This PR is currently blocked by #19923, which tracks an OOM issue on the main branch during cutlass_moe_fp8() tensor allocation.
Our PR is based on v0.8.5.post1 because an open issue prevents our setup (H20 with TP=8 running DeepSeek-R1) from being rebased onto the latest main. In the older version, cutlass_moe_fp8() allocates a relatively small amount of memory, but the new implementation allocates at least 24.5 GB, which appears abnormal; our current testing environment does not have enough memory to run it. If verification against origin/main is required, we would need help from the community to fix this issue first.

Future work

  1. After this version is merged, we plan to introduce shape-, architecture-, and model-specific kernel tuning to further optimize the Cutlass MoE FP8 kernel. This includes releasing H20-specific tuning parameters, since this version does not yet include any such tuning. On the model side, we will provide tuning parameters tailored to DeepSeek-R1 to better adapt to its workload characteristics. These enhancements will help us fully leverage the performance potential of the Cutlass kernel across deployment scenarios.

  2. The Cutlass kernel used in this version is per-tensor. After this PR is merged, we will release a per-block version to support users with strict accuracy requirements (see the sketch below for the scaling-granularity difference).
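As context for the accuracy trade-off in item 2, here is a minimal sketch of the difference between per-tensor and per-block (128x128) scaling, assuming a toy [num_experts, M, N] weight; the function names are illustrative, and 448 is the FP8-E4M3 maximum already used by the conversion code quoted later in this thread.

```python
import torch

FP8_E4M3_MAX = 448.0  # max magnitude of torch.float8_e4m3fn

def per_tensor_scale(w: torch.Tensor) -> torch.Tensor:
    # One scale for the whole tensor: simple, but outliers dominate.
    return w.abs().max() / FP8_E4M3_MAX

def per_block_scales(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    # One scale per (block x block) tile: finer-grained, better accuracy.
    # Assumes the last two dims are divisible by `block`.
    e, m, n = w.shape
    tiles = w.view(e, m // block, block, n // block, block)
    return tiles.abs().amax(dim=(2, 4)) / FP8_E4M3_MAX

# Example with a toy expert weight of shape [num_experts, 256, 256]:
w = torch.randn(4, 256, 256)
print(per_tensor_scale(w).shape)   # torch.Size([])        -> a single scalar
print(per_block_scales(w).shape)   # torch.Size([4, 2, 2]) -> one per tile
```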



👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @JackChuang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the v1 label Jun 19, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @JackChuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily integrates a Cutlass-based kernel to accelerate Mixture-of-Experts (MoE) FP8 workloads, offering substantial performance gains for large-scale inference. It includes necessary adaptations for FP8 weight handling and introduces a new environment variable for optional activation. Additionally, it incorporates a memory leak fix in the scheduler and an improvement to FlashAttention's AOT scheduling.

Highlights

  • Performance Enhancement: Integrated a Cutlass-based kernel for MoE FP8 workloads, demonstrating significant throughput improvements (up to 27.62%) in large test scenarios, particularly beneficial for high-concurrency environments.
  • Modular Integration: Introduced a new environment variable (VLLM_USE_CUTLASS_MOE_FP8) to optionally enable the Cutlass kernel, ensuring backward compatibility and allowing users to choose their preferred execution path without affecting the default behavior.
  • FP8 Scaling Adaptation: Implemented a conversion mechanism (fp8_bf16_fp8) to adapt per-block FP8 scaling to a per-tensor format, making it compatible with the current Cutlass kernel interface, as per-block scaling is not yet supported by Cutlass.
  • Kernel Tuning Configurations: Added numerous new JSON configuration files for various NVIDIA GPU devices (H100, H20, H200) and FP8 quantization settings, providing optimized parameters for the Cutlass MoE FP8 kernels.
  • Scheduler Memory Fix: Addressed a potential memory leak in the scheduler's update_from_output method, preventing stopped requests from being inadvertently re-cached and ensuring proper memory deallocation.
  • FlashAttention AOT Improvement: Enhanced FlashAttention's Ahead-of-Time (AOT) scheduling logic to better handle sliding window configurations by ensuring consistent window sizes across attention layers, improving efficiency for certain models.

@mergify mergify bot added the needs-rebase label Jun 19, 2025
A flag named `VLLM_USE_CUTLASS_MOE_FP8` controls whether to activate
the Cutlass kernel. By default, this flag is disabled, ensuring
the original execution path remains completely untouched.

Usage:
$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces Cutlass integration for MoE FP8, which shows promising performance improvements. It also includes unrelated but valuable enhancements to FlashAttention AOT scheduling and a scheduler memory leak fix.

The main concern is the robustness of the Cutlass MoE FP8 path when dealing with models that are not already in a per-block FP8 quantized format. The current implementation seems to assume the input weights are per-block FP8, which might lead to errors or incorrect behavior if quantizing on-the-fly or using per-tensor FP8 checkpoints with the Cutlass flag enabled. This needs to be addressed or clearly documented as a prerequisite.

Other changes, including the new benchmark configurations, environment variable, and the scheduler/attention backend fixes, look good.

Comment on lines +647 to +660
else:
    w13_weight, w13_weight_scale_inv = \
        self.fp8_bf16_fp8(w13_weight, w13_weight_scale_inv)
    w2_weight, w2_weight_scale_inv = \
        self.fp8_bf16_fp8(w2_weight, w2_weight_scale_inv)

    w13_weight_scale_inv = w13_weight_scale_inv.repeat(w13_weight.size(0))
    w2_weight_scale_inv = w2_weight_scale_inv.repeat(w2_weight.size(0))

    layer.w13_weight.data.copy_(w13_weight)
    layer.w13_weight_scale_inv = Parameter(w13_weight_scale_inv,
                                           requires_grad=False)
    layer.w2_weight.data.copy_(w2_weight)
    layer.w2_weight_scale_inv = Parameter(w2_weight_scale_inv,
                                          requires_grad=False)

Severity: critical

The current logic for the Cutlass path (when envs.VLLM_USE_CUTLASS_MOE_FP8 is true) appears to correctly handle models that are already per-block FP8 quantized (i.e., self.block_quant is true).

However, it seems this path will fail if VLLM_USE_CUTLASS_MOE_FP8=1 is used with:

  • Models quantized on-the-fly from FP16/BF16 to FP8 (where self.quant_config.is_checkpoint_fp8_serialized is False).
  • Models loaded from per-tensor FP8 checkpoints (where self.block_quant is False but is_checkpoint_fp8_serialized is True).

In these scenarios, layer.w13_weight_scale_inv (and w2_weight_scale_inv) would not have been initialized as expected by this code block, leading to an AttributeError when accessed (e.g., w13_weight_scale_inv in the call to self.fp8_bf16_fp8).

Recommendation:

This path needs to be robust for all scenarios where FP8 MoE is active and Cutlass is enabled.

  • For on-the-fly quantization (FP16/BF16 -> FP8 per-tensor): The weights are already per-tensor FP8 after the block starting at line ~687 (in the full file context, handling not self.quant_config.is_checkpoint_fp8_serialized). The fp8_bf16_fp8 conversion is not needed. The existing per-tensor scales (e.g., layer.w13_weight_scale) should be used, potentially renamed/assigned to layer.w13_weight_scale_inv and then repeated as done on line 653.
  • For per-tensor FP8 checkpoints: Similar to on-the-fly, the weights are already per-tensor. The fp8_bf16_fp8 conversion is not needed. Scales from layer.w13_weight_scale should be used.

This implies that the call to fp8_bf16_fp8 should likely be conditional, e.g., only if self.block_quant is true. Otherwise, the existing per-tensor quantized weights and scales should be prepared for Cutlass (e.g., repeating scales).

Please clarify if VLLM_USE_CUTLASS_MOE_FP8 is only intended for pre-quantized per-block FP8 models. If so, this needs to be very clearly documented, and perhaps an error should be raised if the flag is used with incompatible model/quantization settings.
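One rough shape the suggested guard could take, as an illustrative standalone sketch (`convert_fn` stands in for fp8_bf16_fp8, and a real patch would operate on the layer.* attributes shown in the snippet above):

```python
from typing import Callable, Optional, Tuple

import torch

def resolve_per_tensor_scale(
    block_quant: bool,
    weight: torch.Tensor,
    block_scale: Optional[torch.Tensor],
    tensor_scale: Optional[torch.Tensor],
    convert_fn: Callable,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Sketch only: run the per-block -> per-tensor conversion just for
    # block-quantized checkpoints; per-tensor checkpoints and on-the-fly
    # quantization already carry a per-tensor scale that can be reused.
    if block_quant:
        weight, scale = convert_fn(weight, block_scale)
    else:
        scale = tensor_scale
    # The Cutlass path expects one scale entry per expert, hence the repeat.
    return weight, scale.repeat(weight.size(0))
```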

Comment on lines +742 to +745
# NOTE(rob): since we free stopped reqs above, adding stopped reqs
# to _cached_reqs_data will cause a memory leak.
if req_data.req_id not in self.finished_req_ids:
    self._cached_reqs_data[req_data.req_id].append(req_data)

Severity: high

This change correctly prevents adding CachedRequestData for already finished requests back to the _cached_reqs_data pool. This is a good fix that avoids a potential memory leak where CachedRequestData objects for finished requests could accumulate or be incorrectly reused if request IDs were recycled.

This significantly improves the robustness of the scheduler, especially for long-running services.

Comment on lines +600 to +614
def fp8_bf16_fp8(self, fp8_tensor, fp8_scale):
    blocked_tensor = fp8_tensor.view(
        fp8_tensor.shape[0],
        fp8_tensor.shape[1] // 128, 128,
        fp8_tensor.shape[2] // 128,
        128).to(torch.float32)
    # Because blocked_tensor is 5D, reshape to [B, M//128, 1, N//128, 1]
    dequant_tensor = (blocked_tensor *
                      fp8_scale.unsqueeze(2).unsqueeze(4)).view(
                          fp8_tensor.shape).to(torch.bfloat16).to(torch.float32)

    scale_tensor = torch.abs(dequant_tensor).max() / 448
    quant_tensor = dequant_tensor / scale_tensor

    return quant_tensor, scale_tensor

Severity: medium

This function converts a per-block quantized FP8 tensor to a format suitable for a per-tensor quantized FP8 kernel. Consider adding a docstring to clarify:

  • Inputs: fp8_tensor is expected to be torch.float8_e4m3fn and fp8_scale its corresponding per-block scales.
  • Output: quant_tensor is returned as torch.float32 (representing values that would be FP8 if cast using scale_tensor), and scale_tensor is the new per-tensor scale.
  • Assumptions: The .view(...) operation on fp8_tensor (lines 601-605) assumes that fp8_tensor.shape[1] and fp8_tensor.shape[2] are divisible by 128. This should be mentioned.
  • Naming: The name fp8_bf16_fp8 could be slightly misleading. Perhaps a name like convert_fp8_block_to_tensor_scaled or similar, or a very clear docstring, would improve clarity (see the sketch below).
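Along those lines, a possible docstringed variant, purely as a sketch: it adopts the suggested name, mirrors the body of the snippet quoted above, and `FP8_E4M3_MAX = 448` just names the constant already used there.

```python
import torch

FP8_E4M3_MAX = 448  # the constant already used in fp8_bf16_fp8

def convert_fp8_block_to_tensor_scaled(fp8_tensor: torch.Tensor,
                                       fp8_scale: torch.Tensor):
    """Dequantize a per-block (128x128) FP8 weight and re-derive a single
    per-tensor scale so it can be fed to a per-tensor Cutlass MoE FP8 kernel.

    Args:
        fp8_tensor: [E, M, N] weight (typically torch.float8_e4m3fn);
            M and N must be divisible by 128.
        fp8_scale: [E, M // 128, N // 128] per-block scales used to
            dequantize the weight.

    Returns:
        quant_tensor: float32 values already divided by the new scale
            (ready to be cast to FP8 with no further scaling).
        scale_tensor: the new per-tensor scale (0-dim float32 tensor).
    """
    e, m, n = fp8_tensor.shape
    blocked = fp8_tensor.view(e, m // 128, 128, n // 128, 128).to(torch.float32)
    # Broadcast the per-block scales over each 128x128 tile.
    dequant = (blocked * fp8_scale.unsqueeze(2).unsqueeze(4)).view(
        fp8_tensor.shape).to(torch.bfloat16).to(torch.float32)
    scale_tensor = dequant.abs().max() / FP8_E4M3_MAX
    quant_tensor = dequant / scale_tensor
    return quant_tensor, scale_tensor
```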

Introduced optional support for using Cutlass kernels
in the MoE FP8 execution path by converting
the per-block scaling format into a per-tensor equivalent,
making it compatible with the existing Cutlass kernel interface.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
@JackChuang JackChuang force-pushed the horenc/w8a8_cutlass_pertensor_release_v0.8.5.post1 branch from 41022d2 to 84f6ceb on June 19, 2025 07:29
@mergify mergify bot added the performance (Performance-related issues) label Jun 22, 2025
Labels: needs-rebase, performance (Performance-related issues), v1
Projects: None yet
4 participants