MLA - Flashinfer Ragged Prefill #20034

Draft · alexm-redhat wants to merge 6 commits into main

Conversation

alexm-redhat (Collaborator) commented Jun 24, 2025

This is a draft PR that runs the FlashInfer ragged prefill for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct. It is still in rough form, but it produces correct results. Currently, there is a slowdown when using the FlashInfer ragged prefill. For example:

Batch size 1 with FlashInfer Ragged Prefill

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  0.67      
Total input tokens:                      999       
Total generated tokens:                  100       
Request throughput (req/s):              1.48      
Output token throughput (tok/s):         148.28    
Total Token throughput (tok/s):          1629.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          37.58     
Median TTFT (ms):                        37.58     
P99 TTFT (ms):                           37.58     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.43      
Median TPOT (ms):                        6.43      
P99 TPOT (ms):                           6.43      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.43      
Median ITL (ms):                         6.42      
P99 ITL (ms):                            6.68      
==================================================

Batch size 1 with the original main - uses FA2 for prefill

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  0.65      
Total input tokens:                      999       
Total generated tokens:                  100       
Request throughput (req/s):              1.53      
Output token throughput (tok/s):         153.49    
Total Token throughput (tok/s):          1686.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          29.76     
Median TTFT (ms):                        29.76     
P99 TTFT (ms):                           29.76     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.27      
Median TPOT (ms):                        6.27      
P99 TPOT (ms):                           6.27      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.27      
Median ITL (ms):                         6.25      
P99 ITL (ms):                            6.74      
==================================================

Batch size 100 with FlashInfer Ragged Prefill

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  3.63      
Total input tokens:                      99900     
Total generated tokens:                  10000     
Request throughput (req/s):              27.51     
Output token throughput (tok/s):         2751.26   
Total Token throughput (tok/s):          30236.31  
---------------Time to First Token----------------
Mean TTFT (ms):                          1196.92   
Median TTFT (ms):                        1253.75   
P99 TTFT (ms):                           2187.23   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.25     
Median TPOT (ms):                        22.71     
P99 TPOT (ms):                           33.14     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.25     
Median ITL (ms):                         13.48     
P99 ITL (ms):                            185.82    
==================================================

Batch size 100 with the original main - uses FA2 for prefill

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  2.63      
Total input tokens:                      99900     
Total generated tokens:                  10000     
Request throughput (req/s):              37.99     
Output token throughput (tok/s):         3799.28   
Total Token throughput (tok/s):          41754.09  
---------------Time to First Token----------------
Mean TTFT (ms):                          659.17    
Median TTFT (ms):                        666.53    
P99 TTFT (ms):                           1196.01   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.57     
Median TPOT (ms):                        18.52     
P99 TPOT (ms):                           24.14     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.57     
Median ITL (ms):                         13.39     
P99 ITL (ms):                            108.37    
==================================================

Things I'm currently doing to improve this result:

  1. Enable CUDA graphs for the FlashInfer ragged prefill (requires creating a set of pre-planned wrapper instances with fixed batch sizes; see the sketch after this list). This is in progress and will be ready today.
  2. Trace profiling.
  3. Investigate what SGLang does for this case.
  4. Any other suggestions?
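
For item 1, here is a minimal sketch of the pre-planned-wrapper idea, assuming one wrapper per fixed capture batch size; the names (CAPTURE_SIZES, wrappers) and the workspace size are illustrative assumptions, not the PR's actual code:

    import torch
    import flashinfer

    # Illustrative sketch: build one ragged-prefill wrapper per fixed capture
    # batch size so each can be planned once up front and reused at runtime.
    CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]  # assumed capture batch sizes
    wrappers = {}
    for bs in CAPTURE_SIZES:
        # A separate workspace per wrapper keeps each wrapper's planned state independent.
        workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
        wrappers[bs] = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(
            workspace, kv_layout="NHD")

At runtime the incoming prefill batch would be bucketed to the nearest capture size and the matching pre-planned wrapper reused, giving the kind of stable launch pattern that CUDA graph capture needs.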


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @alexm-redhat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the initial integration of FlashInfer's ragged prefill functionality into the MLA backend. The aim is to leverage FlashInfer for prefill operations, with current benchmarks indicating a performance regression compared to the existing FlashAttention 2 implementation. The changes primarily involve setting up FlashInfer-specific metadata, managing its prefill wrappers, and adapting the attention computation logic to utilize the new backend, along with an updated example for testing.

Highlights

  • FlashInfer Ragged Prefill Integration: I've integrated FlashInfer's BatchPrefillWithRaggedKVCacheWrapper to handle ragged prefill operations within the Multi-head Latent Attention (MLA) backend. This involves setting up the necessary FlashInfer wrappers and planning their execution (a minimal sketch of this flow follows the Highlights list).
  • FlashInfer-Specific Metadata and Hyperparameter Management: I've introduced a new FIPrefillMetadata dataclass to store FlashInfer-specific prefill parameters and wrappers. Additionally, I've added logic to extract and validate attention layer hyperparameters (such as window size and scale) from the VLLM configuration, ensuring these are consistent and correctly passed to the FlashInfer kernels.
  • Attention Logic Adaptation for Prefill: The core attention computation paths for prefill (_compute_prefill_context and _forward_prefill) have been updated. They now utilize a new _fi_prefill_run method, which dispatches to the FlashInfer prefill kernel, replacing the previous FlashAttention 2 calls for these operations.
  • Model Update for Benchmarking: The basic.py offline inference example has been updated to use the deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct model. This change facilitates focused testing and benchmarking of the new FlashInfer ragged prefill integration with a specific model.
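
As a rough illustration of the plan/run flow described above (a hand-written sketch, not code from this PR): the 192/128 head dimensions match the values in the diff below, while the head counts, sequence lengths, and dtypes are assumptions.

    import torch
    import flashinfer

    num_qo_heads = num_kv_heads = 16      # assumed per-rank head count
    head_dim_qk, head_dim_vo = 192, 128   # DeepSeek MLA: 128 nope + 64 rope / 128 value

    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, kv_layout="NHD")

    # Ragged layout: cumulative token offsets per request (here 2 requests of 5 and 3 tokens).
    qo_indptr = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
    kv_indptr = qo_indptr.clone()  # pure ragged prefill: keys cover the same new tokens

    wrapper.plan(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads,
                 head_dim_qk, causal=True, head_dim_vo=head_dim_vo)

    q = torch.randn(8, num_qo_heads, head_dim_qk, dtype=torch.float16, device="cuda")
    k = torch.randn(8, num_kv_heads, head_dim_qk, dtype=torch.float16, device="cuda")
    v = torch.randn(8, num_kv_heads, head_dim_vo, dtype=torch.float16, device="cuda")
    out = wrapper.run(q, k, v)  # [8, num_qo_heads, head_dim_vo]

In the PR, the equivalent plan() call happens while the prefill metadata is built, and run() is issued from the new _fi_prefill_run path.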

mergify bot added the documentation and v1 labels on Jun 24, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces FlashInfer ragged prefill for the DeepSeek-Coder-V2-Lite-Instruct model. While the changes are functionally correct, there is currently a performance slowdown. The author is actively working on improvements such as enabling CUDA graphs and trace profiling. The review identifies areas for code cleanup, consistency, and verification of key logic.

Comment on lines +1213 to +1214
cu_seqlens_q=attn_metadata.prefill.query_start_loc,
cu_seqlens_k=attn_metadata.prefill.query_start_loc,

high

The cu_seqlens_k argument is being passed the same value as cu_seqlens_q. Verify that this is the intended behavior, as it might lead to incorrect attention calculations if the key and query sequence lengths are different.

Comment on lines +1215 to +1216
max_seqlen_q=attn_metadata.prefill.max_query_len,
max_seqlen_k=attn_metadata.prefill.max_query_len,

high

The max_seqlen_k argument is being passed the same value as max_seqlen_q. Verify that this is the intended behavior, as it might lead to incorrect attention calculations if the key and query sequence lengths are different.

Comment on lines 1288 to 1289
cu_seqlens_q=attn_metadata.prefill.query_start_loc,
cu_seqlens_k=attn_metadata.prefill.query_start_loc,

high

The cu_seqlens_k argument is being passed the same value as cu_seqlens_q. Verify that this is the intended behavior, as it might lead to incorrect attention calculations if the key and query sequence lengths are different.

Comment on lines 1290 to 1291
max_seqlen_q=attn_metadata.prefill.max_query_len,
max_seqlen_k=attn_metadata.prefill.max_query_len,

high

The max_seqlen_k argument is being passed the same value as max_seqlen_q. Verify that this is the intended behavior, as it might lead to incorrect attention calculations if the key and query sequence lengths are different.

@@ -564,6 +564,8 @@ def forward(
"""
assert output is not None, "Output tensor must be provided."

print("kv_cache.shape = {}".format(kv_cache.shape))

medium

This print statement should be removed before merging. It's useful for debugging, but not for production code.

Comment on lines +649 to +651
192, #head_dim_qk,
causal=True,
head_dim_vo=128,

medium

The head dimensions are hardcoded here. It would be better to use variables or constants defined elsewhere in the code to ensure consistency and avoid magic numbers.

            head_dim_qk,  #head_dim_qk,
            causal=True,
            head_dim_vo=self.kv_cache_spec.head_size,
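
If the surrounding MLA implementation exposes the usual DeepSeek head-dimension attributes (the attribute names below are an assumption, not verified against this diff), both plan() call sites could derive these values instead of hardcoding them:

    # Hypothetical derivation, assuming the class carries these attributes:
    head_dim_qk = self.qk_nope_head_dim + self.qk_rope_head_dim  # 128 + 64 = 192 for DeepSeek-V2
    head_dim_vo = self.v_head_dim                                # 128 for DeepSeek-V2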

Comment on lines +680 to +682
192, #head_dim_qk,
causal=False,
head_dim_vo=128,

medium

The head dimensions are hardcoded here. It would be better to use variables or constants defined elsewhere in the code to ensure consistency and avoid magic numbers.

            head_dim_qk,  #head_dim_qk,
            causal=False,
            head_dim_vo=self.kv_cache_spec.head_size,

Comment on lines +1352 to +1353
# print("INSIDE forward")
# print(" kv_cache.shape = {}".format(kv_cache.shape))

medium

These commented-out debug print statements should be removed before merging. They're useful for debugging, but not for production code.
