[Kernel] Add Conch Triton Attention backend #19625
Conversation
Summary of Changes
Hello @jmanning-stackav, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new attention backend based on the Conch library's Triton kernels. The primary goal is to offer a performance alternative to the existing Triton backend within vLLM's V1 engine, especially targeting improvements on platforms like MI300X and potentially for longer sequences. The change involves adding the Conch dependency and integrating the new backend into vLLM's attention backend selection logic.
Highlights
- New Attention Backend: Adds a new V1 attention backend leveraging the Conch library, which provides Triton-based kernels for variable-length attention with KV cache, reshape_and_cache, and scaled FP8 quantization.
- Performance Improvements: The Conch backend aims to provide better performance than the existing vLLM Triton backend, particularly noted for decode performance (using FlashDecoding) and showing significant gains on MI300X in provided benchmarks.
- Platform Support: The Conch backend is intended to be compatible with any hardware supported by Triton, with specific testing reported on H100 and MI300X.
- Configuration: The new backend can be enabled by setting the environment variable `VLLM_ATTENTION_BACKEND=CONCH` (a minimal usage sketch follows this list).
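For illustration, here is a minimal usage sketch (not taken from this PR): it assumes the environment variable is set before vLLM selects an attention backend, and the model name is just a placeholder.

```python
# Minimal sketch: select the Conch backend via the environment variable added
# by this PR, then run a normal vLLM generation. The model name is a placeholder.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "CONCH"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```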
Code Review
This PR adds the Conch Triton Attention backend, improving performance, especially on MI300X. The integration is well-structured, and the new backend code is clear. I've suggested a minor formatting improvement and a docstring correction for better consistency.
Purpose
This PR adds a new V1 Attention backend for Conch. Conch implements all required kernels in Triton (varlen attention w/ KV cache, `reshape_and_cache`, and `scaled_fp8_quant`) and should be compatible with any hardware platform supported by Triton (though this PR has only been tested on H100 and MI300X).
Why add another Triton backend? We already have one!
In my testing of both microbenchmarks and end-to-end serving, Conch provides slightly better performance than the existing Triton backend in vLLM for both prefill and decode attention. I haven't invested a tremendous amount of time into analyzing the differences, but I believe Conch performs better on prefill because vLLM's Triton attention implementation has unnecessary loop iterations, and better on decode because it uses FlashDecoding for long sequences.
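For readers unfamiliar with FlashDecoding: the KV sequence for a single decode query is split into chunks, partial attention is computed per chunk with a running max and sum, and the partials are combined at the end, which recovers parallelism when the batch is small but the sequence is long. Below is a rough PyTorch reference of that idea, purely illustrative and not Conch's actual Triton kernel (the shapes and the `num_splits` knob are my assumptions).

```python
# Illustrative FlashDecoding-style split-KV attention for one decode query.
import torch

def flash_decoding_reference(q, k, v, num_splits=4):
    # q: [num_heads, head_dim]; k, v: [seq_len, num_heads, head_dim]
    scale = q.shape[-1] ** -0.5
    seq_len = k.shape[0]
    chunk = (seq_len + num_splits - 1) // num_splits

    partial_out, partial_max, partial_sum = [], [], []
    for start in range(0, seq_len, chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        scores = torch.einsum("hd,chd->hc", q, k_c) * scale        # [H, c]
        m = scores.max(dim=-1, keepdim=True).values                # [H, 1]
        p = torch.exp(scores - m)                                  # [H, c]
        partial_out.append(torch.einsum("hc,chd->hd", p, v_c))     # [H, D]
        partial_max.append(m)
        partial_sum.append(p.sum(dim=-1, keepdim=True))            # [H, 1]

    # Combine the per-chunk partials with a numerically stable reduction.
    m_global = torch.cat(partial_max, dim=-1).max(dim=-1, keepdim=True).values
    out = torch.zeros_like(q)
    denom = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)
    for o, m, s in zip(partial_out, partial_max, partial_sum):
        alpha = torch.exp(m - m_global)                            # [H, 1]
        out += alpha * o
        denom += alpha * s
    return out / denom
```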
Conch's implementation of `reshape_and_cache` likely gives similar performance to CUDA, but I haven't tried tuning it. Conch's implementation of `scaled_fp8_quant` is likely a bit slower than CUDA, but again, I haven't spent time trying to optimize it.
Test Plan
Disclaimer: This is my first PR to vLLM, so I'm happy to run additional testing/performance measurements.
I measured end-to-end serving performance on both H100 and MI300X via the following commands (which I copied from another kernel PR; I'm not sure if this is standard at all):
I modified the attention backend to test Flash Attention (H100), Triton Attention (H100 and MI300X), and Conch (H100 and MI300X).
Test Result
H100
(Baseline) Flash Attention
(Baseline) Triton Attention
Conch
MI300X
(Baseline) Triton Attention
Conch
Conclusion
Conch performs slightly better than the existing Triton backend on H100, and quite a bit better on MI300X. I only tested relatively short sequences, but the difference should be more pronounced for long sequences (>=4096 tokens) because Conch uses FlashDecoding. I also have more benchmark results for quantized models and an FP8 KV cache that I can share, and I'm happy to collect more as needed. Please ask any questions below, and thank you in advance for your review!