Add support for encoder embedding models #19988
base: main
Conversation
Signed-off-by: Max de Bayser <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀.
Summary of Changes
Hello @maxdebayser, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the system's capability by introducing robust support for encoder embedding models and the token_type_ids they often require. I've adapted the attention mechanisms, KV cache management, and core model logic to seamlessly integrate these new model types, ensuring broader compatibility and efficient execution.
Highlights
- Encoder Model Support: I've added comprehensive support for encoder embedding models, which do not use a causal attention mask. This expands the range of models that can be efficiently served by the system.
- Token Type ID Handling: Full end-to-end support for token_type_ids has been implemented. This is crucial for models like cross-encoder/ms-marco-MiniLM-L-6-v2 that require these IDs to be passed from the tokenizer to the model's forward pass.
- FlashAttention Integration: The FlashAttention backend (FlashAttentionImpl) now supports AttentionType.ENCODER_ONLY models by dynamically determining causality from the attention type; a minimal sketch of this idea follows after this list. This allows encoder models to leverage the performance benefits of FlashAttention.
- KV Cache Adaptations: The KV cache mechanism has been updated to support encoder-only models for prefix caching. Additionally, I've introduced logic to automatically disable features like prefix caching and chunked prefill when non-decoder (e.g., encoder) attention types are detected, as these optimizations are not applicable or can cause issues with non-causal models.
- Model Compatibility & Refinements: I've removed the SupportsV0Only interface from BertEmbeddingModel, BertForSequenceClassification, and RobertaForSequenceClassification, signifying their full compatibility with the V1 engine. The RoBERTa embedding layer's position ID handling has also been refined to correctly manage its specific requirements.
- Temporary CUDA Graph Limitation: For pooling operations with encoder models, CUDA graphs are temporarily disabled and eager mode is enforced due to current compatibility limitations.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This PR adds support for encoder embedding models and models that require token_type_ids to be passed from the tokenizer to the model. The changes include modifications to the test suite, configuration files, and model execution logic to accommodate these new model types. The code review identified some areas for improvement, including adding more descriptive error messages, making code conditional on model architecture, and simplifying logic.
pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct", | ||
marks=[pytest.mark.skip_v1]), | ||
marks=[pytest.mark.skip_v0]), |
logger.warning("CUDA graph is not supported for pooling yet, " | ||
"fallback to the eager mode.") | ||
self.enforce_eager = True |
This warning message and the subsequent setting of self.enforce_eager could be made conditional on the specific model architecture or a more general flag that indicates whether the model is an encoder model. This would avoid unnecessary warnings and eager mode enforcement for models that might not have this issue.
if self.runner_type == "pooling":
    # WIP: currently cuda graphs are not working for encoder models.
    logger.warning("CUDA graph is not supported for pooling yet, "
                   "fallback to the eager mode.")
    self.enforce_eager = True
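A minimal, runnable sketch of what this suggestion could look like, assuming a hypothetical is_encoder_model flag; the class and field names below are illustrative, not vLLM's actual config API.

import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ModelRunnerConfig:
    runner_type: str
    is_encoder_model: bool
    enforce_eager: bool = False

    def maybe_disable_cuda_graphs(self) -> None:
        # Fall back to eager mode only for encoder pooling models, so
        # decoder-based pooling models keep CUDA graphs.
        if self.runner_type == "pooling" and self.is_encoder_model:
            logger.warning("CUDA graph is not supported for encoder pooling "
                           "models yet; falling back to eager mode.")
            self.enforce_eager = True

cfg = ModelRunnerConfig(runner_type="pooling", is_encoder_model=True)
cfg.maybe_disable_cuda_graphs()
assert cfg.enforce_eager is True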
if attn_type not in [
        AttentionType.DECODER, AttentionType.ENCODER_ONLY
]:
    raise NotImplementedError("Encoder/decoder cross-attention "
                              "is not implemented for "
                              "FlashAttentionImpl")
The NotImplementedError message is not very descriptive. It should indicate which attention types are not implemented and which attention types are supported.
if attn_type not in [
        AttentionType.DECODER, AttentionType.ENCODER_ONLY
]:
    raise NotImplementedError("FlashAttentionImpl only supports DECODER "
                              "and ENCODER_ONLY attention types.")
if isinstance(spec, AttentionSpec) and \
        spec.attn_type != "decoder":
    logger.info("Found non-decoder layer. Disabling "
                "prefix cache and chunked prefill")
    self.vllm_config.cache_config.\
        enable_prefix_caching = False
    self.vllm_config.scheduler_config.\
        enable_chunked_prefill = False
    self.vllm_config.scheduler_config.\
        chunked_prefill_enabled = False
    self.vllm_config.scheduler_config.\
        long_prefill_token_threshold = 0
    break
Consider adding a check to ensure that prefix caching and chunked prefill are not enabled when a non-decoder layer is found. This would provide a more explicit error message to the user, rather than silently disabling the features.
for kv_cache_spec_one_worker in kv_cache_specs:
    for _, spec in kv_cache_spec_one_worker.items():
        if isinstance(spec, AttentionSpec) and \
                spec.attn_type != "decoder":
            if self.vllm_config.cache_config.enable_prefix_caching or \
                    self.vllm_config.scheduler_config.enable_chunked_prefill:
                raise ValueError("Prefix caching and chunked prefill are "
                                 "not supported for non-decoder models.")
    0,
    torch.from_numpy(token_indices),
    out=self.input_ids_cpu[:total_num_scheduled_tokens])
if self.input_batch.token_type_ids_cpu_tensor is not None:
    token_type_ids = torch.index_select(
        self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
        torch.from_numpy(token_indices))
    # Copy the tensors to the GPU.
    self.get_token_type_ids()[:total_num_scheduled_tokens]\
        .copy_(token_type_ids, non_blocking=True)
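For readers unfamiliar with the gather step above, here is a small standalone illustration of how torch.index_select pulls the scheduled tokens' type IDs out of a flattened CPU buffer. The shapes and indices are made up for the example; they are not vLLM's actual buffers.

import numpy as np
import torch

# Two requests with up to 4 tokens each; token type 1 marks the second
# segment of a sentence pair.
token_type_ids_cpu = torch.tensor([[0, 0, 1, 1],
                                   [0, 1, 1, 0]], dtype=torch.int32)

# Flattened indices of the tokens scheduled in this step:
# request 0 -> tokens 0..2, request 1 -> tokens 0..1.
token_indices = np.array([0, 1, 2, 4, 5])

gathered = torch.index_select(token_type_ids_cpu.flatten(), 0,
                              torch.from_numpy(token_indices))
print(gathered)  # tensor([0, 0, 1, 0, 1], dtype=torch.int32)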
Consider adding a check to ensure that self.input_batch.token_type_ids_cpu_tensor is not None before attempting to copy the token type IDs to the GPU. This would prevent a potential error if the model does not use token type IDs.
if self.input_batch.token_type_ids_cpu_tensor is not None:
    token_type_ids = torch.index_select(
        self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
        torch.from_numpy(token_indices))
    # Copy the tensors to the GPU.
    self.get_token_type_ids()[:total_num_scheduled_tokens]\
        .copy_(token_type_ids, non_blocking=True)
has_token_types = self.token_type_ids is not None
model_kwargs = {}

if self.is_multimodal_model and get_pp_group().is_first_rank:
    # NOTE(woosuk): To unify token ids and soft tokens (vision
    # embeddings), we always use embeddings (rather than token ids)
    # as input to the multimodal model, even when the input is text.
    input_ids = self.input_ids[:num_scheduled_tokens]
    if has_token_types:
        model_kwargs["token_type_ids"] = cast(
            torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
Consider adding a check to ensure that has_token_types is True before attempting to access self.token_type_ids. This would prevent a potential error if the model does not use token type IDs.
input_ids = self.input_ids[:num_scheduled_tokens]
if has_token_types:
    model_kwargs["token_type_ids"] = cast(
        torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
if mm_embeds:
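As a side note on the snippet above: routing token_type_ids through a model_kwargs dict means models that never use token types are untouched. A toy, self-contained sketch of that pattern (the forward function is illustrative, not vLLM's model runner):

import torch

def forward(input_ids: torch.Tensor, **model_kwargs) -> torch.Size:
    # Models that don't use token types simply never read the kwarg.
    token_type_ids = model_kwargs.get("token_type_ids")
    return input_ids.shape if token_type_ids is None else token_type_ids.shape

input_ids = torch.arange(6)
model_kwargs = {}
token_type_ids = torch.zeros(6, dtype=torch.long)  # present only for encoder models
if token_type_ids is not None:
    model_kwargs["token_type_ids"] = token_type_ids
print(forward(input_ids, **model_kwargs))  # torch.Size([6])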
This PR is a follow-up to #16188. It adds support for encoder models, which don't have a causal attention mask. It also adds support for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model.
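To make the token_type_ids requirement concrete, here is a short example of what the tokenizer produces for a sentence pair. It uses the transformers library and downloads the tokenizer; it is not part of this PR and only shows why the segment IDs must reach the model's forward pass.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
enc = tokenizer("what is vLLM?", "vLLM is an inference and serving engine.",
                return_tensors="pt")
# 0s mark the query segment, 1s mark the passage segment.
print(enc["token_type_ids"])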