
[Model] Activated LoRA #19710


Open · tdoublep wants to merge 19 commits into main

Conversation

tdoublep (Member) commented on Jun 16, 2025

Purpose

This PR adds support for Activated LoRA (aLoRA): a new family of LoRA adapters that are invoked by including an invocation string in the prompt, with the weights adapted only for the tokens that come after the invocation string. This means one can apply an aLoRA deep in a multi-turn interaction with the model without needing to recompute the entire KV cache: the adapter can reuse the base model's KV cache right up until the point where it is invoked, significantly reducing TTFT.
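
To make this concrete, here is a minimal sketch (illustrative only, not code from this PR) of how an aLoRA prompt splits into a base-model part and an adapted part, using the invocation string of the uncertainty-detection adapter from the example further down:

```python
# Illustrative only; the invocation string is the one used by the
# uncertainty-detection aLoRA in the offline example below.
conversation_so_far = (
    "<|start_of_role|>user<|end_of_role|>What is MIT?<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>MIT is ...<|end_of_text|>\n"
)
invocation_string = "<|start_of_role|>certainty<|end_of_role|>"

alora_prompt = conversation_so_far + invocation_string

# Tokens of `conversation_so_far`: processed with base-model weights, so the
# KV cache already built for the base model can be reused as-is.
# Tokens of `invocation_string` and everything generated after it: processed
# with the aLoRA-adapted weights.
```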

paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora

Results from the paper:
[figure omitted]

Implementation

We have tried to make the changes as unobtrusive as possible (but we are happy to hear any suggestions for how the PR can be improved).

If one sets the --enable-activated-lora flag, the following happens:

  • At model loading time, we replace the QKV projection layers with an equivalent aLoRA implementation (right now, the aLoRA weights change the QKV projection layers only).
  • Each incoming aLoRA request is scanned for the invocation tokens, and the resulting invocation_start is stored in the lora_request object.
  • When computing the block hashes for prefix caching, the invocation_start information is used to determine whether base-model KV cache blocks can be re-used.
  • We introduce a simple ALoRAMetadata class that passes a single mask tensor down to the LoRA layers (a rough sketch of this bookkeeping is given after this list).
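
For reviewers, here is a rough sketch of the per-request bookkeeping described above. This is not code from the PR: the helper name, the occurrence chosen (first vs. last), and whether the invocation tokens themselves are adapted are all assumptions made for illustration.

```python
from typing import Optional

import torch


def find_invocation_start(prompt_ids: list[int],
                          invocation_ids: list[int]) -> Optional[int]:
    """Return the start index of the last occurrence of the invocation token
    sequence in the prompt, or None if it never appears (sketch only)."""
    n, m = len(prompt_ids), len(invocation_ids)
    for start in range(n - m, -1, -1):
        if prompt_ids[start:start + m] == invocation_ids:
            return start
    return None


# Illustrative token ids.
prompt_ids = [11, 22, 33, 44, 7, 8, 9, 55]
invocation_ids = [7, 8, 9]

invocation_start = find_invocation_start(prompt_ids, invocation_ids)

# Blocks containing only tokens before `invocation_start` hash like base-model
# blocks for prefix caching, so their KV cache can be re-used.
# The boolean mask below (True = apply the LoRA delta; we assume here that the
# invocation tokens themselves are adapted) is the kind of tensor that
# ALoRAMetadata carries down to the LoRA layers.
use_lora_mask = torch.arange(len(prompt_ids)) >= invocation_start
```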

We have tested that the integration works with:

  • Chunked prefill
  • Torch compile
  • Prefix caching

Test Plan

We have included an offline example using an uncertainty-detection aLoRA.

If the community would like to have this feature in vLLM, we are happy to add more extensive unit and integration tests.
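
For context, a minimal offline-usage sketch is shown below. The engine argument name (enable_activated_lora) and the base model are assumptions on our side; only the adapter (ibm-granite/granite-3.2-8b-alora-uncertainty) and the invocation string come from the example output further down. The packaged example in examples/alora/ remains the authoritative reference.

```python
# Sketch only: argument names and the base model are assumptions, not the
# final API of this PR.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="ibm-granite/granite-3.2-8b-instruct",  # assumed base model
    enable_lora=True,
    enable_activated_lora=True,  # assumed Python-API analogue of --enable-activated-lora
)

alora = LoRARequest(
    "uncertainty",
    1,
    "ibm-granite/granite-3.2-8b-alora-uncertainty",  # adapter from the example log
)

prompt = (
    "<|start_of_role|>user<|end_of_role|>What is MIT?<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>MIT is ...<|end_of_text|>\n"
    "<|start_of_role|>certainty<|end_of_role|>"  # aLoRA invocation string
)

outputs = llm.generate(prompt, SamplingParams(max_tokens=8), lora_request=alora)
print(outputs[0].outputs[0].text)  # e.g. a certainty percentage
```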

Test Result

I've included some debug print statements in the scheduler to explicitly illustrate the KV-cache re-use when applying the aLoRA:

$ python examples/alora/alora_offline_example.py
...
INFO 06-19 14:17:13 [scheduler.py:427] request_id:          0
INFO 06-19 14:17:13 [scheduler.py:428] num_tokens:          12
INFO 06-19 14:17:13 [scheduler.py:429] num_computed_tokens: 0
INFO 06-19 14:17:13 [scheduler.py:430] num_new_tokens:      12
Prompt: '<|start_of_role|>user<|end_of_role|>What is MIT?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>', Generated text: '1. MIT, or Massachusetts Institute of Technology, is a prestigious private research university located in Cambridge, Massachusetts, USA.\n2. It was founded in 1861 and is known for its strong programs in science, technology, engineering, and mathematics (STEM).\n3. MIT is often ranked as one of the top universities globally and is a member of the Ivy League.\n4. It is renowned for its innovative research, influential faculty, and notable alumni.'
WARNING 06-19 14:17:14 [tokenizer.py:295] No tokenizer found in /home/zrltpa/.cache/huggingface/hub/models--ibm-granite--granite-3.2-8b-alora-uncertainty/snapshots/0d8ce48cdd4280a1e8fc37aa1de07537670ecf21, using base model tokenizer instead. (Exception: <class 'transformers.models.granite.configuration_granite.GraniteConfig'>)
INFO 06-19 14:17:14 [scheduler.py:427] request_id:          1
INFO 06-19 14:17:14 [scheduler.py:428] num_tokens:          139
INFO 06-19 14:17:14 [scheduler.py:429] num_computed_tokens: 128
INFO 06-19 14:17:14 [scheduler.py:430] num_new_tokens:      11
Time: 0.5810742378234863
Prompt: '<|start_of_role|>user<|end_of_role|>What is MIT?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>1. MIT, or Massachusetts Institute of Technology, is a prestigious private research university located in Cambridge, Massachusetts, USA.\n2. It was founded in 1861 and is known for its strong programs in science, technology, engineering, and mathematics (STEM).\n3. MIT is often ranked as one of the top universities globally and is a member of the Ivy League.\n4. It is renowned for its innovative research, influential faculty, and notable alumni.<|end_of_text|>\n<|start_of_role|>certainty<|end_of_role|>', Generated text: '85.75%'

(Optional) Documentation Update

tbd


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @tdoublep, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers the foundational components for Activated LoRA (aLoRA) within the vLLM framework. It enables dynamic application of LoRA weights based on prompt content, introduces necessary metadata structures, optimizes KV cache usage for these requests, and provides comprehensive examples for testing and deployment.

Highlights

  • Activated LoRA (aLoRA) Implementation: This PR introduces the initial working implementation of Activated LoRA (aLoRA), a technique that allows for selective application of LoRA weights based on specific 'invocation tokens' within the prompt. This enables dynamic switching between base model and LoRA weights during inference.
  • Core aLoRA Logic: The central mechanism for aLoRA is implemented in vllm/lora/layers.py, where a mask is dynamically generated based on k_offsets and query_start_locs. This mask determines which parts of the output should use the base model's computations and which should incorporate the LoRA adjustments, effectively blending the two outputs (a rough sketch follows this list).
  • KV Cache Optimization for aLoRA: To optimize performance, the KV cache utility (vllm/v1/core/kv_cache_utils.py) has been updated to allow KV cache sharing for aLoRA requests. Specifically, the portion of the prompt before the aLoRA invocation tokens is treated as a base model request for caching purposes, reducing redundant computations.
  • Metadata and Request Handling: New fields (invocation_tokens, k_offset) have been added to LoRARequest to define the aLoRA activation. A new ALoRAMetadata dataclass is introduced in vllm/forward_context.py to pass these activation-specific details through the model's forward pass. The engine processor and GPU model runner are updated to extract and utilize this metadata, including tokenizing invocation strings from adapter configurations.
  • Testing and Examples: New example scripts (examples/alora/alora_server_testing.py, alora_server_testing.sh, new_alora_testing.py) are provided to demonstrate how to set up and interact with a vLLM server running aLoRA, both via the OpenAI-compatible API and directly through the vLLM Python API.
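
To illustrate the blending described in the "Core aLoRA Logic" bullet above, here is a rough sketch. The tensor names follow that bullet, but the exact semantics of k_offsets (assumed here to be the number of trailing tokens per sequence that use the adapter) are an assumption; the real implementation lives in vllm/lora/layers.py.

```python
import torch


def blend_alora_output(
    base_out: torch.Tensor,          # [num_tokens, hidden]: projection without LoRA
    lora_out: torch.Tensor,          # [num_tokens, hidden]: projection with LoRA applied
    query_start_locs: torch.Tensor,  # [num_seqs + 1]: token offset where each sequence starts
    k_offsets: torch.Tensor,         # [num_seqs]: assumed trailing-token count that uses LoRA
) -> torch.Tensor:
    """Keep the base-model output before the invocation point of each sequence
    and the LoRA output after it (sketch only)."""
    num_tokens = base_out.shape[0]
    mask = torch.zeros(num_tokens, dtype=torch.bool, device=base_out.device)
    for i in range(k_offsets.numel()):
        end = int(query_start_locs[i + 1])
        k = int(k_offsets[i])
        mask[end - k:end] = True  # trailing k tokens of sequence i use the adapter
    return torch.where(mask.unsqueeze(-1), lora_out, base_out)
```

This matches the ALoRAMetadata plumbing described in the implementation notes above, where a single mask tensor is passed down to the LoRA layers.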

mergify bot added the documentation and v1 labels on Jun 16, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces an initial implementation of Activated LoRA (aLoRA). The changes include adding new example scripts, modifying core components like the forward context, LoRA request, KV cache utilities, scheduler, processor, and GPU model runner to support aLoRA metadata extraction and application. The core logic for identifying the aLoRA invocation sequence and applying the mask seems correctly implemented. Feedback includes addressing a type mismatch in a metadata class, removing a debug print statement, and clarifying the purpose of layer registration in the compilation config.

tdoublep changed the title from "[Model] Initial working implementation of Activated LoRA" to "[Model] Activated LoRA" on Jun 16, 2025
Co-authored-by: Greenewald <[email protected]>
Co-authored-by: Allison Li <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>

mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 19, 2025
tdoublep added 3 commits June 19, 2025 14:10
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
tdoublep marked this pull request as ready for review on June 19, 2025, 14:30
Signed-off-by: Thomas Parnell <[email protected]>
Comment on lines +7 to +8
import regex as re

tdoublep (Member, Author) commented:

(this is needed due to some linting issue on main currently, unrelated to this PR)

tdoublep added 2 commits June 19, 2025 17:04
… checking activated lora flag

Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Labels: documentation, frontend, tool-calling, v1