Description
Motivation.
In many use cases, such as search, recommendation, and content understanding, LLMs are used to generate embeddings for downstream consumption. In these cases, the sampling steps are not needed.
In V0, pooling/embedding use cases are supported (doc). Currently, when a pooling task is specified, vLLM falls back to V0 on engine initialization. We propose adding pooling support in V1.
I'd like to seek some early feedback on the ideas, as well as the implementation in 22quinn#2, before I fully implement it.
Proposed Change.
V0 Overview
Key pointers:
- Worker initializes with a pooling model runner: code
- The encode call adds a request with PoolingParams (see the usage sketch after this list): code
- Pooling model runner executes the model and returns the pooled hidden states: code
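For reference, a minimal usage sketch of the V0 pooling path. The model name is just an example, and the output attribute names may vary by version:

```python
from vllm import LLM

# task="embed" routes the engine to the pooling path instead of generation.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

# encode() runs a forward pass and pools the hidden states; no sampling occurs.
outputs = llm.encode(["Hello, world!"])
embedding = outputs[0].outputs.data  # the pooled hidden states / embedding
```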
Key design (a condensed sketch follows the list):
- PoolingModelRunner extends GPUModelRunnerBase
- The normal ModelRunner for text generation also extends GPUModelRunnerBase: code
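The class names below match V0; the bodies are illustrative only, not the real implementation:

```python
class GPUModelRunnerBase:
    # Shared logic: model loading, input preparation, CUDA graph capture, etc.
    def load_model(self) -> None: ...

class ModelRunner(GPUModelRunnerBase):
    # Text-generation path: forward pass followed by the sampler.
    def execute_model(self, model_input, kv_caches): ...

class PoolingModelRunner(GPUModelRunnerBase):
    # Pooling path: forward pass followed by pooling of the hidden states;
    # the sampler is never invoked.
    def execute_model(self, model_input, kv_caches): ...
```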
V1 Proposal
There are two main design choices; I'm leaning toward choice 1 (see the sketch after this list).
- Choice 1: Branch on pooling vs. sampling in the model runner and every other place that currently assumes sampling
- Choice 2: Create a separate pooling model runner, like V0
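A hedged sketch of what choice 1 could look like inside the V1 GPU model runner. The helper names here (`self.is_pooling`, `self.pooler`, `pooled_hidden_states`) are hypothetical, for illustration only:

```python
class GPUModelRunner:
    # Sketch only: existing V1 fields and setup elided; ModelRunnerOutput
    # stands in for the (assumed) V1 output type.

    def execute_model(self, scheduler_output):
        # One forward pass shared by both task types.
        hidden_states = self.model(**self._prepare_inputs(scheduler_output))

        if self.is_pooling:  # hypothetical flag derived from the pooling task
            # Pooling path: reduce hidden states (e.g. last-token or mean
            # pooling) and skip the sampler entirely.
            pooled = self.pooler(hidden_states)
            return ModelRunnerOutput(pooled_hidden_states=pooled)

        # Generation path: unchanged, run the sampler as before.
        sampler_output = self.sampler(hidden_states, scheduler_output)
        return ModelRunnerOutput(sampled_token_ids=sampler_output.token_ids)
```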
We have a working V1 hack: 22quinn#2. Note this is just a very rough hack to get things working; lots of polish remains and more features need to be added.
Major changes needed include (a rough sketch of the output plumbing follows the list):
- Return EngineCoreOutput early for pooling in the main scheduler loop: code
- Return PoolingRequestOutput instead of RequestOutput in the OutputProcessor: code
- Add hidden states in the ModelRunnerOutput: code
- Some further modifications are needed to properly skip the sampler in various places
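To make the output plumbing concrete, a minimal sketch of the first three changes. The field names are assumptions, not final:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ModelRunnerOutput:
    # Existing generation output.
    sampled_token_ids: Optional[list[list[int]]] = None
    # New: pooled hidden states for pooling requests. Its presence tells the
    # scheduler to emit an EngineCoreOutput immediately, since a pooling
    # request finishes in a single forward pass.
    pooled_hidden_states: Optional[torch.Tensor] = None
```

In the scheduler loop, a pooling request whose `pooled_hidden_states` is set would then be marked finished and returned early, and the OutputProcessor would wrap it into a PoolingRequestOutput instead of a RequestOutput.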
The other option (choice 2) is a separate PoolingModelRunner just like in V0, but that requires major refactoring of the existing model runners to avoid duplicated logic.
(Optional) Frontend Review
The focus of this RFC is the V1 backend, but I took this opportunity to review the frontend, which seems like it could be consolidated. That said, I might lack some historical context on why we implemented these features within vLLM rather than leaving further processing of the embedding to the user.
Currently it supports four types of pooling tasks (doc, code): `embed`, `classify`, `score`, and `reward`.
For the API call, it supports four methods (code) on the `LLM` class: `encode`, `embed`, `classify`, and `score`.
Essentially, the only task needed is `embed` and the only API needed is `encode`; everything else is a wrapper, and it is possible to eliminate those higher-level features, shifting freedom and responsibility to the user side (see the sketch below).
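To make the consolidation argument concrete, here is an illustrative sketch of how the higher-level methods reduce to `encode`. The signatures and the `task` field are simplified assumptions, not the actual frontend API:

```python
from dataclasses import dataclass


@dataclass
class PoolingParams:  # simplified stand-in for vllm.PoolingParams
    task: str


class LLM:
    def encode(self, prompts, pooling_params):
        """The one primitive: forward pass + pooling, no sampling."""
        ...

    def embed(self, prompts):
        # embed == encode with an "embed" task (plus e.g. normalization).
        return self.encode(prompts, PoolingParams(task="embed"))

    def classify(self, prompts):
        # classify == encode, then apply the classification head / softmax.
        return self.encode(prompts, PoolingParams(task="classify"))

    def score(self, text_1, text_2):
        # score == encode on a text pair, returning a similarity score.
        return self.encode([(text_1, text_2)], PoolingParams(task="score"))
```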
Feedback Period.
No response
CC List.
@houseroad @simon-mo @WoosukKwon @njhill @DarkLight1337 @robertgshaw2-redhat
Any Other Things.
No response