Description
Motivation.
In many use cases, such as search, recommendation, and content understanding, LLMs are used to generate embeddings for downstream consumption. In these cases, the sampling steps are not needed.
In V0, pooling/embedding use cases are supported (doc). Currently, when a pooling task is specified, vLLM falls back to V0 on engine initialization. We propose adding pooling support in V1.
I'd like to seek some early feedback on the ideas, as well as the implementation in 22quinn#2, before I fully implement it.
Proposed Change.
V0 Overview
Key pointers:
- Worker initializes with a pooling model runner: code
- The encode call adds a request with PoolingParams (see the usage sketch after this list): code
- Pooling model runner executes the model and returns the pooled hidden states: code
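For reference, a minimal usage sketch of the V0 pooling path. The model name is just an example, and the output attribute names may vary by version:

```python
from vllm import LLM

# task="embed" routes the engine to the pooling path instead of generation.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

# encode() runs a forward pass and pools the hidden states; no sampling occurs.
outputs = llm.encode(["Hello, world!"])
embedding = outputs[0].outputs.data  # the pooled hidden states / embedding
```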
Key design (a condensed sketch follows the list):
- PoolingModelRunner extends GPUModelRunnerBase
- The normal ModelRunner for text generation also extends GPUModelRunnerBase: code
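The class names below match V0; the bodies are illustrative only, not the real implementation:

```python
class GPUModelRunnerBase:
    # Shared logic: model loading, input preparation, CUDA graph capture, etc.
    def load_model(self) -> None: ...

class ModelRunner(GPUModelRunnerBase):
    # Text-generation path: forward pass followed by the sampler.
    def execute_model(self, model_input, kv_caches): ...

class PoolingModelRunner(GPUModelRunnerBase):
    # Pooling path: forward pass followed by pooling of the hidden states;
    # the sampler is never invoked.
    def execute_model(self, model_input, kv_caches): ...
```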
V1 Proposal
There are two main design choices; I'm leaning toward choice 1 (see the sketch after this list).
- Choice 1: Branch on pooling vs. sampling in the model runner and every other place that currently assumes sampling
- Choice 2: Create a separate pooling model runner, like V0
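A hedged sketch of what choice 1 could look like inside the V1 GPU model runner. The helper names here (`self.is_pooling`, `self.pooler`, `pooled_hidden_states`) are hypothetical, for illustration only:

```python
class GPUModelRunner:
    # Sketch only: existing V1 fields and setup elided; ModelRunnerOutput
    # stands in for the (assumed) V1 output type.

    def execute_model(self, scheduler_output):
        # One forward pass shared by both task types.
        hidden_states = self.model(**self._prepare_inputs(scheduler_output))

        if self.is_pooling:  # hypothetical flag derived from the pooling task
            # Pooling path: reduce hidden states (e.g. last-token or mean
            # pooling) and skip the sampler entirely.
            pooled = self.pooler(hidden_states)
            return ModelRunnerOutput(pooled_hidden_states=pooled)

        # Generation path: unchanged, run the sampler as before.
        sampler_output = self.sampler(hidden_states, scheduler_output)
        return ModelRunnerOutput(sampled_token_ids=sampler_output.token_ids)
```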
We have a working V1 hack: 22quinn#2. Note this is just a very rough hack to get things working; lots of polish remains and more features need to be added.
Major changes needed include (a rough sketch of the output plumbing follows the list):
- Return EngineCoreOutput early for pooling in the main scheduler loop: code
- Return PoolingRequestOutput instead of RequestOutput in the OutputProcessor: code
- Add hidden states in the ModelRunnerOutput: code
- Some further modifications are needed to properly skip the sampler in various places
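To make the output plumbing concrete, a minimal sketch of the first three changes. The field names are assumptions, not final:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ModelRunnerOutput:
    # Existing generation output.
    sampled_token_ids: Optional[list[list[int]]] = None
    # New: pooled hidden states for pooling requests. Its presence tells the
    # scheduler to emit an EngineCoreOutput immediately, since a pooling
    # request finishes in a single forward pass.
    pooled_hidden_states: Optional[torch.Tensor] = None
```

In the scheduler loop, a pooling request whose `pooled_hidden_states` is set would then be marked finished and returned early, and the OutputProcessor would wrap it into a PoolingRequestOutput instead of a RequestOutput.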
The other option (choice 2) is a separate PoolingModelRunner just like in V0, but that requires major refactoring of the existing model runners to avoid duplicated logic.
(Optional) Frontend Review
The focus of this RFC is the V1 backend, but I took this opportunity to review the frontend, which seems like it could be consolidated. That said, I might lack some historical context on why we implemented these features within vLLM rather than leaving further processing of the embedding to the user.
Currently it supports four types of pooling tasks (doc, code): `embed`, `classify`, `score`, and `reward`.
For the API call, it supports four methods (code) on the `LLM` class: `encode`, `embed`, `classify`, and `score`.
Essentially, the only task needed is `embed` and the only API needed is `encode`; everything else is a wrapper, and it is possible to eliminate those higher-level features, shifting freedom and responsibility to the user side (see the sketch below).
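To make the consolidation argument concrete, here is an illustrative sketch of how the higher-level methods reduce to `encode`. The signatures and the `task` field are simplified assumptions, not the actual frontend API:

```python
from dataclasses import dataclass


@dataclass
class PoolingParams:  # simplified stand-in for vllm.PoolingParams
    task: str


class LLM:
    def encode(self, prompts, pooling_params):
        """The one primitive: forward pass + pooling, no sampling."""
        ...

    def embed(self, prompts):
        # embed == encode with an "embed" task (plus e.g. normalization).
        return self.encode(prompts, PoolingParams(task="embed"))

    def classify(self, prompts):
        # classify == encode, then apply the classification head / softmax.
        return self.encode(prompts, PoolingParams(task="classify"))

    def score(self, text_1, text_2):
        # score == encode on a text pair, returning a similarity score.
        return self.encode([(text_1, text_2)], PoolingParams(task="score"))
```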
Feedback Period.
No response
CC List.
@houseroad @simon-mo @WoosukKwon @njhill @DarkLight1337 @robertgshaw2-redhat
Any Other Things.
No response