Description
Motivation.
Currently, vLLM v1 has no in-house solution for offloading KV cache data from GPU memory to another medium (in particular, CPU memory).
There is a proposed RFC (#16144) and respective PRs (#13377 and #17653) that aim to address this.
The approach they take is somewhat similar to the way offloading was implemented in V0:
- On the scheduler side, extend the core GPU allocator (`KVCacheManager`) to support CPU offloading.
- On the worker side, add a synchronous call in the `execute_model` function to handle the actual CPU<->GPU transfers.
In this RFC, I propose an alternative approach that satisfies the following requirements:
- Async saving of new KV data from the GPU to the cache. GPU memory will not be freed until the save completes (similar to NixlConnector).
- Async loading of KV data from the cache to the GPU. Requests waiting for a cache load won't be scheduled until the load completes (similar to NixlConnector).
- Support pluggable backends for the cache (starting with a CPU backend; see the sketch after this list).
- Allow pulling cache events (insert, evict, access) in the same way as for the GPU cache, enabling a unified `KVCacheEvent` stream.
- Enable LRU eviction of offloaded data.
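To make the pluggable-backend requirement concrete, here is a minimal sketch of what a backend interface and a CPU implementation could look like. The `OffloadingBackend`/`CPUBackend` names and method signatures are illustrative only, not part of the proposed API:

```python
# Illustrative sketch only: the OffloadingBackend/CPUBackend names and method
# signatures below are hypothetical, not part of the proposed API.
from abc import ABC, abstractmethod

import torch


class OffloadingBackend(ABC):
    """A medium (e.g. CPU memory) that holds offloaded KV blocks, keyed by block hash."""

    @abstractmethod
    def store(self, block_hash: int, kv_block: torch.Tensor) -> None:
        """Copy one GPU KV block into this medium."""

    @abstractmethod
    def load(self, block_hash: int, dst: torch.Tensor) -> None:
        """Copy a previously stored block back into the given GPU tensor."""

    @abstractmethod
    def evict(self, block_hash: int) -> None:
        """Drop a block, e.g. when the LRU policy reclaims space."""


class CPUBackend(OffloadingBackend):
    """Minimal CPU-memory backend; a real one would use dedicated CUDA
    streams and async copies instead of the blocking copies shown here."""

    def __init__(self) -> None:
        self._blocks: dict[int, torch.Tensor] = {}

    def store(self, block_hash: int, kv_block: torch.Tensor) -> None:
        # Pinned host memory enables fast GPU<->CPU transfers.
        buf = torch.empty(kv_block.shape, dtype=kv_block.dtype,
                          device="cpu", pin_memory=True)
        buf.copy_(kv_block)
        self._blocks[block_hash] = buf

    def load(self, block_hash: int, dst: torch.Tensor) -> None:
        dst.copy_(self._blocks[block_hash])

    def evict(self, block_hash: int) -> None:
        self._blocks.pop(block_hash, None)
```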
Proposed Change.
I suggest we enable offloading using a new `OffloadingConnector`, with minimal changes to vLLM's core.
On the scheduler side, this connector will delegate to an abstract `OffloadingManager`. The `OffloadingManager` will be responsible for bookkeeping allocations and evictions of offloaded data. Its output will be encoded in `KVConnectorMetadata` and sent to the workers.
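As a rough illustration of this scheduler-side role, the manager could look roughly like the following; the method names, signatures, and metadata layout are my own, not a finalized API:

```python
# Hypothetical sketch of the scheduler-side manager; method names, signatures,
# and the metadata layout are illustrative only.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class OffloadingMetadata:
    """Per-step transfer instructions, shipped to workers inside KVConnectorMetadata."""
    # (job_id, gpu_block_ids, block_hashes) tuples describing each transfer.
    stores: list[tuple[int, list[int], list[int]]] = field(default_factory=list)
    loads: list[tuple[int, list[int], list[int]]] = field(default_factory=list)


class OffloadingManager(ABC):
    @abstractmethod
    def lookup(self, block_hashes: list[int]) -> int:
        """Return how many leading blocks are hits in the offloaded cache."""

    @abstractmethod
    def prepare_load(self, request_id: str, block_hashes: list[int]) -> int:
        """Pin blocks so they are not evicted mid-load; return a job_id."""

    @abstractmethod
    def prepare_store(self, request_id: str, block_hashes: list[int]) -> int:
        """Allocate space, evicting LRU entries if needed; return a job_id."""

    @abstractmethod
    def build_metadata(self) -> OffloadingMetadata:
        """Emit the transfer instructions accumulated during this step."""
```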
On the worker side, we will also have an `OffloadingConnector`, which parses the load/store requests sent by the `OffloadingManager` and executes them asynchronously using a set of worker threads, one per transfer type (e.g. one thread for GPU->CPU, one for CPU->GPU, and in the future also one for CPU->Disk, etc.).
Each transfer request submitted by the `OffloadingManager` will be assigned a unique `job_id`, which will be used to track completions, as sketched below.
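To illustrate this worker-side execution model, here is a minimal sketch of a per-transfer-type worker thread that drains a job queue and records completed job_ids; the `TransferWorker` class and its method names are hypothetical:

```python
# Illustrative sketch of a per-transfer-type worker thread; the TransferWorker
# class and its method names are hypothetical.
import queue
import threading
from typing import Callable


class TransferWorker:
    """One instance per transfer type (e.g. GPU->CPU, CPU->GPU)."""

    def __init__(self, transfer_fn: Callable[[list[int], list[int]], None]):
        self._transfer_fn = transfer_fn  # performs the actual block copies
        self._jobs: queue.Queue = queue.Queue()
        self._finished: set[int] = set()
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, job_id: int, src_blocks: list[int], dst_blocks: list[int]) -> None:
        """Enqueue a transfer job parsed from the connector metadata."""
        self._jobs.put((job_id, src_blocks, dst_blocks))

    def get_finished(self) -> set[int]:
        """Drain the completed job_ids, to be reported back to the scheduler."""
        with self._lock:
            finished, self._finished = self._finished, set()
        return finished

    def _run(self) -> None:
        while True:
            job_id, src, dst = self._jobs.get()
            self._transfer_fn(src, dst)
            with self._lock:
                self._finished.add(job_id)
```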
To enable this design, we need three changes in vLLM's core:
- PR #19555 ([V1] Enable worker -> scheduler connector metadata): introduce connector metadata in the worker -> scheduler direction as well (currently, only the scheduler-side connector can pass metadata to the workers, but not the other way around). A sketch of how completions could flow back over this channel is shown after this list.
- PR #19728 (v1: Add `Request.block_hashes`): introduce `Request.block_hashes` to allow the `OffloadingConnector` to re-use the block hashes computed by the `KVCacheManager`.
- PR #19737 (v1: Support KV events from connectors): add a connector API for collecting KV cache events (cache insertion, deletion).
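As a sketch of why the worker -> scheduler direction matters for this design: the worker-side connector would report finished job_ids back each step, and the scheduler-side connector would map them to requests whose GPU blocks can now be freed or that can be scheduled again. The dataclass and helper below are illustrative only:

```python
# Hypothetical sketch of worker -> scheduler completion reporting; the
# dataclass and helper names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class WorkerConnectorOutput:
    """Metadata returned from the worker to the scheduler each step."""
    finished_saves: set[int] = field(default_factory=set)  # completed store job_ids
    finished_loads: set[int] = field(default_factory=set)  # completed load job_ids


def apply_worker_output(
    output: WorkerConnectorOutput,
    save_jobs: dict[int, str],  # job_id -> request_id of in-flight saves
    load_jobs: dict[int, str],  # job_id -> request_id of in-flight loads
) -> tuple[list[str], list[str]]:
    """Map finished job_ids back to requests: requests whose saves completed
    can have their GPU blocks freed; requests whose loads completed can be
    scheduled again."""
    freeable = [save_jobs.pop(j) for j in output.finished_saves if j in save_jobs]
    runnable = [load_jobs.pop(j) for j in output.finished_loads if j in load_jobs]
    return freeable, runnable
```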
Aside from the above PRs with changes to vLLM's core, I have already opened PR #19848, which includes the basis of a pluggable offloading implementation.
On top of this PR, there will be PRs for a concrete CPU offloading implementation (probably one PR for the scheduler side and one for the worker side).
The last step will be to introduce the actual `OffloadingConnector` that will enable end-to-end use of offloading in v1.
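For illustration, once the `OffloadingConnector` lands, enabling offloading end-to-end could look something like the snippet below; the connector name and the extra-config keys are placeholders, not a final interface:

```python
# Hypothetical end-to-end usage once the OffloadingConnector lands; the
# connector name and the extra-config keys are placeholders.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",  # proposed connector (placeholder name)
        kv_role="kv_both",  # both saves and loads
        # e.g. how much CPU memory to dedicate to offloaded KV blocks:
        kv_connector_extra_config={"num_cpu_blocks": 10_000},
    ),
)
print(llm.generate("The capital of France is"))
```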
Feedback Period.
One week.
CC List.
@WoosukKwon @simon-mo @robertgshaw2-redhat @njhill
Any Other Things.
No response