---
layout: post
title: "Transformers backend integration in vLLM"
author: "The Hugging Face Team"
image: /assets/figures/transformers-backend/transformers-backend.png
thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
share-img: /assets/figures/transformers-backend/transformers-backend.png
---

The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index) offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, transformers is the go-to toolkit for all of it.

But when it comes to *deploying* these models at scale, inference speed and efficiency often take center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.

A recent addition to the vLLM codebase lets it use transformers as a backend for running models: vLLM then applies its throughput and latency optimizations on top of existing transformers architectures. In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility** with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how these libraries stack up.

**Infer with transformers**

The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("The future of AI is")

print(result[0]["generated_text"])
```

This approach is perfect for prototyping or small-scale tasks, but it’s not optimized for high-volume inference or low-latency deployment.

**Infer with vLLM**

vLLM takes a different tack, prioritizing efficiency with features like `PagedAttention` (a memory-efficient attention mechanism) and dynamic batching of incoming requests. Here’s the same task in vLLM:

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("The future of AI is", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```

vLLM’s inference is noticeably faster and more resource-efficient, especially under load: `PagedAttention` keeps GPU memory usage in check while many requests are batched together, so a single server can sustain far higher throughput than a plain generation loop.
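
To see that batching in action, you can hand `generate` a list of prompts and let vLLM schedule them together. Here is a minimal sketch that reuses the `llm` and `params` objects from the snippet above (the prompts are just illustrative):

```py
prompts = [
    "The future of AI is",
    "The capital of France is",
    "In a galaxy far, far away",
]

# vLLM batches and schedules these prompts internally; no manual batching needed.
outputs = llm.generate(prompts, sampling_params=params)

for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```
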
## vLLM’s Deployment Superpower: OpenAI Compatibility

Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for external services. Launch a server:

```bash
vllm serve meta-llama/Llama-3.2-1B
```

Then query it with curl:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```

Or use Python’s OpenAI client:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0
)
print("Completion result:", completion.choices[0].text)
```

This compatibility slashes costs and boosts control, letting you scale inference locally with vLLM’s optimizations.
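
Because the client is the standard OpenAI SDK, features such as streaming also work against the local server. A minimal sketch, assuming the server launched above is still running:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream tokens as they are generated instead of waiting for the full completion.
stream = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=32,
    temperature=0,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```
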
## Why do we need the transformers backend?

The transformers library is optimized for contributions and the [addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new model to vLLM, on the other hand, is a little [more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html).

In an **ideal world**, we would be able to use a new model in vLLM as soon as it is added to transformers. With the integration of the transformers backend, we take a step towards that ideal world.

Here is the [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#remote-code) on how to make your transformers model compatible with vLLM so that the integration kicks in. We followed it and made `modeling_gpt2.py` compatible with the integration! You can follow the changes in this [transformers pull request](https://github.com/huggingface/transformers/pull/36934).
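
At a high level, the requirement is that the model's attention module dispatches through transformers' pluggable attention interface, so vLLM can slot in its own attention implementation. The schematic below paraphrases the pattern described in the linked documentation; `MyAttention` and `MyModel` are illustrative names, and the `...` placeholders stand in for the usual projections, rotary embeddings, and output handling:

```py
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):
    def forward(self, hidden_states, **kwargs):
        ...
        # Look up whichever attention implementation the config requests
        # (eager, sdpa, flash-attention, or the one vLLM registers).
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    # Opt in to pluggable attention backends such as vLLM's.
    _supports_attention_backend = True
```
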
For a model already in transformers (and compatible with vLLM), this is what we would need to do:

```py
llm = LLM(model="new-transformers-model", model_impl="transformers")
```

> [!NOTE]
> It is not strictly necessary to pass the `model_impl` parameter. vLLM switches to the transformers
> implementation on its own if the model is not natively supported in vLLM.

Or for a custom model from the Hugging Face Hub:

```py
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```

This backend acts as a **bridge**, marrying transformers’ plug-and-play flexibility with vLLM’s inference prowess. You get the best of both worlds: rapid prototyping with transformers and optimized deployment with vLLM.

## Case Study: Helium

[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet natively supported by vLLM. If you still want to run optimized inference on the model with vLLM, this is where the transformers backend shines.

Let’s see this in action:

```bash
vllm serve kyutai/helium-1-preview-2b --model-impl transformers
```

Query it with the OpenAI API:

```py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(model="kyutai/helium-1-preview-2b", prompt="What is AI?")
print("Completion result:", completion)
```

Here, vLLM efficiently processes requests, leveraging the transformers backend to load `kyutai/helium-1-preview-2b` seamlessly. Compared to running the model directly in transformers, vLLM delivers lower latency and better resource utilization.
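
The same `model_impl="transformers"` switch applies to the offline `LLM` API, if you would rather skip the server. A minimal sketch:

```py
from vllm import LLM, SamplingParams

# Load Helium through the transformers backend for offline inference.
llm = LLM(model="kyutai/helium-1-preview-2b", model_impl="transformers")
params = SamplingParams(max_tokens=32)

outputs = llm.generate("What is AI?", sampling_params=params)
print(outputs[0].outputs[0].text)
```
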
By pairing Transformers’ model ecosystem with vLLM’s inference optimizations, you unlock a workflow that’s both flexible and scalable. Whether you’re prototyping a new model, deploying a custom creation, or scaling a multimodal app, this combination accelerates your path from research to production.