---
layout: post
title: "Transformers backend integration in vLLM"
author: "The Hugging Face Team"
image: /assets/figures/transformers-backend/transformers-backend.png
thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
share-img: /assets/figures/transformers-backend/transformers-backend.png
---

The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
offers a flexible, unified interface to a vast ecosystem of model architectures. From research to
fine-tuning on custom datasets, transformers is the go-to toolkit.

But when it comes to *deploying* these models at scale, inference speed and efficiency often take
center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput
inference that pulls models from the Hugging Face Hub and optimizes them for production-ready performance.

A recent addition to the vLLM codebase lets it use transformers as a backend to run models, so
vLLM can optimize throughput and latency on top of existing transformers architectures.
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
these libraries stack up.

**Infer with transformers**

The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("The future of AI is")

print(result[0]["generated_text"])
```

This approach is perfect for prototyping or small-scale tasks, but it’s not optimized for high-volume
inference or low-latency deployment.
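
The `pipeline` call above uses default generation settings. If you want the same control over output
length that the vLLM snippet below gets from `SamplingParams`, you can pass generation arguments
straight to the call; a minimal sketch:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")

# Generation arguments are forwarded to `model.generate`: cap the continuation
# at 20 new tokens and disable sampling for a deterministic output.
result = pipe("The future of AI is", max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```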

**Infer with vLLM**

vLLM takes a different track, prioritizing efficiency with features like `PagedAttention`
(a memory-efficient attention mechanism) and continuous batching. Here’s the same task in vLLM:

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("The future of AI is", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```

vLLM’s inference is noticeably faster and more resource-efficient, especially under load.
Thanks to continuous batching and `PagedAttention`, it can serve thousands of requests per second while keeping GPU memory usage low.
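
Because vLLM batches and schedules requests continuously, the gains are most visible when you hand it
many prompts at once. Here is a sketch of offline batched generation (the prompts and sampling
settings are illustrative):

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20, temperature=0.8)

# A list of prompts is processed as a single batch: vLLM schedules them
# together instead of generating one prompt after another.
prompts = [
    "The future of AI is",
    "The capital of France is",
    "Large language models are",
]
outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```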

## vLLM’s Deployment Superpower: OpenAI Compatibility

Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for
hosted OpenAI-style services. Launch a server:

```bash
vllm serve meta-llama/Llama-3.2-1B
```

Then query it with curl:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```

Or use Python’s OpenAI client:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print("Completion result:", completion.choices[0].text)
```
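
Streaming works through the same client, which is handy for interactive applications; here is a
minimal sketch (the prompt and token budget are illustrative):

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream tokens as they are produced instead of waiting for the full completion.
stream = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=32,
    temperature=0,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```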

This compatibility cuts costs and gives you more control, letting you scale inference locally with vLLM’s optimizations.

## Why do we need the transformers backend?

The transformers library is optimized for contributions and the
[addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new
model to vLLM, on the other hand, is a
[more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html) process.

In an **ideal world**, we would be able to use a new model in vLLM as soon as it is added to
transformers. With the transformers backend integration, we take a step towards that ideal.

The [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#remote-code)
explains how to make your transformers model compatible with vLLM so the integration kicks in.
We followed it and made `modeling_gpt2.py` compatible! You can follow the changes in this
[transformers pull request](https://github.com/huggingface/transformers/pull/36934).
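
In practice, the compatibility requirements mostly come down to the modeling code dispatching its
attention through transformers’ pluggable attention interface and propagating `**kwargs` from the
model’s `forward` down to the attention layers, which is what lets vLLM slot its own attention
implementation in. Below is a rough, simplified sketch of that pattern, modeled on recent
transformers modeling files; it is not the actual GPT-2 change, and attributes such as
`_supports_attention_backend` are assumptions to be checked against the docs linked above:

```py
from typing import Callable, Optional

import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, **kwargs):
    # Minimal eager fallback: plain scaled dot-product attention.
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = torch.softmax(attn_weights, dim=-1)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights


class MyAttention(nn.Module):
    # Toy single-head attention; only the dispatch pattern matters here.
    def __init__(self, config: PretrainedConfig, head_dim: int = 64):
        super().__init__()
        self.config = config
        self.scaling = head_dim**-0.5
        self.q_proj = nn.Linear(head_dim, head_dim)
        self.k_proj = nn.Linear(head_dim, head_dim)
        self.v_proj = nn.Linear(head_dim, head_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        **kwargs,  # kwargs must keep flowing down to the attention call
    ):
        # Shape projections to (batch, num_heads=1, seq_len, head_dim).
        query = self.q_proj(hidden_states).unsqueeze(1)
        key = self.k_proj(hidden_states).unsqueeze(1)
        value = self.v_proj(hidden_states).unsqueeze(1)

        # Look the attention implementation up instead of hard-coding it:
        # `_attn_implementation` is set when the model is loaded, and vLLM
        # registers its own attention under this interface.
        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, _ = attention_interface(
            self, query, key, value, attention_mask, scaling=self.scaling, **kwargs
        )
        return attn_output


class MyModel(PreTrainedModel):
    # Assumption: flag described in the vLLM documentation for backend compatibility.
    _supports_attention_backend = True
```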

For a model that is already in transformers (and compatible with vLLM), this is all we need to do:

```py
llm = LLM(model="new-transformers-model", model_impl="transformers")
```

> [!NOTE]
> It is not strictly necessary to set the `model_impl` parameter: vLLM falls back to the transformers
> implementation on its own if the model is not natively supported in vLLM.

Or for a custom model from the Hugging Face Hub:

```py
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```
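
The same applies when serving over HTTP. Assuming the placeholder model id above, the server-side
equivalent would look like this, with `--trust-remote-code` mirroring `trust_remote_code=True`:

```bash
vllm serve custom-hub-model --model-impl transformers --trust-remote-code
```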

This backend acts as a **bridge**, marrying transformers’ plug-and-play flexibility with vLLM’s
inference prowess. You get the best of both worlds: rapid prototyping with transformers
and optimized deployment with vLLM.

## Case Study: Helium

[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet natively supported in vLLM. If you want to run optimized inference on the model anyway, this is where the transformers backend shines.

Let’s see this in action:

```bash
vllm serve kyutai/helium-1-preview-2b --model-impl transformers
```

Query it with the OpenAI-compatible API:

```py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(model="kyutai/helium-1-preview-2b", prompt="What is AI?")
print("Completion result:", completion.choices[0].text)
```

Here, vLLM efficiently processes inputs, leveraging the transformers backend to load
`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in transformers,
vLLM delivers lower latency and better resource utilization.
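
If you would rather not run a server, the same model works offline through the `LLM` class shown
earlier; a minimal sketch (the sampling settings are illustrative):

```py
from vllm import LLM, SamplingParams

# The transformers backend loads the Helium architecture straight from transformers.
llm = LLM(model="kyutai/helium-1-preview-2b", model_impl="transformers")
params = SamplingParams(max_tokens=50, temperature=0.7)

outputs = llm.generate("What is AI?", sampling_params=params)
print(outputs[0].outputs[0].text)
```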

By pairing transformers’ model ecosystem with vLLM’s inference optimizations, you unlock a workflow
that’s both flexible and scalable. Whether you’re prototyping a new model, deploying a custom
creation, or scaling a multimodal app, this combination accelerates your path from research to production.