
v0.8.0

Released by @github-actions on 18 Mar 17:52

v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!

Highlights

V1

We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more details. We expect better performance in supported scenarios. If you'd like to disable V1, set the environment variable VLLM_USE_V1=0, and please send us a GitHub issue sharing the reason!
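
For example, a minimal way to opt back out of V1 from Python is to set the variable before vLLM is imported; the model name below is only a placeholder:

```python
import os

# Must be set before vLLM is imported; V1 is the default engine in v0.8.0.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

# Placeholder model; any supported checkpoint works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate("Hello, world!")[0].outputs[0].text)
```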

DeepSeek Improvements

We observe state-of-the-art performance when running DeepSeek models on the latest version of vLLM:

  • MLA Enhancements:
  • Distributed Expert Parallelism (EP) and Data Parallelism (DP)
    • EP Support for DeepSeek Models (#12583)
    • Add enable_expert_parallel arg (#14305); see the sketch after this list
    • EP/TP MoE + DP Attention (#13931)
    • Set up data parallel communication (#13591)
  • MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
  • Pipeline Parallelism:
    • DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
    • Improve pipeline partitioning (#13839)
  • GEMM
    • Add streamK for block-quantized CUTLASS kernels (#12978)
    • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
    • Add more tuned configs for H20 and others (#14877)
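
The enable_expert_parallel argument referenced above is exposed through the engine arguments. Below is a minimal offline sketch; the checkpoint name, parallel sizes, and trust_remote_code setting are illustrative assumptions rather than a prescribed configuration:

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: shard MoE expert weights across GPUs (expert parallelism)
# while keeping tensor parallelism for the attention layers.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder DeepSeek checkpoint
    tensor_parallel_size=2,                # adjust to your GPU count
    enable_expert_parallel=True,           # new in v0.8.0 (#14305)
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["What is expert parallelism?"], params)[0].outputs[0].text)
```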

New Models

  • Gemma 3 (#14660)
    • Note: You have to install transformers from the main branch (pip install git+https://github.com/huggingface/transformers.git) to use this model. Also, there may be numerical instabilities with the float16/half dtype, so please use bfloat16 (preferred by HF) or float32; see the example after this list.
  • Mistral Small 3.1 (#14957)
  • Phi-4-multimodal-instruct (#14119)
  • Grok1 (#13795)
  • QwQ-32B and tool calling (#14479, #14478)
  • Zamba2 (#13185)
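
Following the Gemma 3 note above, here is a minimal offline-inference sketch using the recommended bfloat16 dtype; the checkpoint name is an assumption, and transformers must first be installed from its main branch as described:

```python
from vllm import LLM, SamplingParams

# Per the note above, avoid float16 for Gemma 3 and request bfloat16 explicitly.
llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")  # placeholder checkpoint

params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate(["Explain KV caching in one sentence."], params)[0].outputs[0].text)
```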

NVIDIA Blackwell

  • Support nvfp4 cutlass gemm (#13571)
  • Add cutlass support for blackwell fp8 gemm (#13798)
  • Update the flash attn tag to support Blackwell (#14244)
  • Add ModelOpt FP4 Checkpoint Support (#12520)

Breaking Changes

  • The default value of seed is now None to align with PyTorch and Hugging Face. Please explicitly set seed for reproducibility. (#14274) See the example after this list.
  • The kv_cache and attn_metadata arguments of the model's forward method have been removed, as the attention backend has access to these values via forward_context. (#13887)
  • vLLM now defaults to the model's generation_config for the chat template and for sampling parameters such as temperature. (#12622)
  • Several request time metrics (vllm:time_in_queue_requests, vllm:model_forward_time_milliseconds, vllm:model_execute_time_milliseconds) have been deprecated and are subject to removal. (#14135)
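
As a sketch of how the first and third items affect offline usage (the model name is a placeholder): pass seed explicitly to keep sampling reproducible, and set sampling parameters per request if you do not want the defaults picked up from the model's generation_config.

```python
from vllm import LLM, SamplingParams

# seed now defaults to None; set it explicitly for reproducible sampling.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=42)

# Explicit SamplingParams override the defaults taken from the model's
# generation_config (temperature, top_p, ...).
params = SamplingParams(temperature=0.0, max_tokens=32)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```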

Updates

  • Update to PyTorch 2.6.0 (#12721, #13860)
  • Update to Python 3.9 typing (#14492, #13971)
  • Update to CUDA 12.4 as default for release and nightly wheels (#12098)
  • Update to Ray 2.43 (#13994)
  • Upgrade aiohttp to include a CVE fix (#14840)
  • Upgrade jinja2 to get 3 moderate CVE fixes (#14839)

Features

Frontend API

  • API Server
    • Support return_tokens_as_token_id as a request param (#14066)
    • Support image embeddings as input (#13955)
    • New /load endpoint for load statistics (#13950)
    • New API endpoint /is_sleeping (#14312); see the client sketch after this list
    • Enable the /score endpoint for embedding models (#12846)
    • Enable streaming for Transcription API (#13301)
    • Make model param optional in request (#13568)
    • Support SSL Key Rotation in HTTP Server (#13495)
  • Reasoning
    • Support reasoning output (#12955)
    • Support outlines engine with reasoning outputs (#14114)
    • Update reasoning with stream example to use OpenAI library (#14077)
  • CLI
    • Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
    • Add vllm bench CLI (#13993)
  • Make the LLM API compatible with the torchrun launcher (#13642)
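
A minimal client-side sketch for the new server endpoints referenced above (/load and /is_sleeping), using the requests library; the host, port, and response payloads shown are assumptions about a locally running vllm serve instance:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local vLLM OpenAI-compatible server

# New in v0.8.0 (#14312): ask whether the engine is currently sleeping.
resp = requests.get(f"{BASE_URL}/is_sleeping")
print("is_sleeping:", resp.json())

# New in v0.8.0 (#13950): server load statistics (payload shape may differ).
resp = requests.get(f"{BASE_URL}/load")
print("load:", resp.json())
```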

Disaggregated Serving

  • Support KV cache offloading and disagg prefill with LMCache connector (#12953)
  • Support chunked prefill for LMCache connector (#14505)

LoRA

  • Add LoRA support for TransformersModel (#13770)
  • Make the device profiler include LoRA memory (#14469)
  • Gemma3ForConditionalGeneration supports LoRA (#14797)
  • Retire SGMV and BGMV kernels (#14685)

VLM

  • Generalized prompt updates for multi-modal processor (#13964)
  • Deprecate legacy input mapper for OOT multimodal models (#13979)
  • Refer to code examples for common cases in the dev multimodal processor (#14278)

Quantization

  • BaiChuan SupportsQuant (#13710)
  • BartModel SupportsQuant (#14699)
  • Bamba SupportsQuant (#14698)
  • Deepseek GGUF support (#13167)
  • GGUF MoE kernel (#14613)
  • Add GPTQAllSpark Quantization (#12931)
  • Better performance of gptq marlin kernel when n is small (#14138)

Structured Output

  • xgrammar: Expand list of unsupported jsonschema keywords (#13783)

Hardware Support

AMD

  • Faster Custom Paged Attention kernels (#12348)
  • Improved performance for V1 Triton (ROCm) backend (#14152)
  • Chunked prefill/paged attention in MLA on ROCm (#14316)
  • Perf improvement for DSv3 on AMD GPUs (#13718)
  • MoE fp8 block quant tuning support (#14068)

TPU

  • Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
  • Support start_profile/stop_profile in TPU worker (#13988)
  • Add TPU v1 test (#14834)
  • TPU multimodal model support for ragged attention (#14158)
  • Add tensor parallel support via Ray (#13618)
  • Enable prefix caching by default (#14773)

Neuron

  • Add Neuron device communicator for vLLM v1 (#14085)
  • Add custom_ops for neuron backend (#13246)
  • Add reshape_and_cache (#14391)
  • Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)

CPU

  • Upgrade CPU backend to torch-2.6 (#13381)
  • Support FP8 KV cache in the CPU backend (#14741)

s390x

  • Add CPU inference with the VXE ISA for the s390x architecture (#12613)
  • Add documentation for s390x cpu implementation (#14198)

Plugins

  • Remove hard-coded CUDA in models and layers (#13658)
  • Move use allgather to platform (#14010)

Bugfix and Enhancements

  • Fix illegal memory access for MoE on H20 (#13693)
  • Fix FP16 overflow for DeepSeek V2 (#13232)
  • Fix illegal memory access in the blockwise CUTLASS FP8 GEMMs (#14396)
  • Pass all driver env vars to ray workers unless excluded (#14099)
  • Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
  • Capture and log the time of loading weights (#13666)

Developer Tooling

Benchmarks

  • Consolidate performance benchmark datasets (#14036)
  • Update benchmarks README (#14646)

CI and Build

  • Add RELEASE.md (#13926)
  • Use env var to control whether to use S3 bucket in CI (#13634)

Documentation

  • Add RLHF document (#14482)
  • Add nsight guide to profiling docs (#14298)
  • Add K8s deployment guide (#14084)
  • Add developer documentation for torch.compile integration (#14437)

Full Changelog: v0.7.3...v0.8.0