
v0.8.3

@github-actions github-actions released this 06 Apr 04:11

Highlights

This release contains 260 commits from 109 contributors, including 38 new contributors.

  • We are excited to announce Day 0 support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide.
    • Note that Llama 4 is currently supported only in the V1 engine.
  • V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.
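For reference, a minimal sketch of serving the new model on the V1 engine. The model ID is the published Llama 4 Scout checkpoint; the parallelism and context-length flags are illustrative and depend on your hardware:

```shell
# Llama 4 is V1-only for now; recent vLLM builds default to V1,
# but the engine can be forced explicitly via VLLM_USE_V1=1.
VLLM_USE_V1=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 8192
```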

Cluster Scale Serving

  • Single node data parallel with API server support (#13923)
  • Multi-node offline DP+EP example (#15484)
  • Expert parallelism enhancements
    • CUTLASS grouped gemm fp8 MoE kernel (#13972)
    • Fused experts refactor (#15914)
    • Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
    • Support fp8 gemm layer input in fp8 (#14578)
    • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
  • Support XpYd disaggregated prefill with MooncakeStore (#12957)
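As background for the expert-parallelism items above, a toy sketch (not vLLM code) of contiguous expert-to-rank placement, assuming the expert count divides evenly across expert-parallel ranks:

```python
def expert_to_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Map a MoE expert to the expert-parallel rank that owns it, using
    contiguous sharding: rank r holds experts [r*per_rank, (r+1)*per_rank)."""
    assert num_experts % ep_size == 0, "this sketch assumes even division"
    per_rank = num_experts // ep_size
    return expert_id // per_rank

# Example: 16 experts over 4 EP ranks -> experts 0-3 land on rank 0, etc.
print([expert_to_rank(e, 16, 4) for e in range(16)])
```

During a fused-MoE forward pass, tokens routed to an expert are dispatched to that expert's rank (all-to-all), computed there, and gathered back; the grouped-GEMM kernels above batch the per-expert matmuls on each rank.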

Model Support

V1 Engine

  • Collective RPC (#15444)
  • Faster top-k only implementation (#15478)
  • BitsAndBytes support (#15611)
  • Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)
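To illustrate what the top-k-only sampling path computes (#15478), a naive reference in plain Python; the optimized implementation reaches the same result without the full sort this sketch uses:

```python
import math

def top_k_filter(logits: list[float], k: int) -> list[float]:
    """Keep the k largest logits and mask the rest to -inf, so softmax
    assigns the masked tokens zero probability."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    out, kept = [], 0
    for x in logits:
        if x >= threshold and kept < k:
            out.append(x)
            kept += 1
        else:
            out.append(-math.inf)
    return out

print(top_k_filter([1.0, 3.0, 2.0, 0.5], 2))  # masks all but the two largest
```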

Features

API

  • Support Enum for xgrammar-based structured output in V1 (#15594, #15757)
  • A new tags parameter for wake_up (#15500)
  • V1 LoRA support CPU offload (#15843)
  • Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
  • Add HTTP service metrics (#15657)
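To illustrate the Enum support (#15594, #15757): an Enum is equivalent to a JSON-schema `enum` constraint that a grammar-based backend such as xgrammar can enforce during decoding. A minimal sketch of that mapping (the helper name is hypothetical, not a vLLM API):

```python
import enum
import json

class Sentiment(enum.Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

def enum_to_json_schema(e: type[enum.Enum]) -> dict:
    """Translate an Enum into the JSON-schema fragment a structured-output
    backend can enforce: the generated string must be exactly one of the
    enum's values."""
    return {"type": "string", "enum": [member.value for member in e]}

print(json.dumps(enum_to_json_schema(Sentiment)))
```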

Performance

  • LoRA Scheduler optimization bridging V1 and V0 performance (#15422).

Hardware

  • AMD:
    • Add custom allreduce support for ROCM (#14125)
    • Quark quantization documentation (#15861)
    • AITER integration: int8 scaled gemm kernel (#15433), fused moe (#14967)
    • Paged attention for V1 (#15720)
  • CPU:
  • TPU:
    • Improve Memory Usage Estimation (#15671)
    • Optimize the all-reduce performance (#15903)
    • Support sliding window and logit soft capping in the paged attention kernel (#15732)
    • TPU-optimized top-p implementation that avoids scattering (#15736)
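For context on the TPU top-p item (#15736), a reference implementation of nucleus filtering in plain Python; the TPU kernel produces the same selection without the scatter step a naive sorted implementation implies:

```python
def top_p_indices(probs: list[float], p: float) -> list[int]:
    """Return the token indices kept by top-p (nucleus) filtering: the
    smallest set of highest-probability tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

print(top_p_indices([0.4, 0.3, 0.2, 0.1], 0.5))  # → [0, 1]
```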

Doc, Build, Ecosystem

  • V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
  • Recommend developing with Python 3.12 in developer guide (#15811)
  • Clean up: move dockerfiles into their own directory (#14549)
  • Add minimum version for huggingface_hub to enable Xet downloads (#15873)
  • TPU CI: Add basic perf regression test (#15414)

What's Changed

New Contributors

Full Changelog: v0.8.2...v0.8.3