## Highlights
Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0
Thank you to all contributors! This release includes the following highlights, along with countless other improvements, fixes, and optimizations.
- Support for many more models:
  - Gemma 3
  - Qwen 2.5 VL
  - Mistral Small 3.1
  - Phi 4 Multimodal (image only)
- Native tool calling support for:
  - Llama 3.1/3.2/3.3
  - Mistral Small 3
  - Mistral Nemo
  - Hermes 2 Pro
  - Hermes 3
- Tensor Parallelism support (NCCL)!
- FlashAttention V3 support and integration in PagedAttention
- 30x reduction in ISQ times on Metal!
- Revamped prefix cacher system
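The native tool calling above is surfaced through the server's OpenAI-compatible chat completions API. As a minimal sketch (the model name and the `get_weather` tool are illustrative placeholders, not part of mistral.rs itself), a tool-calling request body can be built like this:

```python
import json

def build_tool_call_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload with one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical tool for illustration only.
                    "name": "get_weather",
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        # Let the model decide whether to call the tool.
        "tool_choice": "auto",
    }

payload = build_tool_call_request("mistral-small-3", "What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

POSTing a payload shaped like this to the server's `/v1/chat/completions` endpoint should yield a `tool_calls` entry in the response when the model elects to call the tool.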
## What's Changed
- Allow using library in CurrentThread runtime by @sgrebnov in #1082
- Improve accuracy of uqff auto device map by @EricLBuehler in #1084
- DeepSeekV3 sigmoid support by @EricLBuehler in #1092
- GPU-accelerated sampling (+5% decode perf) by @EricLBuehler in #1094
- Fix missing perceiver_config in qwen2vl by @EricLBuehler in #1096
- More topk methods for deepseek 2/3 by @EricLBuehler in #1097
- More accurate layer size computation for deepseek 2/3 by @EricLBuehler in #1098
- Improve streaming UX by @EricLBuehler in #1102
- Faster fp8 blockwise dequant by @EricLBuehler in #1100
- DS2/3 paged attn by @EricLBuehler in #1103
- Faster bincount by @EricLBuehler in #1104
- PagedAttention prompt chunking support by @EricLBuehler in #1105
- Refactor server SSE by @EricLBuehler in #1107
- PagedAttention + FlashAttention (and FlashAttention V3) by @EricLBuehler in #1109
- Take KEEP_ALIVE_INTERVAL into account by @EricLBuehler in #1111
- Refactor enable of flash attn by @EricLBuehler in #1110
- Fix imatrix isq quantize_onto by @EricLBuehler in #1112
- Tensor parallelism and pipeline parallelism by @EricLBuehler in #1113
- Bump openssl from 0.10.69 to 0.10.70 by @dependabot in #1121
- Allow chat streaming to use tools by @Jeadie in #1088
- New file format for imatrix: `.cimatrix` by @EricLBuehler in #1004
- Fix isq with bias for column parallel by @EricLBuehler in #1128
- Multi-node support for tensor parallelism by @EricLBuehler in #1125
- Add an NCCL feature flag by @EricLBuehler in #1129
- Fix mistral 2501 gguf by @EricLBuehler in #1131
- Add jinja strftime_now function by @EricLBuehler in #1132
- Multiple models multi node by @EricLBuehler in #1136
- Remove unexpected cp behavior by @jncraton in #1141
- Revamp speculative decoding! by @EricLBuehler in #1027
- Fuse MLP mul-and-act by @EricLBuehler in #1142
- Short-circuit dry sampling: +6% T/s by @EricLBuehler in #1143
- Integrate fused MLP mul-act for more models! by @EricLBuehler in #1144
- Use cudarc 0.13.5 by @EricLBuehler in #1145
- Handle HF_HUB_CACHE env var by @EricLBuehler in #1146
- FlashAttention V2/V3 metadata with support for device location by @EricLBuehler in #1148
- FP8 blockwise dequant cuda kernel by @EricLBuehler in #1149
- Blockwise FP8 CUDA for cc < 800 by @EricLBuehler in #1150
- Fix chat sampling response by @EricLBuehler in #1154
- Multiple processes for TP by @EricLBuehler in #1152
- Ensure we do not bind the port for daemon processes by @EricLBuehler in #1158
- Handle CUDA_NVCC_FLAGS in flash attn v3 by @EricLBuehler in #1160
- Build fix for ARM by @jamesvren in #1164
- Working PrefixCacherV2! by @EricLBuehler in #1168
- Implement Phi-4 Multimodal! by @EricLBuehler in #1163
- No extra split/cat pair in rope by @EricLBuehler in #1169
- Remove gpu<>cpu sync for faster long-context by @EricLBuehler in #1170
- Refactor NCCL device mappers by @EricLBuehler in #1172
- Bump ring from 0.17.11 to 0.17.13 by @dependabot in #1179
- DSV3/R1 fixes by @EricLBuehler in #1173
- Fix diffusion device mapping by @EricLBuehler in #1187
- Internal abstraction for distributed op by @EricLBuehler in #1188
- Make Sequence::set_toks more safe by @EricLBuehler in #1190
- Fix CI tests out of storage? by @EricLBuehler in #1191
- Internal abstraction for distributed op by @EricLBuehler in #1189
- Fix build_cuda_all.yaml CI by @EricLBuehler in #1193
- Support tensor parallelism for vision models! by @EricLBuehler in #1194
- Always pass _USE_MATH_DEFINES for CUDA by @EricLBuehler in #1195
- Remove matmul via f16 framework by @EricLBuehler in #1196
- Remove API for matmul_via_f16 by @EricLBuehler in #1197
- Add UQFF text/vision model API by @EricLBuehler in #1198
- Complete qwen2_5_vl, and some fixes by @brrr in #1184
- Implement Gemma 3 by @EricLBuehler in #1201
- Add Gemma 3 vision support! by @EricLBuehler in #1202
- Manually fixup sentencepiece detok by @EricLBuehler in #1204
- More vision models with TP by @EricLBuehler in #1200
- Fix topology link in the docs by @etiennebalit in #1205
- Gemma3 1b support and optimized rotating cache by @EricLBuehler in #1206
- Improve rotating kv cache, prefix cacher system by @EricLBuehler in #1207
- Better handling for kvcache set_len by @EricLBuehler in #1208
- Update deps and use rand 0.9 by @EricLBuehler in #1210
- Update hf hub dep, add initial blockwise fp8 GEMM tests by @EricLBuehler in #1212
- Growable RotatingKvCache and fixes for Phi-4 mini by @EricLBuehler in #1215
- Gemma 3 cuda fixes by @EricLBuehler in #1217
- Add pydantic schema examples! by @EricLBuehler in #1219
- Sliding window attention fixes by @EricLBuehler in #1220
- adapt to rig crate as client by @benliao in #1214
- Implement Mistral 3! by @EricLBuehler in #1221
- Metal SDPA with masking by @EricLBuehler in #1225
- Send [DONE] SSE chunk per openai spec by @EricLBuehler in #1226
- Fix handling of device when compiled for but disabled nccl by @EricLBuehler in #1227
- Fix nccl blocking case by @EricLBuehler in #1228
- Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by @EricLBuehler in #1229
- OpenAI API compatibility fixes by @EricLBuehler in #1230
- [Breaking] Automatic server logging by @EricLBuehler in #1231
- Use default stream for flash attn by @EricLBuehler in #1232
- Bump version to 0.5.0 by @EricLBuehler in #1233
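Several of the changes above (#1107, #1226) concern the server's SSE streaming behavior, including terminating each stream with the `[DONE]` sentinel per the OpenAI spec. A minimal client-side sketch of consuming such a stream (the canned event lines are illustrative, not actual server output):

```python
import json

def collect_stream_content(lines):
    """Join the delta contents of an OpenAI-style SSE chat stream."""
    chunks = []
    for line in lines:
        # SSE data events carry a "data: " prefix; skip anything else
        # (blank keep-alive lines, comments).
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel per the OpenAI spec
        delta = json.loads(data)["choices"][0]["delta"]
        chunks.append(delta.get("content", ""))
    return "".join(chunks)

# Example with a canned stream:
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_content(stream))  # prints "Hello"
```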
## New Contributors
- @sgrebnov made their first contribution in #1082
- @jncraton made their first contribution in #1141
- @jamesvren made their first contribution in #1164
- @brrr made their first contribution in #1184
- @etiennebalit made their first contribution in #1205
- @benliao made their first contribution in #1214
Full Changelog: v0.4.0...v0.5.0