Release v0.5.0 · EricLBuehler/mistral.rs

Highlights

Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0

Thank you to all contributors for this release! Beyond the highlights listed below, it includes many other improvements, fixes, and optimizations.

- Support for many more models:
  - Gemma 3
  - Qwen 2.5 VL
  - Mistral Small 3.1
  - Phi 4 Multimodal (image only)
- Native tool calling support for:
  - Llama 3.1/3.2/3.3
  - Mistral Small 3
  - Mistral Nemo
  - Hermes 2 Pro
  - Hermes 3
- Tensor Parallelism support (NCCL)!
- FlashAttention V3 support and integration in PagedAttention
- 30x reduction in ISQ times on Metal!
- Revamped prefix cacher system

What's Changed

Allow using library in CurrentThread runtime by @sgrebnov in #1082
Improve accuracy of uqff auto device map by @EricLBuehler in #1084
DeepSeekV3 sigmoid support by @EricLBuehler in #1092
GPU-accelerated sampling (+5% decode perf) by @EricLBuehler in #1094
Fix missing perceiver_config in qwen2vl by @EricLBuehler in #1096
More topk methods for deepseek 2/3 by @EricLBuehler in #1097
More accurate layer size computation for deepseek 2/3 by @EricLBuehler in #1098
Improve streaming UX by @EricLBuehler in #1102
Faster fp8 blockwise dequant by @EricLBuehler in #1100
DS2/3 paged attn by @EricLBuehler in #1103
Faster bincount by @EricLBuehler in #1104
PagedAttention prompt chunking support by @EricLBuehler in #1105
Refactor server SSE by @EricLBuehler in #1107
PagedAttention + FlashAttention (and FlashAttention V3) by @EricLBuehler in #1109
Take KEEP_ALIVE_INTERVAL into account by @EricLBuehler in #1111
Refactor enable of flash attn by @EricLBuehler in #1110
Fix imatrix isq quantize_onto by @EricLBuehler in #1112
Tensor parallelism and pipeline parallelism by @EricLBuehler in #1113
Bump openssl from 0.10.69 to 0.10.70 by @dependabot in #1121
Allow chat streaming to use tools by @Jeadie in #1088
New file format for imatrix: .cimatrix by @EricLBuehler in #1004
Fix isq with bias for column parallel by @EricLBuehler in #1128
Multi-node support for tensor parallelism by @EricLBuehler in #1125
Add an NCCL feature flag by @EricLBuehler in #1129
Fix mistral 2501 gguf by @EricLBuehler in #1131
Add jinja strftime_now function by @EricLBuehler in #1132
Multiple models multi node by @EricLBuehler in #1136
Remove unexpected cp behavior by @jncraton in #1141
Revamp speculative decoding! by @EricLBuehler in #1027
Fuse MLP mul-and-act by @EricLBuehler in #1142
Short-circuit dry sampling: +6% T/s by @EricLBuehler in #1143
Integrate fused MLP mul-act for more models! by @EricLBuehler in #1144
Use cudarc 0.13.5 by @EricLBuehler in #1145
Handle HF_HUB_CACHE env var by @EricLBuehler in #1146
FlashAttention V2/V3 metadata with support for device location by @EricLBuehler in #1148
FP8 blockwise dequant cuda kernel by @EricLBuehler in #1149
Blockwise FP8 CUDA for cc < 800 by @EricLBuehler in #1150
Fix chat sampling response by @EricLBuehler in #1154
Multiple processes for TP by @EricLBuehler in #1152
Ensure we do not bind the port for daemon processes by @EricLBuehler in #1158
Handle CUDA_NVCC_FLAGS in flash attn v3 by @EricLBuehler in #1160
Build fix for ARM by @jamesvren in #1164
Working PrefixCacherV2! by @EricLBuehler in #1168
Implement Phi-4 Multimodal! by @EricLBuehler in #1163
No extra split/cat pair in rope by @EricLBuehler in #1169
Remove gpu<>cpu sync for faster long-context by @EricLBuehler in #1170
Refactor NCCL device mappers by @EricLBuehler in #1172
Bump ring from 0.17.11 to 0.17.13 by @dependabot in #1179
DSV3/R1 fixes by @EricLBuehler in #1173
Fix diffusion device mapping by @EricLBuehler in #1187
Internal abstraction for distributed op by @EricLBuehler in #1188
Make Sequence::set_toks more safe by @EricLBuehler in #1190
Fix CI tests out of storage? by @EricLBuehler in #1191
Internal abstraction for distributed op by @EricLBuehler in #1189
Fix build_cuda_all.yaml CI by @EricLBuehler in #1193
Support tensor parallelism for vision models! by @EricLBuehler in #1194
Always pass _USE_MATH_DEFINES for CUDA by @EricLBuehler in #1195
Remove matmul via f16 framework by @EricLBuehler in #1196
Remove API for matmul_via_f16 by @EricLBuehler in #1197
Add UQFF text/vision model API by @EricLBuehler in #1198
Complete qwen2_5_vl, and some fixes by @brrr in #1184
Implement Gemma 3 by @EricLBuehler in #1201
Add Gemma 3 vision support! by @EricLBuehler in #1202
Manually fixup sentencepiece detok by @EricLBuehler in #1204
More vision models with TP by @EricLBuehler in #1200
Fix topology link in the docs by @etiennebalit in #1205
Gemma3 1b support and optimized rotating cache by @EricLBuehler in #1206
Improve rotating kv cache, prefix cacher system by @EricLBuehler in #1207
Better handling for kvcache set_len by @EricLBuehler in #1208
Update deps and use rand 0.9 by @EricLBuehler in #1210
Update hf hub dep, add initial blockwise fp8 GEMM tests by @EricLBuehler in #1212
Growable RotatingKvCache and fixes for Phi-4 mini by @EricLBuehler in #1215
Gemma 3 cuda fixes by @EricLBuehler in #1217
Add pydantic schema examples! by @EricLBuehler in #1219
Sliding window attention fixes by @EricLBuehler in #1220
adapt to rig crate as client by @benliao in #1214
Implement Mistral 3! by @EricLBuehler in #1221
Metal SDPA with masking by @EricLBuehler in #1225
Send [DONE] SSE chunk per openai spec by @EricLBuehler in #1226
Fix handling of device when compiled for but disabled nccl by @EricLBuehler in #1227
Fix nccl blocking case by @EricLBuehler in #1228
Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by @EricLBuehler in #1229
OpenAI API compatibility fixes by @EricLBuehler in #1230
[Breaking] Automatic server logging by @EricLBuehler in #1231
Use default stream for flash attn by @EricLBuehler in #1232
Bump version to 0.5.0 by @EricLBuehler in #1233
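
For client authors, the `[DONE]` change in #1226 means a streaming consumer can stop reading once it sees that sentinel. The following is a minimal sketch of that client-side handling, assuming OpenAI-style `data:` lines; the function name `iter_sse_chunks` and the sample payloads are hypothetical illustrations, not part of mistral.rs.

```python
import json


def iter_sse_chunks(lines):
    """Yield parsed JSON payloads from SSE lines until the [DONE] sentinel.

    `lines` is any iterable of text lines, e.g. from a streaming HTTP response.
    (Hypothetical helper for illustration; not a mistral.rs API.)
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel: stop without parsing it as JSON
        yield json.loads(payload)


# Hypothetical sample stream in the OpenAI chat-completion chunk shape.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
    "",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
print(text)
```

Treating `[DONE]` as a sentinel rather than a JSON payload avoids a parse error on the final chunk and gives the client a clear point at which to close the connection.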

New Contributors

@sgrebnov made their first contribution in #1082
@jncraton made their first contribution in #1141
@jamesvren made their first contribution in #1164
@brrr made their first contribution in #1184
@etiennebalit made their first contribution in #1205
@benliao made their first contribution in #1214

Full Changelog: v0.4.0...v0.5.0
