Quantization in mistral.rs

Mistral.rs supports the following quantization methods:

  • GGUF/GGML
    • Q, K type
    • Supported in GGUF/GGML and GGUF/GGML adapter models
    • Supported in all plain/vision and adapter models via ISQ
    • Imatrix quantization is supported
    • I quants coming!
    • CPU, CUDA, Metal (all supported devices)
    • 2, 3, 4, 5, 6, 8 bit
  • GPTQ
    • Supported in all plain/vision and adapter models
    • CUDA only
    • 2, 3, 4, 8 bit
    • Marlin kernel support for 4-bit and 8-bit
  • HQQ
    • Supported in all plain/vision and adapter models via ISQ
    • 4, 8 bit
    • CPU, CUDA, Metal (all supported devices)
  • FP8
    • Supported in all plain/vision and adapter models
    • CPU, CUDA, Metal (all supported devices)
  • BNB
    • Supported in all plain/vision and adapter models
    • bitsandbytes int8, fp4, nf4 support
  • AFQ
    • 2, 3, 4, 6, 8 bit
    • 🔥 Designed to be fast on Metal!
    • Only supported on Metal.
  • ISQ
    • Supported in all plain/vision and adapter models
    • Works on all supported devices
    • Automatically selects the fastest and most accurate quantization method for the device
    • Supports:
      • Q, K type GGUF quants
      • AFQ
      • HQQ
      • FP8
  • MLX prequantized
    • Supported in all plain/vision and adapter models

Using a GGUF quantized model

  • Use the gguf (CLI) / GGUF (Python) model selector (a Python sketch follows the CLI example below)
  • Provide the GGUF file
cargo run --features cuda -- -i gguf -f my-gguf-file.gguf
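
For the Python API, the rough equivalent looks like the following. This is a minimal sketch, assuming the mistralrs package's Runner, Which.GGUF, and ChatCompletionRequest interfaces; the tokenizer repo, GGUF repo, and filename are placeholders to replace with your own.

```python
# Sketch (not an official example): load a GGUF-quantized model via the Python API.
# Assumes the `mistralrs` package exposes Runner, Which.GGUF, and ChatCompletionRequest;
# the repo IDs and filename below are placeholders.
from mistralrs import ChatCompletionRequest, Runner, Which

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",            # tokenizer source (placeholder)
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",  # GGUF repo (placeholder)
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",    # the GGUF file to load
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",  # informational name
        messages=[{"role": "user", "content": "Explain Q4_K_M quantization in one sentence."}],
        max_tokens=128,
    )
)
print(res.choices[0].message.content)
```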

Using ISQ

See the ISQ documentation for details.

cargo run --features cuda -- -i --isq Q4K plain -m microsoft/Phi-3-mini-4k-instruct -a phi3
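
The same ISQ selection can be made from the Python API. A sketch, assuming Runner accepts an in_situ_quant argument and that Which.Plain/Architecture exist as in the package's examples (the value uses the same names as the CLI --isq flag):

```python
# Sketch: apply ISQ (Q4K here) while loading a plain model via the Python API.
# Assumes Runner accepts an `in_situ_quant` argument and that Which.Plain /
# Architecture are available as in the mistralrs Python examples.
from mistralrs import Architecture, Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        arch=Architecture.Phi3,
    ),
    in_situ_quant="Q4K",  # same names as the CLI --isq flag
)
```

Any of the ISQ targets listed above (GGUF Q/K types, AFQ, HQQ, FP8) can be requested this way.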

Using a GPTQ quantized model

  • Provide the model ID for the GPTQ model
  • Mistral.rs will automatically detect and use GPTQ quantization for plain and vision models!
  • The Marlin kernel will automatically be used for 4-bit and 8-bit.
cargo run --features cuda --release -- -i plain -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
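
A Python sketch of the same thing, assuming the Runner / Which.Plain interface shown above; since GPTQ is detected automatically, no quantization argument is passed:

```python
# Sketch: load a GPTQ-quantized model via the Python API. GPTQ is detected from the
# model's quantization config, so no extra quantization argument is needed.
# Assumes Runner / Which.Plain / Architecture as in the mistralrs Python examples.
from mistralrs import Architecture, Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
        arch=Architecture.Phi3,
    )
)
```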

You can create your own GPTQ model using [scripts/convert_to_gptq.py](../scripts/convert_to_gptq.py):

pip install gptqmodel transformers datasets

python3 scripts/convert_to_gptq.py --src path/to/model --dst output/model/path --bits 4

Using an MLX prequantized model (on Metal)

  • Provide the model ID for the MLX prequantized model
  • Mistral.rs will automatically detect and use the MLX quantization for plain and vision models!
  • Specialized kernels will be used to accelerate inference!
cargo run --features ... --release -- -i plain -m mlx-community/Llama-3.2-1B-8bit
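
As with GPTQ, no quantization argument is needed from the Python API either. A brief sketch under the same assumptions as above (the model ID is the one from the CLI example):

```python
# Sketch: load an MLX-prequantized model on Metal via the Python API; the
# quantization is detected automatically from the model files.
# Assumes Runner / Which.Plain / Architecture as in the mistralrs Python examples.
from mistralrs import Architecture, Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="mlx-community/Llama-3.2-1B-8bit",
        arch=Architecture.Llama,
    )
)
```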