Support GGUF quantized models #413

Bumblebee supports only a hardcoded set of Hugging Face models. I found that, for example, Llama 3.2 might run about 2x faster with a 60% smaller memory footprint when using a quantized version, and Unsloth does this very well: https://huggingface.co/unsloth
I found [GH Issue 376](https://github.com/elixir-nx/bumblebee/issues/376), but it doesn't fully answer whether this is possible or what the obstacle is. Could Bumblebee support this?
Currently I'm using only official repos such as Llama 3.2, but it's hard to fit more than one model on a single GPU. Still, I love using Elixir to interact with models over LiveView without coupling to Python.
