Open
Description
Bumblebee supports only a hardcoded set of Hugging Face models. I found that, for example, Llama 3.2 might run about 2x faster with a 60% smaller memory footprint when using a quantized version, and Unsloth does this very well: https://huggingface.co/unsloth
I found GH issue 376, but it doesn't fully answer whether this is possible or what the blocker is. Could Bumblebee support these quantized models?
Currently I'm using only official repos such as Llama 3.2, but it's hard to fit more than one model on a single GPU. I still love using Elixir to interact with models over LiveView without coupling to Python.