Support GGUF quantized models #413

Open
@bartekupartek

Description

Bumblebee supports only hardcoded Hugging Face models. I found that, for example, Llama 3.2 can run about 2x faster with a roughly 60% smaller memory footprint when using a quantized version, and unsloth does this very well: https://huggingface.co/unsloth
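
For reference, here's the rough math behind that footprint claim. These are assumed numbers (a 3B-parameter model and a Q4_K-style GGUF quant at ~4.5 bits/weight), not measurements; the actual savings depend on the quant level:

```elixir
# Back-of-the-envelope weight-memory estimate (assumed parameter count and
# bits-per-weight; GGUF files also carry metadata and per-block scales).
params = 3_000_000_000
f16_gb = params * 2 / 1.0e9       # 16-bit weights -> ~6.0 GB
q4_gb = params * 4.5 / 8 / 1.0e9  # ~4.5 bits/weight -> ~1.7 GB, roughly 70% smaller
IO.puts("f16: ~#{Float.round(f16_gb, 1)} GB, Q4: ~#{Float.round(q4_gb, 1)} GB")
```
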
I found GH issue #376, but it doesn't fully answer whether this is possible or what the blocker is, so could Bumblebee support it?
Currently I'm using only official repos like Llama 3.2, but it's hard to fit more than one model on a single GPU. I still love using Elixir to interact with models over LiveView without coupling to Python.
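
For context, this is the workflow I mean, a minimal sketch of the standard Bumblebee loading path, assuming an Nx backend like EXLA is configured. The repo name is one of the official gated Llama 3.2 checkpoints, so an HF auth token is needed:

```elixir
# Standard Bumblebee loading path (full-precision weights), which is what
# makes fitting more than one model on a single GPU hard.
repo = {:hf, "meta-llama/Llama-3.2-1B-Instruct", auth_token: System.fetch_env!("HF_TOKEN")}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Build a text-generation serving and run a prompt through it.
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
Nx.Serving.run(serving, "Hello from Elixir!")
```
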
