-
mul_mat_q is enabled by compiling with cuBLAS and passing the --mul-mat-q flag on the CLI. In the latest llama.cpp versions (not yet merged) mul_mat_q is the default, so the flag no longer works. And yes, it's faster and saves quite a lot of VRAM.
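For reference, a minimal sketch of what that looks like, assuming the older Makefile-based build and a version where mul_mat_q is not yet the default (the model path and layer count are placeholders):

```sh
# Build llama.cpp with cuBLAS support (Makefile-based build)
make clean && LLAMA_CUBLAS=1 make

# Run with the mul_mat_q kernels enabled explicitly;
# -ngl offloads layers to the GPU, --mul-mat-q selects the quantized matmul kernels
./main -m models/your-model.gguf -ngl 35 --mul-mat-q -p "Hello"
```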
-
I am compiling per the README with cuBLAS, but would like to try the mul_mat_q kernels to compare speeds. From what I gather, are these kernels implemented using OpenBLAS?
Does this mean I have to compile a separate llama-cpp-python for each backend and uninstall them in between, or can I compile one build with both cuBLAS and OpenBLAS?
Will the mul_mat_q flag also work with a cuBLAS-only build?
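In case it helps, a minimal sketch of rebuilding llama-cpp-python against cuBLAS, assuming a version where the LLAMA_CUBLAS CMake flag is used (the exact flag name may vary by version):

```sh
# Reinstall llama-cpp-python, forcing a from-source rebuild with cuBLAS enabled
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```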