Eval bug: Quad P40 unable to run 70B models on recent releases #12990
Comments
Did you git fetch/pull before rebuilding? If so, I would encourage you to delete the directory and do a fresh clone from GitHub. If you keep having the issue, try disabling fa and sm row to see if one of those options is triggering it. Does a smaller model like an 8B Llama cause the same issue? If so, I can try it later tonight when I get home; I have 3 P40s. If it keeps breaking, then try to bisect which commit the bug came in with.
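For reference, toggling those two options on the command line looks roughly like this (a hedged sketch; the model path and prompt are placeholders, not commands from this thread):

```sh
# Run with flash attention and row split enabled:
./build/bin/llama-cli -m /models/model.gguf -ngl 99 -fa -sm row -p "test prompt" -n 32
# Re-test with both options dropped (split mode falls back to the default layer split,
# and flash attention stays off):
./build/bin/llama-cli -m /models/model.gguf -ngl 99 -p "test prompt" -n 32
```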
I spent several hours trying to narrow it down this morning. I tried several tags, always doing a git reset --hard before checking out a tag. The following tests were done on b5146 after shutting down and powering on the server to make sure nothing was lingering in memory. I installed Nvidia DCGM and ran it during testing, switched from llama-server to llama-cli to test things a bit faster, stopped installing the built binaries, and even deleted the previously installed libllama.so. All testing done today was run straight from /build-tag/bin. I haven't tried an 8B, but I tested Gemma-3-27B-Q8, Qwen-2.5-Coder-32B-Q8, and QwQ-32B-Q8, each split across all combinations of two and three cards (including permutations of which device comes first):
I don't know if the shutdown or updating to b5146 changed something, but these results are very repeatable. I do not get any error messages with llama-cli as I did with llama-server, and I also haven't had to reboot the server once due to CUDA initialization errors. I checked the device tree: CUDA0 and CUDA1 are on one socket, and CUDA2 and CUDA3 are on the other. This is the llama-cli command I'm running, changing only --device and --tensor-split (always setting used devices to 1 and unused devices to 0) based on the sequences described above.
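The exact command didn't survive in this copy of the thread; a hypothetical invocation along the lines described above (three cards active, the unused device zeroed out in --tensor-split; the model path, prompt, and token count are placeholders) might look like:

```sh
./build/bin/llama-cli -m /models/QwQ-32B-Q8_0.gguf -ngl 99 \
  --device CUDA0,CUDA1,CUDA2 --tensor-split 1,1,1,0 \
  -p "Write a haiku about GPUs." -n 64
```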
I'll grab a fresh copy of the source in a new directory tonight and repeat my tests. In the meantime, please let me know if there's anything more specific I could help with. Really appreciate the help!!!
Please do a git bisect and identify the exact commit that introduced the problem.
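For anyone following along, a manual bisect between the known-good and known-bad releases mentioned in this issue would look roughly like this (the build line is a typical CUDA configuration, not necessarily the reporter's exact one):

```sh
git bisect start
git bisect bad b5145     # a release that shows the problem
git bisect good b4686    # a release that works
# git checks out a midpoint commit; rebuild and test it, then mark the result:
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
git bisect good          # or: git bisect bad
# repeat until git reports the first bad commit, then clean up:
git bisect reset
```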
@JohannesGaessler Thanks for mentioning git bisect. I didn't know it existed and will definitely use it for work going forward. I was doing a manual binary search this morning, but the process was quite tedious, as it often required rebooting the server because I get "ggml_cuda_init: failed to initialize CUDA: unknown error" once a run goes bad. I can prevent that if I Ctrl-C quickly when I see inference is not working correctly (only one GPU spikes in load in nvtop). I wouldn't even know how to detect this in an automated way :\
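Automating the good/bad check is admittedly fragile here, since a bad run can leave the driver in a state that needs a reboot, but a rough sketch of a `git bisect run` helper is shown below. The model path, prompt, expected word, and timeout are all assumptions, not values from this thread:

```sh
#!/usr/bin/env bash
# bisect-test.sh -- exit 0 = good, 1 = bad, 125 = skip this commit (build or run failed outright).
cmake -B build -DGGML_CUDA=ON >/dev/null && cmake --build build -j >/dev/null || exit 125
out=$(timeout 600 ./build/bin/llama-cli -m /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
      -ngl 99 --tensor-split 1,1,1,1 -p "The capital of France is" -n 8 2>/dev/null) || exit 125
# A healthy build should mention the expected answer; garbage output won't.
echo "$out" | grep -qi "paris"
```

It would be invoked with `git bisect run ./bisect-test.sh`. Note that a hung run times out and is treated as "skip" here; if hangs are the main symptom, mapping a timeout to "bad" (exit 1) might be more appropriate.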
Name and Version
Operating systems
Linux
GGML backends
CUDA
Hardware
Quad Nvidia Tesla P40 on dual Xeon E5-2699v4 (two cards per CPU)
Models
Llama-3.3-70B-Instruct-GGUF
Qwen2.5-72B-Instruct-GGUF
gemma-3-27b-it-Q8_0.gguf
QwQ-32B-Q8_0.gguf
Problem description & steps to reproduce
I updated and built llama.cpp after sticking with the same version for a couple of months, and since then llama-server fails to generate output with Llama 3.3 70B or Qwen 2.5 72B split across all four cards. It also fails to generate output after starting with smaller models such as Gemma 3 27B, Mistral Small 24B, Qwen 2.5 Coder 32B, and QwQ 32B when they are split across only two cards.
If I run the 27-32B models on CUDA0 and CUDA1 they invariably fail, but generation works (mostly) fine with the following combinations (an illustrative pair of commands is sketched after the list):
CUDA0,CUDA2
CUDA0,CUDA3
CUDA1,CUDA2
CUDA1,CUDA3
CUDA2,CUDA3
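As a concrete illustration of the pattern above (model path, port, and split values are placeholders rather than the exact commands used):

```sh
# Fails: only one GPU shows load, output is garbage or absent
./build/bin/llama-server -m /models/gemma-3-27b-it-Q8_0.gguf -ngl 99 --port 8080 \
  --device CUDA0,CUDA1 --tensor-split 1,1,0,0
# Works
./build/bin/llama-server -m /models/gemma-3-27b-it-Q8_0.gguf -ngl 99 --port 8080 \
  --device CUDA0,CUDA2 --tensor-split 1,0,1,0
```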
When this happens, nvtop shows load on only one GPU for the smaller models configured to run on two GPUs, and on only two GPUs for the larger models configured to run on all four.
The worst part is that once this happens, llama.cpp is unable to initialize CUDA devices until I reboot the server. If I run llama-cli in this state, I get the following:
Meanwhile, nvidia-smi and nvtop continue to work normally in this state, without a reboot.
I don't remember the exact version I was running before, so I checked out b4686 from February (I think I was on b45xx) and recompiled, and indeed 70B models work without issue. I deleted the build directory and configured and built again. To confirm, I ran llama-cli after building:
I ran the same llama-server command:
and generation worked fine.
I checked out b5145 (I've been trying tags since b5131), recompiled as described below, confirmed the version with llama-cli --version, and ran Llama 3.3 70B using the same command as above. In the time it took me to type all this, this is all the output I got from llama-server:
",H@2C%#6H<+$D+A'FD8CG1F8#.H7)'%8#<H(#9'#.)A932+C7%/4==E$3/C".5;33
Compile
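The build invocation itself wasn't captured in this copy of the report; a typical CUDA build of llama.cpp, for reference only (an assumption, not the reporter's verbatim commands):

```sh
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```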
First Bad Commit
No response
Relevant log output
Sometimes I get one of the error messages indicated below; other times there are no error messages. To be honest, I'm not keeping track, and the presence or absence of errors could be related to whichever version I'm using (I've been on tags since b5131, trying at least two tags a day for the past 3 days).