
Eval bug: Quad P40 unable to run 70B models on recent releases #12990


Open
FullstackSensei opened this issue Apr 17, 2025 · 4 comments

@FullstackSensei

FullstackSensei commented Apr 17, 2025

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5145 (12b17501)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Quad Nvidia Tesla P40 on dual Xeon E5-2699v4 (two cards per CPU)

Models

Llama-3.3-70B-Instruct-GGUF
Qwen2.5-72B-Instruct-GGUF
gemma-3-27b-it-Q8_0.gguf
QwQ-32B-Q8_0.gguf

Problem description & steps to reproduce

I updated and built llama.cpp after sticking with the same version for a couple of months, and since then I can no longer run Llama 3.3 70B or Qwen 2.5 72B. llama-server also fails to generate output after starting with smaller models such as Gemma 3 27B, Mistral Small 24B, Qwen 2.5 Coder 32B, and QwQ 32B when they are split across only two cards.

If I run the 27-32B models on CUDA0 and CUDA1 they invariably fail, but generation works (mostly) fine with the following combinations:
CUDA0,CUDA2
CUDA0,CUDA3
CUDA1,CUDA2
CUDA1,CUDA3
CUDA2,CUDA3

When this happens, nvtop shows GPU load on only one GPU for the smaller models that I configure to run on two GPUs, and on only two GPUs for the larger models that are configured to run on all four.

The worst part is that once this happens, llama.cpp is unable to initialize CUDA devices until I reboot the server. If I run llama-cli after this happens, I get the following:

llama-cli
ggml_cuda_init: failed to initialize CUDA: unknown error
build: 5145 (12b17501) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
gguf_init_from_file: failed to open GGUF file 'models/7B/ggml-model-f16.gguf'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/7B/ggml-model-f16.gguf

llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/7B/ggml-model-f16.gguf'
main: error: unable to load model

Meanwhile, nvidia-smi and nvtop continue to work normally when this happens, even without a reboot.

I don't remember the exact version I was running before, so I checked out b4686 from February (I think I was on b45xx) and recompiled, and indeed 70B models work without issue. I deleted the build directory, then configured and built again. To confirm, I ran llama-cli after building:

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4686 (7b891bdc)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

I ran the same llama-server command:

llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf  \
-fa -sm row --no-mmap \
-ngl 99 -ngld 99 --port 9002 -c 10000 \
--device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1 \
--slots --metrics --numa distribute -t 40

and generation worked fine.

I checked out b5145 (I've been trying since b5131), recompiled as described below, confirmed the version with llama-cli --version, and ran Llama 3.3 70B using the same command above. In the time it took me to type all this, this is all the output I got from llama-server:

",H@2C%#6H<+$D+A'FD8CG1F8#.H7)'%8#<H(#9'#.)A932+C7%/4==E$3/C".5;33

Compile

cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS="-O3 -flto" -DCMAKE_C_FLAGS="-O3 -flto"
cmake --build build --config Release -j 80

First Bad Commit

No response

Relevant log output

Sometimes I get one of the error messages shown below, other times there are no error messages, but to be honest I'm not keeping track, and the errors (or lack thereof) could be related to whichever version I've been using since b5131. I have tried at least two tags a day for the past 3 days.

Sometimes, I get the following:
~/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
  current device: 0, in function launch_fattn at /home/ali/llama.cpp/ggml/src/ggml-cuda/template-instances/../fattn-common.cuh:870
  cudaGetLastError()

Other times, I get the following error:

/home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
  current device: 0, in function alloc at /home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:472
  cuMemSetAccess((CUdeviceptr)((char *)(pool_addr) + pool_size), reserve_size, &access, 1)

@FullstackSensei changed the title from "Eval bug: Unable to run Llama 3. 70B or Nemotron 3.1 70B on recent releases" to "Eval bug: Quad P40 unable to run 70B models on recent releases" on Apr 17, 2025
@segmond

segmond commented Apr 17, 2025

Did you git fetch/pull before rebuilding? If so, I'd encourage you to delete the directory and do a fresh clone from GitHub. If you keep having the issue, try disabling fa and sm row to see if one of those options is triggering it. Does a smaller model like an 8B Llama cause the same issue? If so, I can try it later tonight when I get home; I have 3 P40s. If it keeps breaking, then try to bisect where the bug came in.
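
To rule out those two options, something like this should do it, i.e. the server command from your report with -fa and -sm row dropped and everything else unchanged:

llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf \
--no-mmap -ngl 99 -ngld 99 --port 9002 -c 10000 \
--device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1 \
--slots --metrics --numa distribute -t 40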

@FullstackSensei
Author

FullstackSensei commented Apr 17, 2025

I spent several hours trying to narrow it down this morning. I tried several tags, always doing a git reset --hard before checking out a tag. The following tests were done on b5146 after shutting down and powering on the server to make sure nothing was lingering in memory. I installed Nvidia DCGM and ran dcgmi diag -r 4, and all tests passed without issue (including stress testing VRAM).

Switched from llama-server to llama-cli to test things a bit faster, stopped installing the built binaries, and even deleted libllama.so from . All testing done today was run straight from /build-tag/bin.

Haven't tried with 8B, but tested Gemma-3-27B-Q8, Qwen-2.5-Coder-32B-Q8, and QwQ-32B-Q8, each split across all combinations of two and three cards (including permutations of which device comes first); a scripted version of this sweep is sketched after the list:

  • CUDA0,CUDA1 doesn't work.
  • CUDA1,CUDA2 works.
  • CUDA0,CUDA2 works.
  • CUDA0,CUDA3 works.
  • CUDA1,CUDA2 works.
  • CUDA1,CUDA3 works.
  • CUDA0,CUDA1,CUDA2 doesn't work.
  • CUDA1,CUDA2,CUDA3 works.
  • CUDA0,CUDA1,CUDA2,CUDA3 doesn't work.
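
For reference, a rough way to script the sweep above (just a sketch, not exactly what I ran; the model path and flags are the ones from the llama-cli command quoted further down, and --tensor-split is left out so llama.cpp falls back to its default split):

MODEL=/models/Qwen2.5-Coder-32B-Instruct-128K-GGUF/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
for devices in CUDA0,CUDA1 CUDA0,CUDA2 CUDA0,CUDA3 CUDA1,CUDA2 CUDA1,CUDA3 CUDA0,CUDA1,CUDA2 CUDA1,CUDA2,CUDA3 CUDA0,CUDA1,CUDA2,CUDA3; do
  echo "=== $devices ==="
  ./llama-cli -m "$MODEL" -fa -sm row --no-mmap -ngl 99 -c 1000 \
    --device "$devices" --numa distribute -t 40 --no-warmup \
    -n 64 -p "you are a helpful assistant"
done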

I don't know if the shutdown or updating to b5146 changed something, but these results are very repeatable. I do not get any error messages with llama-cli as I did with llama-server, but I also haven't had to restart the server once due to CUDA initialization errors.

Checked the device tree, and CUDA0 and CUDA1 are on one socket, and CUDA2 and CUDA3 are on the other socket.
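
For anyone wanting to double-check the same mapping, the per-GPU CPU/NUMA affinity can be shown with something like the following (exact column layout depends on the driver version):

nvidia-smi topo -m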

This is the llama-cli command I'm running, changing only --device and --tensor-split (always setting used devices to 1 and unused to 0) based on the sequences described above.

./llama-cli -m /models/Qwen2.5-Coder-32B-Instruct-128K-GGUF/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf \
-fa -sm row --no-mmap -ngl 99 -c 1000 \
--device CUDA0,CUDA1,CUDA3 --tensor-split 1,1,1,0 \
--numa distribute -t 40 --no-warmup -p "you are a helpful assistant"

I'll grab a fresh copy of the source in a new directory tonight and repeat my tests. In the meantime, please let me know if there's anything more specific I could help with. Really appreciate the help!!!

@JohannesGaessler
Collaborator

Please do a git bisect and identify the exact commit that introduced the problem.
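
For reference, a minimal bisect between the two tags already mentioned in this thread would look roughly like this (rebuild and rerun the failing command at each step, using the build commands quoted earlier):

git bisect start
git bisect bad b5131        # earliest tag reported failing in this thread
git bisect good b4686       # tag confirmed working above
# git now checks out a commit in between: rebuild, rerun the failing
# llama-cli/llama-server command, then mark the result with one of:
git bisect good
git bisect bad
# repeat until git prints the first bad commit, then clean up with:
git bisect reset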

@FullstackSensei
Author

@JohannesGaessler Thanks for mentioning git bisect. I didn't know this existed and will definitely use it for work going forward.

I was doing a manual binary search this morning, but the process was quite tedious, as it often required restarting the server because I get "ggml_cuda_init: failed to initialize CUDA: unknown error" once this happens. I can prevent that if I hit Ctrl-C quickly when I see inference is not working correctly (only one GPU spikes in load in nvtop). I'm not sure how I'd even detect this in an automated way :\
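
The closest I can come up with would be a wrapper for git bisect run along these lines; completely untested, and the script name, model path, timeout, and failure heuristic are placeholders that would need adjusting:

#!/usr/bin/env bash
# untested sketch, e.g. saved as bisect-check.sh and invoked with: git bisect run ./bisect-check.sh
# exit codes for git bisect run: 0 = good commit, 1 = bad commit, 125 = skip (build failed)
set -u

# configure and build (flags as in the Compile section of the report)
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS="-O3 -flto" -DCMAKE_C_FLAGS="-O3 -flto" > /dev/null || exit 125
cmake --build build --config Release -j 80 > /dev/null || exit 125

# short generation on a device combo that reproduces the problem; placeholder model path
OUT=$(timeout 300 ./build/bin/llama-cli \
    -m /models/Qwen2.5-Coder-32B-Instruct-128K-GGUF/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf \
    -fa -sm row --no-mmap -ngl 99 -c 1000 \
    --device CUDA0,CUDA1 --numa distribute -t 40 --no-warmup \
    -n 32 -p "you are a helpful assistant" 2>&1)
RC=$?

# a hang hits the timeout (non-zero exit), a crash prints "CUDA error";
# pure gibberish output would still need a smarter check (e.g. grep for an expected word)
if [ "$RC" -ne 0 ] || printf '%s\n' "$OUT" | grep -q "CUDA error"; then
    exit 1
fi
exit 0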
