
Misc. bug: Vulkan performance depends on thread priority #12976


Open · jeffbolznv opened this issue Apr 16, 2025 · 13 comments

jeffbolznv (Collaborator) commented Apr 16, 2025

Name and Version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
version: 5143 (b43d89e)
built with MSVC 19.35.32217.1 for x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 1

Problem description & steps to reproduce

I've noticed recently that ggml-vulkan performance depends more on thread/process priority than expected. For example, comparing normal priority to above_normal:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         95.62 ± 1.50 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         95.18 ± 1.40 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         94.45 ± 0.74 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         99.95 ± 0.08 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         99.98 ± 0.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        100.30 ± 0.14 |

Performance is also noticeably more variable at default priority. I think this is related to CPU latency after waiting on a fence; I thought I had improved this with #12630, but it seems to be back and I don't understand why. I don't think it's related to the driver version. I somewhat suspect an OS update, but I'm not sure.

I'd like to crowdsource some data on which systems this affects. If folks could please try these or similar command lines and report CPU, GPU, driver version, and OS version (on Windows, please run winver and report the OS build number), I'd appreciate it. These results are from a Core i9-14900K, RTX 4070, driver 576.02, Windows 11 24H2 (OS Build 26100.3775).

First Bad Commit

No response

Relevant log output

henk717 commented Apr 16, 2025

In the KoboldCpp userbase we have seen this across all backends, CUDA included, on systems that have e-cores. The thread priority change prevents the e-cores from taking the lead, which results in the significant speedup.

mdavey commented Apr 16, 2025

Nothing interesting here, and no e-cores on either system.

Linux / AMD 6900HX / AMD RX 6600M

./llama-bench -m ../../../llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (RADV NAVI23) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.30 ± 0.10 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.23 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.18 ± 0.02 |

build: b43d89e3 (5143)

./llama-bench -m ../../../llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (RADV NAVI23) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.01 ± 0.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.21 ± 0.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         48.32 ± 0.11 |

build: b43d89e3 (5143)

Windows 10 / AMD 1700X / Nvidia GTX 1080

Windows 10 - 22H2 19045.5371
Nvidia Driver: 560.94

.\llama-bench.exe -m .\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1080 (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.44 ± 0.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.28 ± 0.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.35 ± 0.13 |

build: b43d89e3 (5143)

.\llama-bench.exe -m .\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1080 (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.42 ± 0.08 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.20 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan,RPC |  99 |         tg128 |         40.09 ± 0.33 |

build: b43d89e3 (5143)

jeffbolznv (Collaborator, Author)

> In the KoboldCpp userbase we have seen this across all backends, CUDA included, on systems that have e-cores. The thread priority change prevents the e-cores from taking the lead, which results in the significant speedup.

I've tried setting processor affinity in task manager (which doesn't say which cores are which) and wasn't able to see a clear correlation to a specific set of cores. But my CPU does have e-cores (Core i9-14900K).
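For anyone who wants to check core types programmatically instead of guessing from task manager: on Windows, GetSystemCpuSetInformation reports an EfficiencyClass per logical processor, and on hybrid CPUs the P-cores report a higher class than the E-cores. A minimal sketch (plain Windows API, illustrative only, not llama.cpp code):

```cpp
// Enumerate CPU sets and print each core's efficiency class.
// On hybrid CPUs (e.g. i9-14900K), higher EfficiencyClass = P-core.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    ULONG len = 0;
    // First call with no buffer just reports the required length.
    GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
    std::vector<BYTE> buf(len);
    auto *info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data());
    if (!GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0))
        return 1;
    // Entries are variable-size; advance by each entry's Size field.
    for (BYTE *p = buf.data(); p < buf.data() + len;
         p += reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(p)->Size) {
        auto *e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(p);
        if (e->Type == CpuSetInformation)
            std::printf("cpu set %lu: core %u, efficiency class %u\n",
                        e->CpuSet.Id, e->CpuSet.CoreIndex,
                        e->CpuSet.EfficiencyClass);
    }
    return 0;
}
```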

acbits commented Apr 16, 2025

> > In the KoboldCpp userbase we have seen this across all backends, CUDA included, on systems that have e-cores. The thread priority change prevents the e-cores from taking the lead, which results in the significant speedup.
>
> I've tried setting processor affinity in task manager (which doesn't say which cores are which) and wasn't able to see a clear correlation to a specific set of cores. But my CPU does have e-cores (Core i9-14900K).

Well, you could disable the e-cores in BIOS and repeat the test to gather more data.

0cc4m (Collaborator) commented Apr 16, 2025

No difference for me either, on an AMD EPYC 7302, RTX 3090, Ubuntu 24.04 Server.

My guess would be either e-core scheduling or Windows background tasks.

» build_vk/bin/llama-bench -m ~/koboldcpp/models/llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        115.24 ± 2.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        114.38 ± 0.89 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        114.67 ± 1.65 |

build: 015022bb (5144)

» build_vk/bin/llama-bench -m ~/koboldcpp/models/llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        114.63 ± 2.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        114.78 ± 2.53 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        114.59 ± 3.09 |

build: 015022bb (5144)

jeffbolznv (Collaborator, Author)

> Well, you could disable the e-cores in BIOS and repeat the test to gather more data.

Good idea. The e-cores appear to be partially to blame, but not entirely. With e-cores disabled in the BIOS:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         97.55 ± 1.51 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         96.88 ± 0.66 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         98.03 ± 0.27 |

build: b43d89e3 (5143)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\llama-2-7b.Q4_0.gguf -p 0 -n 128,128,128 --prio 2
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        100.94 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        100.92 ± 0.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |        100.85 ± 0.10 |

netrunnereve (Collaborator) commented Apr 16, 2025

I'm not seeing any difference either on my computer with a Xeon 1240 v2, RX 470, and Ubuntu 24.

> Good idea. The e-cores appear to be partially to blame, but not entirely.

You can also try to lock the clock speed on the CPU just in case the different scheduling modes affect how it does power saving and frequency scaling.

acbits commented Apr 16, 2025

> > Well, you could disable the e-cores in BIOS and repeat the test to gather more data.
>
> Good idea. The e-cores appear to be partially to blame, but not entirely. With e-cores disabled in the BIOS: […]

You could try turning off ACPI in BIOS as the next step. Your machine might run hot, but for a test, it should be OK.

jeffbolznv (Collaborator, Author)

My BIOS doesn't have an option to entirely disable ACPI, so I tried various CPU enablement/power management options instead. I think I had e-cores disabled for all of these experiments.

These two options both gave full performance with default priority:

  • disable turbo boost max technology 3.0
  • disable C-states

These options did not restore performance:

  • disable "turbo mode" (looks like this disabled overclocking)
  • only enable the two CPU cores that are fused to the highest max clock
  • disable "enhanced" C-states

I'm not sure what to conclude from this; turbo boost is quite different from C-states, though maybe the boosted cores are less likely to go into a C-state or something like that.

max-krasnyansky (Collaborator)

Folks, can you give #12995 a shot on your setups?
Recent Windows on ARM64 builds have started parking (offlining) CPU cores more aggressively.
Instead of changing OS/BIOS settings, we should be able to just tell the scheduler that our threads should not be throttled.
The PR I just started might help in your case as well.
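For reference, the mechanism in question is presumably SetThreadInformation with ThreadPowerThrottling, which asks the scheduler not to power-throttle (EcoQoS) the calling thread. A minimal sketch, assuming a hypothetical helper name (the PR itself is the authoritative version):

```cpp
#include <windows.h>

// Hypothetical helper: opt the calling thread out of execution-speed
// power throttling so the scheduler keeps it on performant cores.
static void thread_disable_power_throttling() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0; // bit set in ControlMask but clear here = throttling off
    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                         &state, sizeof(state));
}
```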

jeffbolznv (Collaborator, Author)

Thanks. It doesn't work for me as-is because the SetThreadInformation call isn't executed; I assume that's just for the thread pool?

I tried the same call from ggml_vk_instance_init and it does restore the performance.

Another thing I've tried that restores performance is to temporarily bump the thread priority while waiting on the fence (master...jeffbolznv:llama.cpp:thread_priority). Application developers often don't like libraries to override thread/process state, so this has the benefit of being transient and shouldn't impact scheduling in a meaningful way.
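For illustration, the transient bump could look roughly like the following sketch (the linked branch is the authoritative version; wait_fence_boosted is a made-up helper name):

```cpp
#include <windows.h>
#include <vulkan/vulkan.h>

// Raise the calling thread's priority only for the duration of the fence
// wait, then restore it, so no app-visible thread state is left modified.
static VkResult wait_fence_boosted(VkDevice device, VkFence fence) {
    HANDLE self = GetCurrentThread();
    int prev = GetThreadPriority(self);
    SetThreadPriority(self, THREAD_PRIORITY_ABOVE_NORMAL);
    VkResult res = vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    SetThreadPriority(self, prev);
    return res;
}
```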

max-krasnyansky (Collaborator)

> Thanks. It doesn't work for me as-is because the SetThreadInformation call isn't executed; I assume that's just for the thread pool?

Yep. Threadpool and OMP. I keep forgetting that we don't use it in other backends.

> I tried the same call from ggml_vk_instance_init and it does restore the performance.

Perfect!

> Another thing I've tried that restores performance is to temporarily bump the thread priority while waiting on the fence (master...jeffbolznv:llama.cpp:thread_priority). Application developers often don't like libraries to override thread/process state, so this has the benefit of being transient and shouldn't impact scheduling in a meaningful way.

Yep, that was the "hack" I've been using for Windows on ARM64 benchmarks as well.
It's a bit less robust though. Thread Info setting is better for Windows.

--
Re: Overriding process/thread state. @slaren, @ggerganov and I discussed this a bit in the threadpool PRs.
Ideally, we should not run any graph compute processing in the main/app thread. It'd be better to always start a thread per ctx/graph, even in single-threaded mode. That way we can set up its priority/affinity/etc. and not have to worry about messing with app threads. Perhaps this is a good time to go ahead and introduce this.
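As a rough illustration of that idea (a sketch only, not an agreed design or llama.cpp code): the library owns the worker thread, so priority/affinity can be set freely without touching application threads.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Sketch: a per-context compute thread owned by the library. The app
// thread only submits work; scheduling tweaks happen on the worker.
struct compute_thread {
    std::mutex m;
    std::condition_variable cv;
    std::function<void()> job; // pending graph compute, if any
    bool stop = false;
    std::thread worker; // declared last; started after other members exist

    compute_thread() {
        worker = std::thread([this] {
            // Priority/affinity/power-throttling setup would go here,
            // on a thread the library created itself.
            for (;;) {
                std::function<void()> j;
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [this] { return stop || job; });
                    if (stop) return;
                    j = std::move(job);
                    job = nullptr;
                }
                j(); // run the graph compute
            }
        });
    }

    void submit(std::function<void()> j) {
        { std::lock_guard<std::mutex> lk(m); job = std::move(j); }
        cv.notify_one();
    }

    ~compute_thread() {
        { std::lock_guard<std::mutex> lk(m); stop = true; }
        cv.notify_one();
        worker.join();
    }
};
```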

ggerganov (Member)

> Re: Overriding process/thread state. @slaren, @ggerganov and I discussed this a bit in the threadpool PRs.
> Ideally, we should not run any graph compute processing in the main/app thread. It'd be better to always start a thread per ctx/graph, even in single-threaded mode. That way we can set up its priority/affinity/etc. and not have to worry about messing with app threads. Perhaps this is a good time to go ahead and introduce this.

IIUC, the idea of doing this is to make it slightly more convenient for the user of libllama / libggml, assuming that a separate thread doing the compute is always the better option. I still think it's better to keep both options available: with the current implementation the user can choose to either create the thread themselves or, for some reason, keep the compute in the application thread. I suppose for some resource-limited / embedded environments, the latter might have some advantages.
