Misc. bug: Vulkan performance depends on thread priority #12976
Comments
In the KoboldCpp userbase we have seen this across all backends, CUDA included, on systems that have e-cores. The thread priority change prevents the e-cores from taking the lead, which results in the significant speedup. |
Nothing interesting here. No e-cores on either system either. Linux / AMD 6900HX / AMD RX 6600M
Windows 10 / AMD 1700X / Nvidia GTX 1080
Windows 10 - 22H2 19045.5371
|
I've tried setting processor affinity in task manager (which doesn't say which cores are which) and wasn't able to see a clear correlation to a specific set of cores. But my CPU does have e-cores (i9 14900k). |
Well, you could disable the e-cores in BIOS and repeat the test to gather more data. |
Also no difference for me on AMD EPYC 7302, RTX 3090, Ubuntu 24.04 Server. My guess would be either e-core scheduling or Windows background tasks.
|
Good idea. The e-cores appear to be partially to blame, but not entirely. With e-cores disabled in the BIOS:
|
I'm not seeing any differences either on my computer with a Xeon 1240 v2, RX470, and Ubuntu 24.
You can also try to lock the clock speed on the CPU just in case the different scheduling modes affect how it does power saving and frequency scaling. |
You could try turning off ACPI in BIOS as the next step. Your machine might run hot, but for a test, it should be OK. |
My BIOS doesn't have an option to entirely disable ACPI, so I tried various CPU enablement/power management options instead. I think I had e-cores disabled for all of these experiments. These two options both gave full performance with default priority:
These options did not restore performance:
I'm not sure what to conclude from this. Turbo boost is pretty different from C-states, though maybe the boosted cores are less likely to go into a C-state or something like that. |
Folks, can you give #12995 a shot on your setups? |
Thanks. It doesn't work for me as-is because the SetThreadInformation call isn't executed - I assume that's just for the thread pool? I tried the same call from ggml_vk_instance_init and it does restore the performance. Another thing I've tried that restores performance is to temporarily bump the thread priority while waiting on the fence (master...jeffbolznv:llama.cpp:thread_priority). Application developers often don't like libraries to override thread/process state, so this has the benefit of being transient and shouldn't impact scheduling in a meaningful way. |
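A minimal sketch of the transient approach described above, not the code from the linked branch: a scope guard raises the calling thread's priority just before a blocking wait and restores the previous value afterwards. The names `ScopedThreadPriorityBoost` and `wait_with_boost` are hypothetical, and the choice of `THREAD_PRIORITY_ABOVE_NORMAL` is an assumption. On non-Windows builds the guard compiles to a no-op.

```cpp
#include <utility>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif

// Hypothetical helper (not part of ggml): raises the calling thread's priority
// for the lifetime of the object and restores the previous value on destruction.
class ScopedThreadPriorityBoost {
public:
    ScopedThreadPriorityBoost() {
#ifdef _WIN32
        thread_   = GetCurrentThread();
        previous_ = GetThreadPriority(thread_);
        // Only boost if the query succeeded and the thread is not already
        // running at or above the target priority (assumed: ABOVE_NORMAL).
        if (previous_ != THREAD_PRIORITY_ERROR_RETURN &&
            previous_ < THREAD_PRIORITY_ABOVE_NORMAL) {
            boosted_ = SetThreadPriority(thread_, THREAD_PRIORITY_ABOVE_NORMAL) != 0;
        }
#endif
    }

    ~ScopedThreadPriorityBoost() {
#ifdef _WIN32
        if (boosted_) {
            SetThreadPriority(thread_, previous_);  // undo the transient boost
        }
#endif
    }

    ScopedThreadPriorityBoost(const ScopedThreadPriorityBoost &) = delete;
    ScopedThreadPriorityBoost & operator=(const ScopedThreadPriorityBoost &) = delete;

    // True only if the priority was actually changed.
    bool active() const { return boosted_; }

private:
    bool boosted_ = false;
#ifdef _WIN32
    HANDLE thread_   = nullptr;
    int    previous_ = 0;
#endif
};

// Runs a blocking wait (for example a lambda that waits on a fence) with the
// boost held only for the duration of the call.
template <typename WaitFn>
auto wait_with_boost(WaitFn && wait) -> decltype(wait()) {
    ScopedThreadPriorityBoost boost;
    return std::forward<WaitFn>(wait)();
}
```

A caller would wrap only the wait itself, for example `wait_with_boost([&] { return wait_for_fence(); })`, where `wait_for_fence` stands in for whatever blocking call the backend makes. Because the guard restores the previous priority when the call returns, the application's own thread settings are left as they were.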
Yep. Threadpool and OMP. I keep forgetting that we don't use it in other backends.
Perfect!
Yep, that was the "hack" I've been using for Windows on ARM64 benchmarks as well. -- |
IIUC, the idea of doing this is to make it slightly more convenient for the user of |
Name and Version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
version: 5143 (b43d89e)
built with MSVC 19.35.32217.1 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
Problem description & steps to reproduce
I've noticed recently that ggml-vulkan performance depends more on thread/process priority than expected. For example, comparing normal priority to above_normal:
Performance is also noticeably more variable with default priority. I think this is related to CPU latency after waiting on a fence, and I thought I had improved this with #12630, but it seems to be back and I don't understand why. I don't think it's related to driver version. I kind of suspect an OS update, but I'm not sure.
I'd like to crowdsource some data on what systems this affects. If folks could please try these or similar command lines, and report CPU, GPU, driver version, and OS version (for Windows please run winver and report the OS build number), I'd appreciate it. These results are on Core i9-14900K, RTX 4070, driver 576.02, Windows 11 24H2 OS Build 26100.3775.
First Bad Commit
No response
Relevant log output