Allow `wgpu-core` to use new naga optimizations for `dot4{I, U}8Packed` #7595
Conversation
Force-pushed from 8594f8f to 33ed6fd
In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low-precision integer matrix-vector multiplications, simulating the main bottleneck of on-device inference in large language models). The table shows throughput in G MAC/s (10^9 multiply-accumulate operations per second, ± standard error; higher is better).
More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of […]
An alternative could involve adding the following field to `PhysicalDeviceFeatures`:

```rust
shader_integer_dot_product:
    Option<vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR<'static>>,
```

and initializing it in `PhysicalDeviceFeatures::from_extensions_and_requested_features`:

```rust
Self {
    // ...
    shader_integer_dot_product: if enabled_extensions
        .contains(&khr::shader_integer_dot_product::NAME)
    {
        Some(
            vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR::default()
                .shader_integer_dot_product(requested_features.intersects(
                    // Not sure what to put here. Always enable it just in case
                    // we end up using it? Or use something like
                    // `wgt::Features::NATIVE_PACKED_INTEGER_DOT_PRODUCT` after all?
                    todo!(),
                )),
        )
    } else {
        None
    },
}
```

The general issue here seems to be that, if I understand the code correctly (but I might be wrong), it's currently set up so that naga lists all extensions that it needs, and then wgpu-hal fails if any of these extensions aren't available. So naga can communicate to wgpu-hal that it needs something, but I'm not sure how to communicate from naga to wgpu-hal that it would like to have something (but can still fall back gracefully if it's not available). Is there any way to communicate in the opposite direction, i.e., to let wgpu-hal tell naga that a potentially helpful (but not necessarily critical) extension is available?
This part of the Vulkan backend is somewhat complicated, but you can see an example here: `wgpu-hal/src/vulkan/adapter.rs`, lines 963 to 966 at 6a7aa14.
Then, related to my other comment (#7574 (comment)), wgpu-hal would add the capabilities if either the extension is available or the device supports Vulkan 1.3/SPIR-V 1.6. For pointers on adding the extension boilerplate, I'd look at other extensions (via find-all-references) to get an idea of what needs to be done.
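To make that direction concrete, here is a hypothetical sketch (not code from this PR; the helper name, `enabled_extensions`, `device_api_version`, and `spv_options` are stand-ins for wgpu-hal's real plumbing) of deciding availability and handing it to naga's SPIR-V backend via the writer options:

```rust
use ash::{khr, vk};
use naga::back::spv;
use spirv::Capability;

fn enable_integer_dot_product_if_available(
    enabled_extensions: &[&std::ffi::CStr],
    device_api_version: u32,
    spv_options: &mut spv::Options,
) {
    // Usable if the extension was enabled, or the device already targets
    // Vulkan 1.3 (which implies SPIR-V 1.6, where the instructions are core).
    let available = enabled_extensions
        .contains(&khr::shader_integer_dot_product::NAME)
        || device_api_version >= vk::API_VERSION_1_3;

    if available {
        // Raise the target language version and declare the capabilities so
        // that naga may emit the specialized packed dot-product instructions.
        spv_options.lang_version = (1, 6);
        if let Some(caps) = spv_options.capabilities.as_mut() {
            caps.insert(Capability::DotProduct);
            caps.insert(Capability::DotProductInput4x8BitPacked);
        }
    }
}
```

This keeps the information flowing in the direction asked about above: wgpu-hal, which knows what the device offers, widens what naga is allowed to emit, rather than naga demanding an extension.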
Force-pushed from f3ddc4b to 3060c89
Thanks, this makes it much easier. Now the optimization is automatically enabled where available, without the user having to explicitly request it. I'm still unsure how to add a test for this (but I verified again in a debugger that the optimized code does get generated on my machine). I separated out the introduction of a new function […]
Force-pushed from 3060c89 to ae3159e
I would say the snapshot tests you added in #7574 are enough; we don't currently have a way to test that shaders generated by naga with params from wgpu-hal have a specific form.
I wasn't aware there was a variant part in the version.

Given that "This is always 0 for the Vulkan API." and "applications will typically need to be modified to run against it", this sounds like the device supports a modified version of the Vulkan API. Without knowing how the other variants of the API differ from the official one, there isn't a way for us to target them. I think we should always assume the variant is 0 within wgpu-hal's backend. It would be nice to detect a variant other than 0 at instance and device creation and not expose the instance/device.
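A minimal sketch of such a check (hypothetical; the surrounding instance/device-creation plumbing is left out), assuming ash's `vk::api_version_variant` accessor:

```rust
use ash::vk;

/// The top three bits of a packed Vulkan version number encode the variant;
/// variant 0 is the official Vulkan API.
fn is_official_vulkan(api_version: u32) -> bool {
    vk::api_version_variant(api_version) == 0
}
```

At instance and device creation, the backend could then decline to expose anything whose reported `api_version` fails this check.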
Force-pushed from ae3159e to b229326
I removed the commit that messed with the variant bits.
Looks great, thank you!
Could you rebase the PR, just to make sure there aren't any differences between what landed in #7574 and the commits in this PR?
Force-pushed from b576898 to fc96ad5
Force-pushed from fc96ad5 to 9c1a8fa
Done. Thank you very much for your patience with me on this one! CI currently keeps failing at different places, but I think these are unrelated network issues (first try, second try).
Connections

- `dot4{I, U}8Packed` on SPIR-V and HLSL: #7574

Description
Ensures that the new `naga` optimizations added in #7574 can be used by `wgpu-core` on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations for `dot4I8Packed` and `dot4U8Packed` implemented in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with the device extension "VK_KHR_shader_integer_dot_product").

If this feature is available on an `Adapter`, it can be requested during `Device` creation, and the device is then set up such that any occurrences of `dot4I8Packed` and `dot4U8Packed` will be compiled to their respective specialized instructions. This means that, on a Vulkan `Device`, the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available. (On DX12, requesting the feature doesn't change anything, since it looks like `wgpu-hal` already uses the highest Shader Model supported by the DX12 library, and availability of the new feature already guarantees that the optimizations will be emitted.)

I'm not sure if this is the best approach to expose an optimization that is only available for some SPIR-V language versions, and I welcome feedback.
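For illustration, a hedged sketch of how a user might opt in from the wgpu API, assuming the feature name added in this PR (the exact `request_device` signature varies across wgpu versions):

```rust
// Assumes an async context and an existing `adapter: wgpu::Adapter`.
async fn create_device(adapter: &wgpu::Adapter) -> (wgpu::Device, wgpu::Queue) {
    let mut required_features = wgpu::Features::empty();
    // Request the feature only where the adapter advertises it, so the same
    // code still runs (via naga's polyfill) on adapters without native support.
    if adapter
        .features()
        .contains(wgpu::Features::NATIVE_PACKED_INTEGER_DOT_PRODUCT)
    {
        required_features |= wgpu::Features::NATIVE_PACKED_INTEGER_DOT_PRODUCT;
    }

    adapter
        .request_device(&wgpu::DeviceDescriptor {
            label: Some("packed-dot-product device"),
            required_features,
            ..Default::default()
        })
        .await
        .expect("device creation failed")
}
```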
Testing

I'm not sure where and how to add a test for this. Probably somewhere in `wgpu-hal`? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature `NATIVE_PACKED_INTEGER_DOT_PRODUCT` is requested for the `Device` (these tests also confirm that the optimization in #7574 increases performance).

Squash or Rebase?
Single commit
Open Questions

- Make `NATIVE_PACKED_INTEGER_DOT_PRODUCT` available on the `Adapter` and, if it is requested for a `Device`, translate `dot4I8Packed` and `dot4U8Packed` literally (and emit a polyfill if the feature is not requested for the `Device`).
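For reference, a scalar Rust model of the `dot4I8Packed` semantics being specialized (illustrative only; conceptually, the polyfill performs the same lane extraction in the generated shader code):

```rust
/// Scalar model of dot4I8Packed: interpret each u32 as four packed i8 lanes,
/// multiply lane-wise, and accumulate into an i32.
fn dot4_i8_packed(a: u32, b: u32) -> i32 {
    (0..4).fold(0i32, |acc, lane| {
        let xa = ((a >> (8 * lane)) & 0xff) as u8 as i8 as i32;
        let xb = ((b >> (8 * lane)) & 0xff) as u8 as i8 as i32;
        acc + xa * xb
    })
}

fn main() {
    // Lane-wise: 1*5 + 2*6 + 3*7 + (-4)*8 = 5 + 12 + 21 - 32 = 6
    let a = u32::from_le_bytes([1, 2, 3, (-4i8) as u8]);
    let b = u32::from_le_bytes([5, 6, 7, 8]);
    assert_eq!(dot4_i8_packed(a, b), 6);
}
```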
Checklist

- Run `cargo fmt`.
- Run `taplo format`.
- Run `cargo clippy --tests`. If applicable, add: `--target wasm32-unknown-unknown`
- Run `cargo xtask test` to run tests.
- Add a `CHANGELOG.md` entry.