Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed #7595


Merged
3 commits merged into gfx-rs:trunk on Apr 30, 2025

Conversation


@robamler robamler commented Apr 22, 2025

Connections

Description

Ensures that the new naga optimizations added in #7574 can be used by wgpu-core on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT, which is available on Adapters that support the specialized implementations for dot4I8Packed and dot4U8Packed implemented in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with device extension "VK_KHR_shader_integer_dot_product").

If this feature is available on an Adapter, it can be requested during Device creation; the device is then set up so that any occurrence of dot4I8Packed and dot4U8Packed compiles to the respective specialized instruction. On a Vulkan Device, this means that the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available. (On DX12, requesting the feature changes nothing, since wgpu-hal appears to already use the highest Shader Model supported by the DX12 library, and availability of the new feature already guarantees that the optimizations will be emitted.)

I'm not sure if this is the best approach to expose an optimization that is only available for some SPIR-V language versions, and I welcome feedback.
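For context, the semantics of the two WGSL built-ins in question can be modeled in plain Rust (a reference sketch of my reading of the WGSL spec, not wgpu or naga code; the function names mirror the built-ins):

```rust
// Reference model of the WGSL built-ins `dot4U8Packed` and `dot4I8Packed`.
// Each u32 argument packs four 8-bit lanes; the result is the lane-wise dot
// product, treating lanes as unsigned or signed bytes respectively. This is
// the operation that the specialized DX12 (SM >= 6.4) and SPIR-V
// (VK_KHR_shader_integer_dot_product) instructions implement natively.

fn dot4_u8_packed(a: u32, b: u32) -> u32 {
    (0..4)
        .map(|i| {
            let x = (a >> (8 * i)) & 0xff;
            let y = (b >> (8 * i)) & 0xff;
            x * y
        })
        .sum()
}

fn dot4_i8_packed(a: u32, b: u32) -> i32 {
    (0..4)
        .map(|i| {
            // Reinterpret each 8-bit lane as a signed byte before widening.
            let x = ((a >> (8 * i)) & 0xff) as u8 as i8 as i32;
            let y = ((b >> (8 * i)) & 0xff) as u8 as i8 as i32;
            x * y
        })
        .sum()
}

fn main() {
    // 0x01020304 packs the lanes [4, 3, 2, 1] (least significant byte first).
    assert_eq!(dot4_u8_packed(0x01020304, 0x01010101), 4 + 3 + 2 + 1);
    // Lane 0xFF is -1 when interpreted as a signed byte.
    assert_eq!(dot4_i8_packed(0x000000FF, 0x00000002), -2);
}
```

The hardware instructions fuse the four multiplies and the accumulation into a single operation, which is where the speedup over naga's polyfill comes from.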

Testing

I'm not sure where and how to add a test for this. Probably somewhere in wgpu-hal? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature NATIVE_PACKED_INTEGER_DOT_PRODUCT is requested for the Device (these tests also confirm that the optimization in #7574 increases performance).

Squash or Rebase?

Single commit

Open Questions

  • Where and how to add tests?
  • What should we do on WebGPU?
    • Should the adapter detect somehow (how?) whether the WebGPU runtime supports the "packed_4x8_integer_dot_product" extension? If it does, we could make the new feature NATIVE_PACKED_INTEGER_DOT_PRODUCT available on the Adapter and, if it is requested for a Device, translate dot4I8Packed and dot4U8Packed literally (and emit a polyfill if the feature is not requested for the Device).
    • I'm not sure how to do this and would welcome advice.

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@robamler robamler requested a review from a team as a code owner April 22, 2025 16:47
@robamler robamler changed the title Use packed vector format Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed Apr 22, 2025
@robamler robamler force-pushed the use-packed-vector-format branch from 8594f8f to 33ed6fd Compare April 22, 2025 16:54

robamler commented Apr 22, 2025

In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low precision integer matrix-vector multiplications, simulating the main bottleneck in on-device inference in large language models).

The table shows throughput in G MAC/s (10⁹ multiply-accumulate operations per second ± standard error; higher is better).

|                                   | Bench 1: uncompressed matrices | Bench 2: compressed matrices |
| --------------------------------- | ------------------------------ | ---------------------------- |
| optimizations in #7574 turned on  | 13.94 ± 0.05                   | 15.99 ± 0.21                 |
| optimizations in #7574 turned off | 13.15 ± 0.13                   | 13.43 ± 0.36                 |

More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of pack4xU8 and unpack4xU8, which I think aren't optimized yet in naga. I'll look into these when I find some time.
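For reference, the semantics of those two built-ins can also be modeled in plain Rust (again a sketch of my reading of the WGSL spec, not naga's implementation):

```rust
// Reference model of WGSL's `pack4xU8` / `unpack4xU8`: `pack4xU8` keeps the
// low 8 bits of each of four u32 lanes and packs them into one u32
// (least significant byte first); `unpack4xU8` reverses this.

fn pack4x_u8(v: [u32; 4]) -> u32 {
    v.iter()
        .enumerate()
        .fold(0u32, |acc, (i, &lane)| acc | ((lane & 0xff) << (8 * i)))
}

fn unpack4x_u8(p: u32) -> [u32; 4] {
    [0, 1, 2, 3].map(|i| (p >> (8 * i)) & 0xff)
}

fn main() {
    let packed = pack4x_u8([4, 3, 2, 1]);
    assert_eq!(packed, 0x01020304);
    // The round trip recovers the low byte of each lane.
    assert_eq!(unpack4x_u8(packed), [4, 3, 2, 1]);
}
```

Without a native instruction, each of these costs several shifts and masks per call, which adds up in inner loops over packed matrices.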

robamler added a commit to robamler/wgpu that referenced this pull request Apr 22, 2025
@robamler
Copy link
Contributor Author

robamler commented Apr 24, 2025

An alternative could involve adding the following field

  shader_integer_dot_product:
      Option<vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR<'static>>,

to wgpu-hal::vulkan::adapter::PhysicalDeviceFeatures. But I'm not sure how to communicate this property to naga. Also, I'm not sure how to initialize the field:

// in `PhysicalDeviceFeatures::from_extensions_and_requested_features`:
Self {
    // ...
    shader_integer_dot_product: if enabled_extensions
        .contains(&khr::shader_integer_dot_product::NAME)
    {
        Some(
            vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR::default()
                .shader_integer_dot_product(requested_features.intersects(
                    // Not sure what to put here. Always enable it just in case we end up using it? Or
                    // use something like `wgt::Features::NATIVE_PACKED_INTEGER_DOT_PRODUCT` after all?
                    todo!(),
                )),
        )
    } else {
        None
    },
}

The general issue here seems to be that, if I understand the code correctly (I might be wrong), naga currently lists all extensions that it needs, and wgpu-hal fails if any of them is unavailable. So naga can tell wgpu-hal that it needs something, but I don't see how naga could express that it would like to have something (with a graceful fallback if it's unavailable). Is there any way to communicate in the opposite direction, i.e., to let wgpu-hal tell naga that a potentially helpful (but not necessarily critical) extension is available?


teoxoy commented Apr 25, 2025

Is there any way to communicate in the opposite direction, i.e., to let wgpu-hal tell naga that a potentially helpful (but not necessarily critical) extension is available?

This part of the Vulkan backend is somewhat complicated but you can see an example here:

// Optional `VK_KHR_driver_properties`
if self.supports_extension(khr::driver_properties::NAME) {
    extensions.push(khr::driver_properties::NAME);
}

Then, related to my other comment #7574 (comment), wgpu-hal would add the capabilities if either the extension is available or the device supports Vulkan 1.3/SPIR-V 1.6.

For pointers on adding the extension boilerplate I'd look at other extensions (via find-all-references) to get an idea of what needs to be done.
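The "extension or core version" gating described above can be sketched like this (hypothetical types and names for illustration; the real wgpu-hal plumbing differs):

```rust
// Hypothetical sketch of how a backend could decide which optional shader
// capabilities to advertise to the shader compiler: a capability counts as
// available if the device exposes the extension *or* its API version makes
// the feature core (VK_KHR_shader_integer_dot_product was promoted to core
// in Vulkan 1.3). Names here are illustrative, not wgpu-hal's actual API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Capability {
    DotProduct,
    DotProductInput4x8BitPacked,
}

struct DeviceInfo {
    api_version: (u32, u32),
    extensions: Vec<&'static str>,
}

fn available_capabilities(dev: &DeviceInfo) -> Vec<Capability> {
    let mut caps = Vec::new();
    if dev
        .extensions
        .contains(&"VK_KHR_shader_integer_dot_product")
        || dev.api_version >= (1, 3)
    {
        caps.push(Capability::DotProduct);
        caps.push(Capability::DotProductInput4x8BitPacked);
    }
    caps
}

fn main() {
    let old_dev = DeviceInfo { api_version: (1, 2), extensions: vec![] };
    assert!(available_capabilities(&old_dev).is_empty());

    let new_dev = DeviceInfo { api_version: (1, 3), extensions: vec![] };
    assert_eq!(available_capabilities(&new_dev).len(), 2);
}
```

The shader compiler then emits the specialized instruction only when the corresponding capability is in the advertised set, falling back to a polyfill otherwise.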

@robamler robamler force-pushed the use-packed-vector-format branch from f3ddc4b to 3060c89 Compare April 27, 2025 20:47

robamler commented Apr 27, 2025

Thanks, this makes it much easier. Now the optimization is automatically enabled where available, without the user having to explicitly request it.

I'm still unsure how to add a test for this (but I verified again in a debugger that the optimized code does get generated on my machine).

I separated the introduction of a new function remove_variant into the last commit because I'm not sure about it. I think it is correct, i.e., one should strip the "variant" bits from the Vulkan API version before comparing it to predefined version constants, but this is arguably an unrelated issue (the code contains lots of existing comparisons to predefined version constants that I didn't change).
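Per the Vulkan spec's version encoding (variant in the top 3 bits, then 7 bits major, 10 bits minor, 12 bits patch), such a remove_variant could look like this (a sketch following the spec's VK_MAKE_API_VERSION macro, not necessarily this PR's code):

```rust
// Vulkan packs version numbers as:
//   variant (bits 31..29) | major (28..22) | minor (21..12) | patch (11..0)
// The predefined constants like VK_API_VERSION_1_3 have variant == 0, so a
// device-reported version must have its variant bits cleared before being
// compared against them; otherwise any non-zero variant decodes as a very
// high version.

fn remove_variant(version: u32) -> u32 {
    version & 0x1FFF_FFFF // clear the top 3 variant bits
}

fn make_api_version(variant: u32, major: u32, minor: u32, patch: u32) -> u32 {
    (variant << 29) | (major << 22) | (minor << 12) | patch
}

fn main() {
    let vk_1_3 = make_api_version(0, 1, 3, 0);
    // A hypothetical variant-1 implementation reporting "1.3.0":
    let variant_version = make_api_version(1, 1, 3, 0);

    // Without stripping the variant, the raw comparison is wildly wrong:
    assert!(variant_version > make_api_version(0, 99, 0, 0));
    // After stripping, it compares as expected:
    assert_eq!(remove_variant(variant_version), vk_1_3);
}
```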

@robamler robamler requested a review from teoxoy April 27, 2025 21:01
@robamler robamler force-pushed the use-packed-vector-format branch from 3060c89 to ae3159e Compare April 27, 2025 23:04

teoxoy commented Apr 28, 2025

I'm still unsure how to add a test for this (but I verified again in a debugger that the optimized code does get generated on my machine).

I would say the snapshot tests you added in #7574 are enough; we don't currently have a way to test that shaders generated by naga with params from wgpu-hal have a specific form.

I separated the introduction of a new function remove_variant into the last commit because I'm not sure about it. I think it is correct, i.e., one should strip the "variant" bits from the Vulkan API version before comparing it to predefined version constants, but this is arguably an unrelated issue (the code contains lots of existing comparisons to predefined version constants that I didn't change).

I wasn't aware there was a variant part in the version.

The variant indicates the variant of the Vulkan API supported by the implementation. This is always 0 for the Vulkan API.

A non-zero variant indicates the API is a variant of the Vulkan API and applications will typically need to be modified to run against it. The variant field was a later addition to the version number, added in version 1.2.175 of the Specification. As Vulkan uses variant 0, this change is fully backwards compatible with the previous version number format for Vulkan implementations. New version number macros have been added for this change and the old macros deprecated. For existing applications using the older format and macros, an implementation with non-zero variant will decode as a very high Vulkan version. The high version number should be detectable by applications performing suitable version checking.

Given that "This is always 0 for the Vulkan API." and "applications will typically need to be modified to run against it" this sounds like the device supports a modified version of the Vulkan API. Without knowing what the differences between other variants of the API and the official API are, there isn't a way for us to target those variants.

I think we should always assume the variant is 0 within wgpu-hal's backend. It would be nice to detect if the variant is anything other than 0 at instance and device creation and not expose the instance/device.

@robamler robamler force-pushed the use-packed-vector-format branch from ae3159e to b229326 Compare April 29, 2025 22:01
robamler (Contributor Author) commented:

I think we should always assume the variant is 0 within wgpu-hal's backend.

I removed the commit that messed with variant bits.

@robamler robamler requested a review from teoxoy April 29, 2025 22:21
@teoxoy teoxoy left a comment

Looks great, thank you!


teoxoy commented Apr 30, 2025

Could you rebase the PR, just to make sure there aren't any differences between what landed in #7574 and the commits in this PR?

@robamler robamler force-pushed the use-packed-vector-format branch from b576898 to fc96ad5 Compare April 30, 2025 09:30
@robamler robamler force-pushed the use-packed-vector-format branch from fc96ad5 to 9c1a8fa Compare April 30, 2025 09:32
robamler (Contributor Author) commented:

Done. Thank you very much for your patience with me on this one!

CI currently keeps failing at different places, but I think these are unrelated network issues (first try, second try).

@teoxoy teoxoy enabled auto-merge (squash) April 30, 2025 10:55
@teoxoy teoxoy merged commit 0d06284 into gfx-rs:trunk Apr 30, 2025
65 of 66 checks passed
@robamler robamler deleted the use-packed-vector-format branch April 30, 2025 11:10