Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed #7595


Merged
3 commits merged into gfx-rs:trunk on Apr 30, 2025

Conversation


@robamler robamler commented Apr 22, 2025

Connections

Description

Ensures that the new naga optimizations added in #7574 can be used by wgpu-core on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT, which is available on Adapters that support the specialized implementations for dot4I8Packed and dot4U8Packed implemented in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with device extension "VK_KHR_shader_integer_dot_product").

If this feature is available on an Adapter, it can be requested during Device creation; the device is then set up so that any occurrence of dot4I8Packed and dot4U8Packed compiles to the respective specialized instruction. On a Vulkan Device, this means that the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available. (On DX12, requesting the feature changes nothing, since wgpu-hal appears to already use the highest Shader Model supported by the DX12 library, and availability of the new feature already guarantees that the optimizations will be emitted.)

I'm not sure if this is the best approach to expose an optimization that is only available for some SPIR-V language versions, and I welcome feedback.
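For context, the semantics of the two WGSL built-ins in question can be modeled in plain Rust (a reference sketch of my reading of the WGSL spec, not wgpu or naga code; the function names mirror the built-ins):

```rust
// Reference model of the WGSL built-ins `dot4U8Packed` and `dot4I8Packed`.
// Each u32 argument packs four 8-bit lanes; the result is the lane-wise dot
// product, treating lanes as unsigned or signed bytes respectively. This is
// the operation that the specialized DX12 (SM >= 6.4) and SPIR-V
// (VK_KHR_shader_integer_dot_product) instructions implement natively.

fn dot4_u8_packed(a: u32, b: u32) -> u32 {
    (0..4)
        .map(|i| {
            let x = (a >> (8 * i)) & 0xff;
            let y = (b >> (8 * i)) & 0xff;
            x * y
        })
        .sum()
}

fn dot4_i8_packed(a: u32, b: u32) -> i32 {
    (0..4)
        .map(|i| {
            // Reinterpret each 8-bit lane as a signed byte before widening.
            let x = ((a >> (8 * i)) & 0xff) as u8 as i8 as i32;
            let y = ((b >> (8 * i)) & 0xff) as u8 as i8 as i32;
            x * y
        })
        .sum()
}

fn main() {
    // 0x01020304 packs the lanes [4, 3, 2, 1] (least significant byte first).
    assert_eq!(dot4_u8_packed(0x01020304, 0x01010101), 4 + 3 + 2 + 1);
    // Lane 0xFF is -1 when interpreted as a signed byte.
    assert_eq!(dot4_i8_packed(0x000000FF, 0x00000002), -2);
}
```

The hardware instructions fuse the four multiplies and the accumulation into a single operation, which is where the speedup over naga's polyfill comes from.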

Testing

I'm not sure where and how to add a test for this. Probably somewhere in wgpu-hal? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature NATIVE_PACKED_INTEGER_DOT_PRODUCT is requested for the Device (these tests also confirm that the optimization in #7574 increases performance).

Squash or Rebase?

Single commit

Open Questions

  • Where and how to add tests?
  • What should we do on WebGPU?
    • Should the adapter detect somehow (how?) whether the WebGPU runtime supports the "packed_4x8_integer_dot_product" extension? If it does, we could make the new feature NATIVE_PACKED_INTEGER_DOT_PRODUCT available on the Adapter and, if it is requested for a Device, translate dot4I8Packed and dot4U8Packed literally (and emit a polyfill if the feature is not requested for the Device).
    • I'm not sure how to do this and would welcome advice.

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@robamler robamler requested a review from a team as a code owner April 22, 2025 16:47
@robamler robamler changed the title Use packed vector format Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed Apr 22, 2025
@robamler robamler force-pushed the use-packed-vector-format branch from 8594f8f to 33ed6fd Compare April 22, 2025 16:54

robamler commented Apr 22, 2025

In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low precision integer matrix-vector multiplications, simulating the main bottleneck in on-device inference in large language models).

The table shows throughput in G MAC/s (10⁹ multiply-accumulate operations per second ± standard error; higher is better).

|                                   | Bench 1: uncompressed matrices | Bench 2: compressed matrices |
| --------------------------------- | ------------------------------ | ---------------------------- |
| optimizations in #7574 turned on  | 13.94 ± 0.05                   | 15.99 ± 0.21                 |
| optimizations in #7574 turned off | 13.15 ± 0.13                   | 13.43 ± 0.36                 |

More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of pack4xU8 and unpack4xU8, which I think aren't optimized yet in naga. I'll look into these when I find some time.
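For reference, the semantics of those two built-ins can also be modeled in plain Rust (again a sketch of my reading of the WGSL spec, not naga's implementation):

```rust
// Reference model of WGSL's `pack4xU8` / `unpack4xU8`: `pack4xU8` keeps the
// low 8 bits of each of four u32 lanes and packs them into one u32
// (least significant byte first); `unpack4xU8` reverses this.

fn pack4x_u8(v: [u32; 4]) -> u32 {
    v.iter()
        .enumerate()
        .fold(0u32, |acc, (i, &lane)| acc | ((lane & 0xff) << (8 * i)))
}

fn unpack4x_u8(p: u32) -> [u32; 4] {
    [0, 1, 2, 3].map(|i| (p >> (8 * i)) & 0xff)
}

fn main() {
    let packed = pack4x_u8([4, 3, 2, 1]);
    assert_eq!(packed, 0x01020304);
    // The round trip recovers the low byte of each lane.
    assert_eq!(unpack4x_u8(packed), [4, 3, 2, 1]);
}
```

Without a native instruction, each of these costs several shifts and masks per call, which adds up in inner loops over packed matrices.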

robamler added a commit to robamler/wgpu that referenced this pull request Apr 22, 2025
@robamler
Copy link
Contributor Author

robamler commented Apr 24, 2025

An alternative could involve adding the following field

  shader_integer_dot_product:
      Option<vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR<'static>>,

to wgpu-hal::vulkan::adapter::PhysicalDeviceFeatures. But I'm not sure how to communicate this property to naga. Also, I'm not sure how to initialize the field:

// in `PhysicalDeviceFeatures::from_extensions_and_requested_features`:
Self {
    // ...
    shader_integer_dot_product: if enabled_extensions
        .contains(&khr::shader_integer_dot_product::NAME)
    {
        Some(
            vk::PhysicalDeviceShaderIntegerDotProductFeaturesKHR::default()
                .shader_integer_dot_product(requested_features.intersects(
                    // Not sure what to put here. Always enable it just in case we end up using it? Or
                    // use something like `wgt::Features::NATIVE_PACKED_INTEGER_DOT_PRODUCT` after all?
                    todo!(),
                )),
        )
    } else {
        None
    },
}

The general issue here seems to be that, if I understand the code correctly (I might be wrong), naga currently lists all extensions that it needs, and wgpu-hal fails if any of them is unavailable. So naga can tell wgpu-hal that it needs something, but I don't see how naga could express that it would like to have something (with a graceful fallback if it's unavailable). Is there any way to communicate in the opposite direction, i.e., to let wgpu-hal tell naga that a potentially helpful (but not necessarily critical) extension is available?


teoxoy commented Apr 25, 2025

Is there any way to communicate in the opposite direction, i.e., to let wgpu-hal tell naga that a potentially helpful (but not necessarily critical) extension is available?

This part of the Vulkan backend is somewhat complicated but you can see an example here:

// Optional `VK_KHR_driver_properties`
if self.supports_extension(khr::driver_properties::NAME) {
    extensions.push(khr::driver_properties::NAME);
}

Then, related to my other comment #7574 (comment), wgpu-hal would add the capabilities if either the extension is available or the device supports Vulkan 1.3/SPIR-V 1.6.

For pointers on adding the extension boilerplate I'd look at other extensions (via find-all-references) to get an idea of what needs to be done.
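The "extension or core version" gating described above can be sketched like this (hypothetical types and names for illustration; the real wgpu-hal plumbing differs):

```rust
// Hypothetical sketch of how a backend could decide which optional shader
// capabilities to advertise to the shader compiler: a capability counts as
// available if the device exposes the extension *or* its API version makes
// the feature core (VK_KHR_shader_integer_dot_product was promoted to core
// in Vulkan 1.3). Names here are illustrative, not wgpu-hal's actual API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Capability {
    DotProduct,
    DotProductInput4x8BitPacked,
}

struct DeviceInfo {
    api_version: (u32, u32),
    extensions: Vec<&'static str>,
}

fn available_capabilities(dev: &DeviceInfo) -> Vec<Capability> {
    let mut caps = Vec::new();
    if dev
        .extensions
        .contains(&"VK_KHR_shader_integer_dot_product")
        || dev.api_version >= (1, 3)
    {
        caps.push(Capability::DotProduct);
        caps.push(Capability::DotProductInput4x8BitPacked);
    }
    caps
}

fn main() {
    let old_dev = DeviceInfo { api_version: (1, 2), extensions: vec![] };
    assert!(available_capabilities(&old_dev).is_empty());

    let new_dev = DeviceInfo { api_version: (1, 3), extensions: vec![] };
    assert_eq!(available_capabilities(&new_dev).len(), 2);
}
```

The shader compiler then emits the specialized instruction only when the corresponding capability is in the advertised set, falling back to a polyfill otherwise.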

@robamler robamler force-pushed the use-packed-vector-format branch from f3ddc4b to 3060c89 Compare April 27, 2025 20:47

robamler commented Apr 27, 2025

Thanks, this makes it much easier. Now the optimization is automatically enabled where available, without the user having to explicitly request it.

I'm still unsure how to add a test for this (but I verified again in a debugger that the optimized code does get generated on my machine).

I separated the introduction of a new function remove_variant into the last commit because I'm not sure about it. I think it is correct, i.e., one should strip the "variant" bits from the Vulkan API version before comparing it to predefined version constants, but this is arguably an unrelated issue (the code contains lots of existing comparisons to predefined version constants that I didn't change).
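Per the Vulkan spec's version encoding (variant in the top 3 bits, then 7 bits major, 10 bits minor, 12 bits patch), such a remove_variant could look like this (a sketch following the spec's VK_MAKE_API_VERSION macro, not necessarily this PR's code):

```rust
// Vulkan packs version numbers as:
//   variant (bits 31..29) | major (28..22) | minor (21..12) | patch (11..0)
// The predefined constants like VK_API_VERSION_1_3 have variant == 0, so a
// device-reported version must have its variant bits cleared before being
// compared against them; otherwise any non-zero variant decodes as a very
// high version.

fn remove_variant(version: u32) -> u32 {
    version & 0x1FFF_FFFF // clear the top 3 variant bits
}

fn make_api_version(variant: u32, major: u32, minor: u32, patch: u32) -> u32 {
    (variant << 29) | (major << 22) | (minor << 12) | patch
}

fn main() {
    let vk_1_3 = make_api_version(0, 1, 3, 0);
    // A hypothetical variant-1 implementation reporting "1.3.0":
    let variant_version = make_api_version(1, 1, 3, 0);

    // Without stripping the variant, the raw comparison is wildly wrong:
    assert!(variant_version > make_api_version(0, 99, 0, 0));
    // After stripping, it compares as expected:
    assert_eq!(remove_variant(variant_version), vk_1_3);
}
```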

@robamler robamler requested a review from teoxoy April 27, 2025 21:01
@robamler robamler force-pushed the use-packed-vector-format branch from 3060c89 to ae3159e Compare April 27, 2025 23:04

teoxoy commented Apr 28, 2025

I'm still unsure how to add a test for this (but I verified again in a debugger that the optimized code does get generated on my machine).

I would say the snapshot tests you added in #7574 are enough; we don't currently have a way to test that shaders generated by naga with params from wgpu-hal have a specific form.

I separated the introduction of a new function remove_variant into the last commit because I'm not sure about it. I think it is correct, i.e., one should strip the "variant" bits from the Vulkan API version before comparing it to predefined version constants, but this is arguably an unrelated issue (the code contains lots of existing comparisons to predefined version constants that I didn't change).

I wasn't aware there was a variant part in the version.

The variant indicates the variant of the Vulkan API supported by the implementation. This is always 0 for the Vulkan API.

A non-zero variant indicates the API is a variant of the Vulkan API and applications will typically need to be modified to run against it. The variant field was a later addition to the version number, added in version 1.2.175 of the Specification. As Vulkan uses variant 0, this change is fully backwards compatible with the previous version number format for Vulkan implementations. New version number macros have been added for this change and the old macros deprecated. For existing applications using the older format and macros, an implementation with non-zero variant will decode as a very high Vulkan version. The high version number should be detectable by applications performing suitable version checking.

Given that "This is always 0 for the Vulkan API." and "applications will typically need to be modified to run against it" this sounds like the device supports a modified version of the Vulkan API. Without knowing what the differences between other variants of the API and the official API are, there isn't a way for us to target those variants.

I think we should always assume the variant is 0 within wgpu-hal's backend. It would be nice to detect if the variant is anything other than 0 at instance and device creation and not expose the instance/device.

@robamler robamler force-pushed the use-packed-vector-format branch from ae3159e to b229326 Compare April 29, 2025 22:01
robamler (Contributor Author) commented:

I think we should always assume the variant is 0 within wgpu-hal's backend.

I removed the commit that messed with variant bits.

@robamler robamler requested a review from teoxoy April 29, 2025 22:21
@teoxoy teoxoy left a comment

Looks great, thank you!


teoxoy commented Apr 30, 2025

Could you rebase the PR, just to make sure there aren't any differences between what landed in #7574 and the commits in this PR?

@robamler robamler force-pushed the use-packed-vector-format branch from b576898 to fc96ad5 Compare April 30, 2025 09:30
@robamler robamler force-pushed the use-packed-vector-format branch from fc96ad5 to 9c1a8fa Compare April 30, 2025 09:32
robamler (Contributor Author) commented:

Done. Thank you very much for your patience with me on this one!

CI currently keeps failing at different places, but I think these are unrelated network issues (first try, second try).

@teoxoy teoxoy enabled auto-merge (squash) April 30, 2025 10:55
@teoxoy teoxoy merged commit 0d06284 into gfx-rs:trunk Apr 30, 2025
65 of 66 checks passed
@robamler robamler deleted the use-packed-vector-format branch April 30, 2025 11:10