Use specialized intrinsics for `dot4{I, U}8Packed` on SPIR-V and HLSL #7574

robamler · 2025-04-18T19:48:45Z

Connections

Following @teoxoy's comment in Implement dot4U8Packed and dot4I8Packed #7494 (comment)

Description

Replace the polyfills for dot4I8Packed and dot4U8Packed from #7494 with specialized instructions where they are available. More precisely:

For HLSL, we check if Shader Model >= 6.4, in which case we use dot4add_{i,u}8packed (the specification explicitly states: "no separate capability bit check is required, beyond assuring the use of Shader Model 6.4")
For SPIR-V, we check if the capabilities DotProduct and DotProductInput4x8BitPacked are available, in which case we use the "Packed Vector Format" argument for Op{S,U}Dot (and we emit the corresponding OpCapability and OpExtension statements at the beginning of the output).

If either of these tests fail, or if we are on Metal or GLSL, we fall back to the existing polyfills.

Testing

I added two tests with .toml configuration files that explicitly enable/prevent these specializations.

Squash or Rebase?

I think each commit should pass all CI tests.

Open Questions / Notes

On SPIR-V, the capabilities required for this specialization are only defined if lang_version >= 1.6, but I didn't check for high enough lang_version because I already check the capabilities_available field. This follows precedent in other places in the code base, but I'm not sure if it's sufficient to only check capabilities_available. If capabilities_available == None, then comments in the code (here and here) indicate that this should be interpreted as "all capabilities are permitted", and so the specialized code will be generated. But I guess "all capabilities" is still restricted to "all capabilities that are defined for lang_version". Should I check the lang_version in this case somehow, or is capabilities_available == None only used for debugging anyway? Unfortunately, lang_version isn't exposed anymore at the stage where we do the check, so I'd have to add it as an extra field to the Writer.
- ~~[Update: added a check in be9debd, but it's a bit hacky]~~
- [Update: I misunderstood this, see this comment; resolved in f44bd46.]
~~Unclear whether OpSDot with "Packed Vector Format" expects signed or unsigned arguments in spv (see comment below)~~ [Update: resolved in this comment]
On Metal, we still fall back to the polyfill because, as far as I can tell, msl doesn't have builtin support for integer dot products (but it does support packed integer vectors, see packed_char4 in Table 2.4 in the specification).

Checklist

Run cargo fmt.
Run taplo format.
Run cargo clippy --tests. If applicable, add:
- --target wasm32-unknown-unknown
Run cargo xtask test to run tests.
If this contains user-facing changes, add a CHANGELOG.md entry.

naga/src/back/spv/block.rs

When checking for capabilities in SPIR-V, `capabilities_available == None` indicates that all capabilities are available. However, some capabilities are not even defined for all language versions, so we still need to check if the requested capabilities even exist in the language version we're using.

robamler · 2025-04-18T22:44:19Z

On SPIR-V, the capabilities required for this specialization are only defined if lang_version >= 1.6, but I didn't check for high enough lang_version because I already check the capabilities_available field. This follows precedent in other places in the code base, but I'm not sure if it's sufficient to only check capabilities_available. If capabilities_available == None, then comments in the code (here and here) indicate that this should be interpreted as "all capabilities are permitted", and so the specialized code will be generated. But I guess "all capabilities" is still restricted to "all capabilities that are defined for lang_version". Should I check the lang_version in this case somehow, or is capabilities_available == None only used for debugging anyway? Unfortunately, lang_version isn't exposed anymore at the stage where we do the check, so I'd have to add it as an extra field to the Writer.

In case the above text is confusing: I pushed be9debd as an example of how an additional check for lang_version could be implemented, but it's a bit hacky and I'm also not completely sure if this check is necessary (I don't know if capabilities_available == None is supposed to occur outside of tests).

robamler · 2025-04-19T15:55:51Z

The sign issue has been resolved. The implementation should be correct as is. Turns out SPIR-V doesn't care about signedness in this case: "all signed and unsigned operations always work on unsigned types, and the semantics of operation come from the opcode" (credits to Nicol Bolas, see this StackOverflow answer).

See gfx-rs#7574, in particular <gfx-rs#7574 (comment)>. Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations for `dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with device extension "VK_KHR_shader_integer_dot_product". If this feature is requested during `Device` creation, the device is set up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their respective specialized instructions. This means that, on a vulkan `Device`, the SPIR-V language version is set to 1.6, and the required SPIR-V capabilities are marked as available (on DX12, requesting the feature doesn't change anything since availability of the feature already guarantees that Shader Model >= 6.4, which is all we need to generate specialized code).

Using the specialized instructions from <gfx-rs/wgpu#7574>.

robamler · 2025-04-22T16:51:30Z

I separated out the above issue of how these naga optimizations can be used in wgpu to #7595. I think the present PR can be reviewed independently from #7595.

See gfx-rs#7574, in particular <gfx-rs#7574 (comment)>. Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations for `dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with device extension "VK_KHR_shader_integer_dot_product"). If this feature is requested during `Device` creation, the device is set up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their respective specialized instructions. This means that, on a vulkan `Device`, the SPIR-V language version is set to 1.6, and the required SPIR-V capabilities are marked as available (on DX12, requesting the feature doesn't change anything since availability of the feature already guarantees that Shader Model >= 6.4, which is all we need to generate specialized code).

robamler · 2025-04-24T12:07:07Z

Sorry for the back-and-forth on this one. I realized that I think I misunderstood the interplay between SPIR-V versions, SPIR-V extensions, and vulkan device extensions. My current understanding is:

The vulkan device extension VK_KHR_shader_integer_dot_product has a "dependency" on the SPIR-V extension SPV_KHR_integer_dot_product.
The latter requires only SPIR-V 1.0, and it defines ops and capabilities with the "KHR" suffix (e.g., OpSDotKHR and DotProductInput4x8BitPackedKHR).
These ops and capabilities seem to be integrated into SPIR-V proper (i.e., without requiring an extension) as of SPIR-V version 1.6 (see, e.g., right column here); here, their "KHR" suffix may be dropped (but may also still be used for backward compatibility).

Thus, as far as I understand now, the specialized instructions required for this optimization to work on SPIR-V should be available if we're either

on SPIR-V 1.6; or
on an older version of SPIR-V, and we requested the SPIR-V extension SPV_KHR_integer_dot_product, which we can do on older versions of SPIR-V as long as the vulkan adapter has device extension VK_KHR_shader_integer_dot_product.

In either case, we still have to check if the capabilities DotProductKHR and DotProductInput4x8BitPackedKHR are available (and request them).

Is my understanding correct? I'd appreciate any corrections because I want to make sure I don't introduce any regressions on exotic platforms.

teoxoy · 2025-04-25T13:45:14Z

Is my understanding correct? I'd appreciate any corrections because I want to make sure I don't introduce any regressions on exotic platforms.

That is correct. A few pointers in this direction: Writer::capabilities_available contains the capabilities passed by wgpu-hal so, it's enough to consult if the DotProduct & DotProductInput4x8BitPacked capabilities are present in capabilities_available, then based on that emit the fast path. Additionally, if the capabilities are present but we are emitting a version of SPIR-V lower than 1.6, only then call use_extension("SPV_KHR_integer_dot_product").

teoxoy · 2025-04-25T13:46:56Z

I don't know if capabilities_available == None is supposed to occur outside of tests

If capabilities_available is empty, that should really mean that no extra capabilities can be used.
Snapshot tests can also specify capabilities if needed (via the toml).

teoxoy · 2025-04-25T13:51:36Z

On Metal, we still fall back to the polyfill because, as far as I can tell, msl doesn't have builtin support for integer dot products (but it does support packed integer vectors, see packed_char4 in Table 2.4 in the specification).

It could be that if we use packed vectors as in gpuweb/gpuweb#2677 (comment) (or `char`` directly), the Metal compiler will compile it down to more specialized instructions.

Emit optimized code for `dot4{I, U}8Packed` regardless of SPIR-V version as long as the required capabilities are available. On SPIR-V < 1.6, require the extension "SPV_KHR_integer_dot_product" for this. On SPIR-V >= 1.6, don't require the extension because the corresponding capabilities are part of SPIR-V >= 1.6 proper.

robamler · 2025-04-27T20:39:39Z

Thank you so much for the detailed feedback!

[...] it's enough to consult if the DotProduct & DotProductInput4x8BitPacked capabilities are present in capabilities_available, then based on that emit the fast path. Additionally, if the capabilities are present but we are emitting a version of SPIR-V lower than 1.6, only then call use_extension("SPV_KHR_integer_dot_product").

Done (in f44bd46).

It could be that if we use packed vectors as in gpuweb/gpuweb#2677 (comment) (or char directly), the Metal compiler will compile it down to more specialized instructions.

Thanks for the pointer! I implemented this in 848def1, but validating the generated msl code leads to error messages like "error: as_type cast from 'unsigned int' to 'packed_uchar4' (packed vector of 4 'unsigned char' values) is not allowed" (see CI output). I'm surprised by this since the MSL specification states in Section 2.20:

Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not a pointer) to be reinterpreted as another scalar or vector data type of the same size.

And Table 2.4 in Section 2.2.3 states the size of packed_char4 as 4 bytes (i.e., same as uint; it does have lower alignment requirements than uint, but that should only be an issue for the opposite conversion, i.e., from packed_char4 to uint, which we don't want to do here). Maybe packed_char4 doesn't count as a "scalar or vector data type"?

I've removed this commit from the PR for now (so Metal falls back again to the original polyfill).

teoxoy · 2025-04-28T14:34:52Z

It seems to only work in Metal 2.1+ https://shader-playground.timjones.io/6f0764f9a4538417e52931dd96849619, but I can't find this being mentioned in their spec. We can land this in a separate PR though since I think this one looks good already.

This might allow the Metal compiler to emit faster code (but that's not confirmed). See <gpuweb/gpuweb#2677 (comment)> for the optimization. The limitation to Metal 2.1+ is discussed here: <gfx-rs#7574 (comment)>.

robamler added 3 commits April 18, 2025 21:25

Use intrinsics for dot4{I, U}8Packed in HLSL

b7691d3

Use intrinsics for dot4{I, U}8Packed on spv

ae97f6d

Add intrinsics for dot4{I,U}8Packed to changelog

705dc6d

robamler force-pushed the packed-vector-format branch from e06d774 to 705dc6d Compare April 18, 2025 19:57

cwfitzgerald changed the title ~~Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HSLS~~ Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HLSL Apr 18, 2025

cwfitzgerald reviewed Apr 18, 2025

View reviewed changes

naga/src/back/spv/block.rs Outdated Show resolved Hide resolved

robamler force-pushed the packed-vector-format branch from c93b2dd to be9debd Compare April 18, 2025 22:43

This comment was marked as resolved.

Sign in to view

robamler added a commit to bamler-lab/compressed-nn-ops-demo that referenced this pull request Apr 22, 2025

Use packed vector format in both shaders

609bcd5

Using the specialized instructions from <gfx-rs/wgpu#7574>.

robamler mentioned this pull request Apr 22, 2025

Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed #7595

Merged

8 tasks

robamler requested a review from cwfitzgerald April 22, 2025 17:13

cwfitzgerald assigned teoxoy Apr 23, 2025

robamler force-pushed the packed-vector-format branch 2 times, most recently from 848def1 to f44bd46 Compare April 27, 2025 20:19

teoxoy approved these changes Apr 28, 2025

View reviewed changes

teoxoy merged commit bb83976 into gfx-rs:trunk Apr 28, 2025
72 checks passed

robamler deleted the packed-vector-format branch April 30, 2025 09:55

robamler mentioned this pull request May 1, 2025

Potentially optimize dot4{I,U}8Packed on Metal #7653

Open

6 tasks

robamler mentioned this pull request May 3, 2025

[naga] Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl #7664

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use specialized intrinsics for `dot4{I, U}8Packed` on SPIR-V and HLSL #7574

Use specialized intrinsics for `dot4{I, U}8Packed` on SPIR-V and HLSL #7574

robamler commented Apr 18, 2025 •

edited

Loading

robamler commented Apr 18, 2025 •

edited

Loading

This comment was marked as resolved.

robamler commented Apr 19, 2025

This comment was marked as resolved.

robamler commented Apr 22, 2025 •

edited

Loading

robamler commented Apr 24, 2025 •

edited

Loading

teoxoy commented Apr 25, 2025

teoxoy commented Apr 25, 2025

teoxoy commented Apr 25, 2025

robamler commented Apr 27, 2025

teoxoy commented Apr 28, 2025

Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HLSL #7574

Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HLSL #7574

Conversation

robamler commented Apr 18, 2025 • edited Loading

robamler commented Apr 18, 2025 • edited Loading

This comment was marked as resolved.

robamler commented Apr 19, 2025

This comment was marked as resolved.

robamler commented Apr 22, 2025 • edited Loading

robamler commented Apr 24, 2025 • edited Loading

teoxoy commented Apr 25, 2025

teoxoy commented Apr 25, 2025

teoxoy commented Apr 25, 2025

robamler commented Apr 27, 2025

teoxoy commented Apr 28, 2025

Use specialized intrinsics for `dot4{I, U}8Packed` on SPIR-V and HLSL #7574

Use specialized intrinsics for `dot4{I, U}8Packed` on SPIR-V and HLSL #7574

robamler commented Apr 18, 2025 •

edited

Loading

robamler commented Apr 18, 2025 •

edited

Loading

robamler commented Apr 22, 2025 •

edited

Loading

robamler commented Apr 24, 2025 •

edited

Loading