Skip to content

Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HLSL #7574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 28, 2025

Conversation

robamler
Copy link
Contributor

@robamler robamler commented Apr 18, 2025

Connections

Description

Replace the polyfills for dot4I8Packed and dot4U8Packed from #7494 with specialized instructions where they are available. More precisely:

  • For HLSL, we check if Shader Model >= 6.4, in which case we use dot4add_{i,u}8packed (the specification explicitly states: "no separate capability bit check is required, beyond assuring the use of Shader Model 6.4")
  • For SPIR-V, we check if the capabilities DotProduct and DotProductInput4x8BitPacked are available, in which case we use the "Packed Vector Format" argument for Op{S,U}Dot (and we emit the corresponding OpCapability and OpExtension statements at the beginning of the output).

If either of these tests fail, or if we are on Metal or GLSL, we fall back to the existing polyfills.

Testing

I added two tests with .toml configuration files that explicitly enable/prevent these specializations.

Squash or Rebase?

I think each commit should pass all CI tests.

Open Questions / Notes

  • On SPIR-V, the capabilities required for this specialization are only defined if lang_version >= 1.6, but I didn't check for high enough lang_version because I already check the capabilities_available field. This follows precedent in other places in the code base, but I'm not sure if it's sufficient to only check capabilities_available. If capabilities_available == None, then comments in the code (here and here) indicate that this should be interpreted as "all capabilities are permitted", and so the specialized code will be generated. But I guess "all capabilities" is still restricted to "all capabilities that are defined for lang_version". Should I check the lang_version in this case somehow, or is capabilities_available == None only used for debugging anyway? Unfortunately, lang_version isn't exposed anymore at the stage where we do the check, so I'd have to add it as an extra field to the Writer.
  • Unclear whether OpSDot with "Packed Vector Format" expects signed or unsigned arguments in spv (see comment below) [Update: resolved in this comment]
  • On Metal, we still fall back to the polyfill because, as far as I can tell, msl doesn't have builtin support for integer dot products (but it does support packed integer vectors, see packed_char4 in Table 2.4 in the specification).

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@robamler robamler force-pushed the packed-vector-format branch from e06d774 to 705dc6d Compare April 18, 2025 19:57
@cwfitzgerald cwfitzgerald changed the title Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HSLS Use specialized intrinsics for dot4{I, U}8Packed on SPIR-V and HLSL Apr 18, 2025
When checking for capabilities in SPIR-V,
`capabilities_available == None` indicates that all capabilities are
available. However, some capabilities are not even defined for all
language versions, so we still need to check if the requested
capabilities even exist in the language version we're using.
@robamler robamler force-pushed the packed-vector-format branch from c93b2dd to be9debd Compare April 18, 2025 22:43
@robamler
Copy link
Contributor Author

robamler commented Apr 18, 2025

On SPIR-V, the capabilities required for this specialization are only defined if lang_version >= 1.6, but I didn't check for high enough lang_version because I already check the capabilities_available field. This follows precedent in other places in the code base, but I'm not sure if it's sufficient to only check capabilities_available. If capabilities_available == None, then comments in the code (here and here) indicate that this should be interpreted as "all capabilities are permitted", and so the specialized code will be generated. But I guess "all capabilities" is still restricted to "all capabilities that are defined for lang_version". Should I check the lang_version in this case somehow, or is capabilities_available == None only used for debugging anyway? Unfortunately, lang_version isn't exposed anymore at the stage where we do the check, so I'd have to add it as an extra field to the Writer.

In case the above text is confusing: I pushed be9debd as an example of how an additional check for lang_version could be implemented, but it's a bit hacky and I'm also not completely sure if this check is necessary (I don't know if capabilities_available == None is supposed to occur outside of tests).

@robamler

This comment was marked as resolved.

@robamler
Copy link
Contributor Author

The sign issue has been resolved. The implementation should be correct as is. Turns out SPIR-V doesn't care about signedness in this case: "all signed and unsigned operations always work on unsigned types, and the semantics of operation come from the opcode" (credits to Nicol Bolas, see this StackOverflow answer).

@robamler

This comment was marked as resolved.

robamler added a commit to robamler/wgpu that referenced this pull request Apr 22, 2025
See gfx-rs#7574, in particular
<gfx-rs#7574 (comment)>.

Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is
available on `Adapter`s that support the specialized implementations for
`dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this
includes DX12 with Shader Model >= 6.4 and Vulkan with device extension
"VK_KHR_shader_integer_dot_product".

If this feature is requested during `Device` creation, the device is set
up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their
respective specialized instructions. This means that, on a vulkan
`Device`, the SPIR-V language version is set to 1.6, and the required
SPIR-V capabilities are marked as available (on DX12, requesting the
feature doesn't change anything since availability of the feature
already guarantees that Shader Model >= 6.4, which is all we need to
generate specialized code).
robamler added a commit to bamler-lab/compressed-nn-ops-demo that referenced this pull request Apr 22, 2025
Using the specialized instructions from
<gfx-rs/wgpu#7574>.
@robamler
Copy link
Contributor Author

robamler commented Apr 22, 2025

I separated out the above issue of how these naga optimizations can be used in wgpu to #7595. I think the present PR can be reviewed independently from #7595.

robamler added a commit to robamler/wgpu that referenced this pull request Apr 22, 2025
See gfx-rs#7574, in particular
<gfx-rs#7574 (comment)>.

Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is
available on `Adapter`s that support the specialized implementations for
`dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this
includes DX12 with Shader Model >= 6.4 and Vulkan with device extension
"VK_KHR_shader_integer_dot_product").

If this feature is requested during `Device` creation, the device is set
up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their
respective specialized instructions. This means that, on a vulkan
`Device`, the SPIR-V language version is set to 1.6, and the required
SPIR-V capabilities are marked as available (on DX12, requesting the
feature doesn't change anything since availability of the feature
already guarantees that Shader Model >= 6.4, which is all we need to
generate specialized code).
@robamler robamler requested a review from cwfitzgerald April 22, 2025 17:13
@robamler
Copy link
Contributor Author

robamler commented Apr 24, 2025

Sorry for the back-and-forth on this one. I realized that I think I misunderstood the interplay between SPIR-V versions, SPIR-V extensions, and vulkan device extensions. My current understanding is:

  • The vulkan device extension VK_KHR_shader_integer_dot_product has a "dependency" on the SPIR-V extension SPV_KHR_integer_dot_product.
  • The latter requires only SPIR-V 1.0, and it defines ops and capabilities with the "KHR" suffix (e.g., OpSDotKHR and DotProductInput4x8BitPackedKHR).
  • These ops and capabilities seem to be integrated into SPIR-V proper (i.e., without requiring an extension) as of SPIR-V version 1.6 (see, e.g., right column here); here, their "KHR" suffix may be dropped (but may also still be used for backward compatibility).

Thus, as far as I understand now, the specialized instructions required for this optimization to work on SPIR-V should be available if we're either

  • on SPIR-V 1.6; or
  • on an older version of SPIR-V, and we requested the SPIR-V extension SPV_KHR_integer_dot_product, which we can do on older versions of SPIR-V as long as the vulkan adapter has device extension VK_KHR_shader_integer_dot_product.

In either case, we still have to check if the capabilities DotProductKHR and DotProductInput4x8BitPackedKHR are available (and request them).

Is my understanding correct? I'd appreciate any corrections because I want to make sure I don't introduce any regressions on exotic platforms.

@teoxoy
Copy link
Member

teoxoy commented Apr 25, 2025

Is my understanding correct? I'd appreciate any corrections because I want to make sure I don't introduce any regressions on exotic platforms.

That is correct. A few pointers in this direction: Writer::capabilities_available contains the capabilities passed by wgpu-hal so, it's enough to consult if the DotProduct & DotProductInput4x8BitPacked capabilities are present in capabilities_available, then based on that emit the fast path. Additionally, if the capabilities are present but we are emitting a version of SPIR-V lower than 1.6, only then call use_extension("SPV_KHR_integer_dot_product").

@teoxoy
Copy link
Member

teoxoy commented Apr 25, 2025

I don't know if capabilities_available == None is supposed to occur outside of tests

If capabilities_available is empty, that should really mean that no extra capabilities can be used.
Snapshot tests can also specify capabilities if needed (via the toml).

@teoxoy
Copy link
Member

teoxoy commented Apr 25, 2025

  • On Metal, we still fall back to the polyfill because, as far as I can tell, msl doesn't have builtin support for integer dot products (but it does support packed integer vectors, see packed_char4 in Table 2.4 in the specification).

It could be that if we use packed vectors as in gpuweb/gpuweb#2677 (comment) (or `char`` directly), the Metal compiler will compile it down to more specialized instructions.

Emit optimized code for `dot4{I, U}8Packed` regardless of SPIR-V version
as long as the required capabilities are available. On SPIR-V < 1.6,
require the extension "SPV_KHR_integer_dot_product" for this. On
SPIR-V >= 1.6, don't require the extension because the corresponding
capabilities are part of SPIR-V >= 1.6 proper.
@robamler robamler force-pushed the packed-vector-format branch 2 times, most recently from 848def1 to f44bd46 Compare April 27, 2025 20:19
@robamler
Copy link
Contributor Author

Thank you so much for the detailed feedback!

[...] it's enough to consult if the DotProduct & DotProductInput4x8BitPacked capabilities are present in capabilities_available, then based on that emit the fast path. Additionally, if the capabilities are present but we are emitting a version of SPIR-V lower than 1.6, only then call use_extension("SPV_KHR_integer_dot_product").

Done (in f44bd46).

It could be that if we use packed vectors as in gpuweb/gpuweb#2677 (comment) (or char directly), the Metal compiler will compile it down to more specialized instructions.

Thanks for the pointer! I implemented this in 848def1, but validating the generated msl code leads to error messages like "error: as_type cast from 'unsigned int' to 'packed_uchar4' (packed vector of 4 'unsigned char' values) is not allowed" (see CI output). I'm surprised by this since the MSL specification states in Section 2.20:

Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not a pointer) to be reinterpreted as another scalar or vector data type of the same size.

And Table 2.4 in Section 2.2.3 states the size of packed_char4 as 4 bytes (i.e., same as uint; it does have lower alignment requirements than uint, but that should only be an issue for the opposite conversion, i.e., from packed_char4 to uint, which we don't want to do here). Maybe packed_char4 doesn't count as a "scalar or vector data type"?

I've removed this commit from the PR for now (so Metal falls back again to the original polyfill).

@teoxoy
Copy link
Member

teoxoy commented Apr 28, 2025

It seems to only work in Metal 2.1+ https://shader-playground.timjones.io/6f0764f9a4538417e52931dd96849619, but I can't find this being mentioned in their spec. We can land this in a separate PR though since I think this one looks good already.

@teoxoy teoxoy merged commit bb83976 into gfx-rs:trunk Apr 28, 2025
72 checks passed
@robamler robamler deleted the packed-vector-format branch April 30, 2025 09:55
robamler added a commit to robamler/wgpu that referenced this pull request May 1, 2025
This might allow the Metal compiler to emit faster code (but that's not
confirmed). See
<gpuweb/gpuweb#2677 (comment)>
for the optimization. The limitation to Metal 2.1+ is discussed here:
<gfx-rs#7574 (comment)>.
robamler added a commit to robamler/wgpu that referenced this pull request May 1, 2025
This might allow the Metal compiler to emit faster code (but that's not
confirmed). See
<gpuweb/gpuweb#2677 (comment)>
for the optimization. The limitation to Metal 2.1+ is discussed here:
<gfx-rs#7574 (comment)>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants