[ET-VK] Use performant tiled algorithm for 4 bit weight only quantized linear #10236
Conversation
…d checks for physical limits when allocating

Pull Request resolved: #9974

## Context

At a high level, this diff prevents the allocation of textures that exceed physical texture limits, especially in the context of running transformer models.

Currently, the groupwise quantized int4 linear op implementation sets the scales and zeros tensor to be a `Texture3D`. However, for transformer models that have a logit linear layer, for instance, the image extents required may exceed the maximum image extents available on the device. Exceeding the maximum image extents can lead to undefined behaviour and should therefore be avoided.

Relatedly, the Vulkan delegate did not interpret the maximum image extents correctly. The physical device limits include three fields that indicate maximum image extents:

* `maxImageDimension1D`
* `maxImageDimension2D`
* `maxImageDimension3D`

Currently, the delegate interprets `maxImageDimension1D` as the maximum image extent along the width axis, `maxImageDimension2D` as the maximum extent along the height axis, and `maxImageDimension3D` as the maximum extent along the depth axis. In reality, `maxImageDimension3D` represents "the largest dimension (`width`, `height`, or `depth`) that is guaranteed to be supported for all images created with an `imageType` of `VK_IMAGE_TYPE_3D`". To properly guard against exceeding device limits, this misconception must be rectified.

As an additional consequence, the maximum image extent allowed for 3D tensors is much smaller than previously thought. Example maximum extents for an Adreno 740:

```
maxImageDimension1D 16384
maxImageDimension2D 16384
maxImageDimension3D 2048
```

Evidently, `maxImageDimension3D` is 8 times smaller than `maxImageDimension2D` or `maxImageDimension1D`. The exact ratio will differ depending on the GPU (on some GPUs it may even be the same), but in general this knowledge lowers the threshold at which tensors can be represented via `Texture3D`.

Anecdotally, I have also observed that on Adreno it is possible to allocate 3D images with extents that exceed `maxImageDimension3D`, and accessing these textures within a compute shader works fine as well. But I will have to do more research to determine whether I am just getting lucky and avoiding undefined behaviour, or whether the reported `maxImageDimension3D` is not entirely accurate.

To use texture storage for larger tensors, the `Texture2D` storage type should be used instead of `Texture3D`.

## Changes

* Changed the int4 linear operator to use buffer storage for scales and zeros. The storage type is not selected dynamically in the interest of reducing the number of shader variants that will need to be generated.
* Changed the int4 linear operator to use `Texture2D` for quantized weights instead of `Texture3D`, which should be a perf boost as well as raising the threshold up to which texture storage can still be used.
* When checking whether image extents are within physical limits, use `maxImageDimension3D` only, instead of treating `{maxImageDimension1D, maxImageDimension2D, maxImageDimension3D}` as separate per-axis limits.
* Before allocating a buffer or texture resource for a tensor, check that the resource fits within physical device limits.

ghstack-source-id: 278225007
Differential Revision: [D72662176](https://our.internmc.facebook.com/intern/diff/D72662176/)
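For illustration, here is a minimal host-side sketch of the kind of limit check described above. The function name and error handling are hypothetical, not the actual ExecuTorch Vulkan delegate code; only the Vulkan structures (`VkPhysicalDeviceProperties`, `VkPhysicalDeviceLimits`, `VkExtent3D`) and `vkGetPhysicalDeviceProperties` are real API. The key point is that `maxImageDimension3D` bounds every axis of a `VK_IMAGE_TYPE_3D` image, so a single limit is compared against all three extents.

```cpp
#include <vulkan/vulkan.h>
#include <stdexcept>

// Illustrative sketch only: validate a requested 3D image extent against
// maxImageDimension3D, which bounds width, height, AND depth of any
// VK_IMAGE_TYPE_3D image, rather than comparing each axis against a
// different maxImageDimensionND field.
void check_texture3d_extents(
    VkPhysicalDevice physical_device,
    const VkExtent3D& requested) {
  VkPhysicalDeviceProperties props{};
  vkGetPhysicalDeviceProperties(physical_device, &props);

  const uint32_t limit = props.limits.maxImageDimension3D;
  if (requested.width > limit || requested.height > limit ||
      requested.depth > limit) {
    // A real implementation would fall back to Texture2D or buffer storage
    // instead of risking undefined behaviour from an oversized Texture3D.
    throw std::runtime_error(
        "Requested Texture3D extents exceed maxImageDimension3D");
  }
}
```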
Pull Request resolved: #10030

## Context

Some Vulkan devices do not support 8-bit buffers, which is currently required to execute the int4 linear compute shader because the prepacking shader requires it. This diff bypasses that restriction by introducing a variant of the prepacking shader that does not need 8-bit buffers.

## Changes

Introduce a variant of the int4 weight prepacking shader that interprets the tensor data as an array of `uint` instead of `uint8_t`. Each `uint` represents 4 `uint8_t` values.

ghstack-source-id: 278225004
Differential Revision: [D72750897](https://our.internmc.facebook.com/intern/diff/D72750897/)
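A small sketch of the packing scheme described above, written in C++ rather than GLSL and assuming little-endian byte order within each 32-bit word; it is not the shader itself, just the byte-extraction idea that lets a shader without 8-bit buffer support read the same data by loading `uint` words.

```cpp
#include <array>
#include <cstdint>

// Illustrative only: one 32-bit word carries four 8-bit values, so a shader
// that cannot declare uint8_t buffers can still recover the original bytes
// with shifts and masks after loading the word as a uint.
std::array<uint8_t, 4> unpack_bytes(uint32_t word) {
  return {
      static_cast<uint8_t>(word & 0xFF),          // byte 0 (lowest)
      static_cast<uint8_t>((word >> 8) & 0xFF),   // byte 1
      static_cast<uint8_t>((word >> 16) & 0xFF),  // byte 2
      static_cast<uint8_t>((word >> 24) & 0xFF),  // byte 3 (highest)
  };
}
```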
## Context

As title. Add an alternative compute shader for int4 weight-only quantized linear that utilizes a co-operative algorithm. This shader is more performant than standard tiled algorithms for `gemv` cases, i.e. when `mat1` is a vector rather than a matrix.

## Changes

* Add the cooperative shader
* Use the cooperative shader when the height of `mat1` is 1

Differential Revision: [D73044650](https://our.internmc.facebook.com/intern/diff/D73044650/)
ghstack-source-id: 278225002
Pull Request resolved: #10204
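A hypothetical sketch of the dispatch rule described above; the function and shader names here are illustrative stand-ins, not the real operator registration code, which goes through the ExecuTorch Vulkan graph API.

```cpp
#include <cstdint>
#include <string>

// Illustrative selection logic: when mat1 has a single row (gemv), prefer a
// cooperative shader where a workgroup reduces one output together, so the
// quantized weight loads are shared; otherwise use the tiled gemm shader.
std::string select_int4_linear_shader(int64_t mat1_height) {
  if (mat1_height == 1) {
    return "q_4w_linear_coop";   // hypothetical name for the cooperative variant
  }
  return "q_4w_linear_tiled";    // tiled variant for the general gemm case
}
```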
…d linear

## Context

As title. Update the default compute shader for weight-only quantized int4 linear to use a tiled algorithm, which should boost performance for `gemm` cases, i.e. where `mat1` is a matrix.

## Changes

* Renamed `q_4w_linear` to `q_4w_linear_tiled`
* Updated the compute shader to use a tiled algorithm

Using a value of 3 for `TILE_ROWS`; I expect to add variants which switch between different output tile configurations.

Differential Revision: [D73044649](https://our.internmc.facebook.com/intern/diff/D73044649/)
ghstack-source-id: 278225005
Pull Request resolved: #10205
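To make the output-tiling idea concrete, here is a plain CPU sketch of a `TILE_ROWS = 3` tiled matmul. It is not the GLSL shader and omits quantization, dequantization, and remainder-row handling; it only shows why tiling helps: each loaded weight value feeds all three row accumulators before the next load.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t TILE_ROWS = 3;  // mirrors the TILE_ROWS value mentioned above

// Illustrative tiled matmul: out (M x N) = mat1 (M x K) * mat2 (K x N),
// all row-major. mat2 stands in for the dequantized 4-bit weights.
// Assumes M is a multiple of TILE_ROWS; remainder handling is omitted.
void tiled_matmul(
    const std::vector<float>& mat1,
    const std::vector<float>& mat2,
    std::vector<float>& out,
    size_t M, size_t K, size_t N) {
  for (size_t m0 = 0; m0 + TILE_ROWS <= M; m0 += TILE_ROWS) {
    for (size_t n = 0; n < N; ++n) {
      float acc[TILE_ROWS] = {0.f, 0.f, 0.f};
      for (size_t k = 0; k < K; ++k) {
        const float w = mat2[k * N + n];  // one weight load reused by all tile rows
        for (size_t r = 0; r < TILE_ROWS; ++r) {
          acc[r] += mat1[(m0 + r) * K + k] * w;
        }
      }
      for (size_t r = 0; r < TILE_ROWS; ++r) {
        out[(m0 + r) * N + n] = acc[r];
      }
    }
  }
}
```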
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10236
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
…d linear (pytorch#10236)

## Context

As title. Update the default compute shader for weight-only quantized int4 linear to use a tiled algorithm, which should boost performance for `gemm` cases, i.e. where `mat1` is a matrix.

## Changes

* Renamed `q_4w_linear` to `q_4w_linear_tiled`
* Updated the compute shader to use a tiled algorithm

Using a value of 3 for `TILE_ROWS`; I expect to add variants which switch between different output tile configurations.

Differential Revision: [D73044649](https://our.internmc.facebook.com/intern/diff/D73044649/)
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #10205 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/213/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/213/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/212/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/213/orig
@diff-train-skip-merge