[mlir] How to best avoid masking in this case? #143920

@banach-space

Description

High-level problem description

We want to compile linalg.mmt4d via "scalable" vectorisation:

%out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) 
                   outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

In our case, the "N" dimension (i.e. the 2nd parallel dimension) has been tiled using "scalable" tile sizes of 4 x vscale (always a multiple of the native vector size):

  %pad_val = arith.constant 123 : i8

  // Scalable tile size
  %vs = vector.vscale
  %c4 = arith.constant 4 : index
  %tile_size = arith.muli %c4, %vs : index

  // Initialise the output for linalg.pack
  %rhs_empty = tensor.empty(%dim, %tile_size) : tensor<?x2x?x8xi8>

  %rhs = linalg.pack %B
    padding_value(%pad_val : i8)
    inner_dims_pos = [1, 0]                    // Transposition!
    inner_tiles = [%tile_size, 8]
    into %rhs_empty : tensor<15x8xi8> -> tensor<?x2x?x8xi8>

Since packing includes padding, in this particular example we know that %rhs can be split evenly into vectors of size 4 x vscale. Specifically:

  • Masking should not be required (see the sketch below).
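
In IR terms (a sketch reusing names from the vectorizer output below), padding guarantees that %dim_0 is a multiple of %c4_vscale, so the per-iteration tile size always resolves to the full tile:

   // %arg7 only takes multiples of the step %c4_vscale, and %dim_0 is a
   // multiple of %c4_vscale (thanks to the padding in linalg.pack), so:
   %3 = affine.min affine_map<(d0)[s0, s1] -> (-d0 + s0, s1)>(%arg7)[%dim_0, %c4_vscale]
   // always evaluates to %c4_vscale - any mask built from %3 is all-true.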

QUESTION: How can we leverage this high-level information and avoid masking?

The Current Output From The Vectorizer (and what makes it tricky)

The current MLIR output from the vectorizer, shown below, is perfectly valid.

Expand To See MLIR Vectorization Output
#map = affine_map<(d0)[s0, s1] -> (-d0 + s0, s1)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, 0, d1, d2, 0, d3)>
#map2 = affine_map<(d0, d1, d2, d3) -> (0, d0, d1, 0, d2, d3)>
module {
  func.func @main(%arg0: tensor<2x2x4x8xi8>, %arg1: tensor<?x2x?x8xi8>, %arg2: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
    %c0_i32 = arith.constant 0 : i32
    %c0_i8 = arith.constant 0 : i8
    %c8 = arith.constant 8 : index
    %c4 = arith.constant 4 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x2x?x8xi8>
    %dim_0 = tensor.dim %arg1, %c2 : tensor<?x2x?x8xi8>
    %vscale = vector.vscale
    %c4_vscale = arith.muli %vscale, %c4 : index
    %0 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %arg2) -> (tensor<2x?x4x?xi32>) {
      %1 = scf.for %arg5 = %c0 to %dim step %c1 iter_args(%arg6 = %arg4) -> (tensor<2x?x4x?xi32>) {
        %2 = scf.for %arg7 = %c0 to %dim_0 step %c4_vscale iter_args(%arg8 = %arg6) -> (tensor<2x?x4x?xi32>) {
          %3 = affine.min #map(%arg7)[%dim_0, %c4_vscale]
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %3, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %4 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %3, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %5 = vector.transfer_read %extracted_slice_3[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map1} : tensor<1x1x4x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
            %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>
            %8 = vector.create_mask %c1, %c1, %c4, %3 : vector<1x1x4x[4]xi1>
            %9 = vector.mask %8 { vector.transfer_read %extracted_slice_5[%c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true, true, true]} : tensor<1x1x4x?xi32>, vector<1x1x4x[4]xi32> } : vector<1x1x4x[4]xi1> -> vector<1x1x4x[4]xi32>
            %10 = arith.extsi %5 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %11 = arith.extsi %7 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %12 = arith.muli %10, %11 : vector<1x1x1x4x[4]x8xi32>
            %13 = vector.create_mask %c1, %c1, %c1, %c4, %3, %c8 : vector<1x1x1x4x[4]x8xi1>
            %14 = vector.mask %13 { vector.multi_reduction <add>, %12, %9 [2, 5] : vector<1x1x1x4x[4]x8xi32> to vector<1x1x4x[4]xi32> } : vector<1x1x1x4x[4]x8xi1> -> vector<1x1x4x[4]xi32>
            %15 = vector.mask %8 { vector.transfer_write %14, %extracted_slice_5[%c0, %c0, %c0, %c0] {in_bounds = [true, true, true, true]} : vector<1x1x4x[4]xi32>, tensor<1x1x4x?xi32> } : vector<1x1x4x[4]xi1> -> tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %15 into %arg10[0, 0, 0, 0] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %4 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        scf.yield %2 : tensor<2x?x4x?xi32>
      }
      scf.yield %1 : tensor<2x?x4x?xi32>
    }
    return %0 : tensor<2x?x4x?xi32>
  }
}

However, the shapes in these Ops are very challenging:

   %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
   %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>

Note the non-trailing scalable dimension ([4] followed by a fixed dim of 8) - that's currently not supported.
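
For contrast, an illustration of the constraint (lowerings generally expect the scalable dimension to be trailing; %c4_vscale = 4 * vscale as elsewhere):

   // Supported today: the scalable dim is trailing.
   %m0 = vector.create_mask %c4, %c4_vscale : vector<4x[4]xi1>
   // Problematic: [4] is followed by a fixed dim of 8, as in %6 above.
   %m1 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>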

Expand For Full MLIR Reproducer
func.func @main(%lhs: tensor<2x2x4x8xi8>, %rhs:tensor<?x2x?x8xi8>, %acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
  %out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

  return %out : tensor<2x?x4x?xi32>
}

module @transforms attributes { transform.with_named_sequence } {
  transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
    %func = transform.structured.match ops{["func.func"]} in %module : (!transform.any_op) -> !transform.op<"func.func">
    %mmt4d = transform.structured.match ops{["linalg.mmt4d"]} in %module : (!transform.any_op) -> !transform.any_op

    // Step 1: Tile
    // Tile parallel dims
    %tiled_linalg_op_p, %loops:4 = transform.structured.tile_using_for %mmt4d tile_sizes [1, 1, 0, 4, [4], 0]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">)
    // Tile reduction dims
    %tiled_linalg_op_r, %loops2:2 = transform.structured.tile_using_for %tiled_linalg_op_p tile_sizes [0, 0, 1, 0, 0, 8]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">)

    // Step 2: Vectorize
    transform.structured.vectorize %tiled_linalg_op_r vector_sizes [1, 1, 1, 4, [4], 8] : !transform.any_op

    transform.yield
  }
}

NOTE: You will need to apply this diff to unblock "scalable" vectorization of linalg.mmt4d:

diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index ff28bd7c4834..3a34dfa06c67 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -2465,6 +2465,7 @@ vectorizeScalableVectorPrecondition(Operation *op,
   // presence of scalable vectors
   return success(isElementwise(linalgOp) || isa<linalg::MatmulOp>(op) ||
                  isa<linalg::MatmulTransposeAOp>(op) ||
+                 isa<linalg::Mmt4DOp>(op) ||
                  isa<linalg::DepthwiseConv1DNwcWcOp>(op) ||
                  isa<linalg::MatvecOp>(op) || hasReductionIterator(linalgOp));
 }

Is Code-Gen Easier Without Masking? Yes!

Without masks, we are left with this vector.transfer_read:

   %read = vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8>

Under the assumption that the input came from linalg.pack, we know that all accesses are "in-bounds" and can "flatten" this vector.transfer_read post-bufferisation (see: #143146).

Put differently, the assumptions that let us avoid masks also make this vector.transfer_read much easier to deal with.
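
As a rough sketch (assuming bufferisation yields a contiguous buffer; shapes are illustrative), the flattened form would collapse all dims and read a single 1-D scalable vector (4 x vscale rows of 8 bytes = 32 x vscale elements):

   %collapsed = memref.collapse_shape %buffer [[0, 1, 2, 3]]
       : memref<1x1x?x8xi8> into memref<?xi8>
   %flat = vector.transfer_read %collapsed[%c0], %c0_i8 {in_bounds = [true]}
       : memref<?xi8>, vector<[32]xi8>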

How To Avoid Masking?

There are ~2 options:

  • Option 1: Assume that the inputs to linalg.mmt4d are multiples of "native-vector-sizes".
    • With this knowledge (like in our case), it is safe to conclude that masks are not required.
    • To me this feels like a realistic assumption.
    • Note, we will not be able to verify this at compile-time.
  • Option 2: Allow disabling masking through a bool (see the sketch after this list).
    • This could look like: transform.structured.vectorize %mmt4d_main vector_sizes [1, 1, 1, 4, [4], 8] {disable_masking} : !transform.any_op
    • Ultimately, this is a "short-cut" and we should somehow restrict it to avoid it being exploited excessively.
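
To make Option 2 concrete, here it is in the context of the first reproducer (a sketch; disable_masking is hypothetical and not implemented today):

   // Hypothetical unit attribute - everything else matches the reproducer above.
   transform.structured.vectorize %tiled_linalg_op_r vector_sizes [1, 1, 1, 4, [4], 8] {disable_masking}
     : !transform.any_op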

What Are The Alternatives? (when masks are present)

Alternative 1

Flatten the mask so that vector.transfer_read can also be flattened:

   %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
   %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>

This will require a bit of vector "packing" + "unpacking". In principle, the original 2D mask could look like this:

**** **XX
**** **XX
**** **XX
XXXX XXXX

The flattened 1D mask would look like this:

**** **XX **** **XX **** **XX XXXX XXXX

This should be achievable with vector.scalable.insert, but it will not be pretty if we want full generality :) Importantly, this would introduce a loop over the scalable dimension inside a "hot" kernel/loop - not great for performance.
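
For reference, vector.scalable.insert only takes a static position (a sketch; %chunk and %acc are hypothetical values), which is part of what makes the general, runtime-indexed case awkward:

   // Insert a fixed 8-element chunk at static offset 8 of a scalable vector;
   // a runtime offset would require unrolling or a different mechanism.
   %v = vector.scalable.insert %chunk, %acc[8] : vector<8xi1> into vector<[8]xi1>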

Also, in practice it should not be needed if we know that the underlying mask is all-true.
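
For that all-true case, existing folds should already help (a sketch; this relies on vector.create_mask folding when its bounds provably cover the whole vector):

   // In the packed case %3 == %c4_vscale, so the mask covers the full vector:
   %mask = vector.create_mask %c1, %c1, %c4_vscale, %c8 : vector<1x1x[4]x8xi1>
   // ... which should fold to an all-true vector.constant_mask, at which point
   // vector.mask { vector.transfer_read ... } folds to a plain vector.transfer_read.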

Alternative 2

Below is a modified example that peels the loop containing the scalable sizes. Note:

  • The main loop contains no masks (post-canonicalization). That's great and easy to handle!
  • The remainder loop, had it been vectorized, would contain masks. But we know that we will never "hit" that code (due to packing), so we could lower it to scalar code and skip vectorization there (see the sketch below).
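
The peeling arithmetic itself is simple (a sketch matching #map and %2 in the output below):

   // Split point: the largest multiple of the step that does not exceed the bound.
   %split = affine.apply affine_map<()[s0, s1] -> (s0 - s0 mod s1)>()[%dim_0, %c4_vscale]
   // Main loop:      [0, %split)       step %c4_vscale - full tiles only, no masks.
   // Remainder loop: [%split, %dim_0)  step %c4_vscale - at most one partial tile.
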
Expand To See MLIR Peeling + Vectorization Output
#map = affine_map<()[s0, s1] -> (s0 - s0 mod s1)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, 0, d1, d2, 0, d3)>
#map2 = affine_map<(d0, d1, d2, d3) -> (0, d0, d1, 0, d2, d3)>
#map3 = affine_map<(d0)[s0] -> (-d0 + s0)>
module {
  func.func @main(%arg0: tensor<2x2x4x8xi8>, %arg1: tensor<?x2x?x8xi8>, %arg2: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
    %c0_i32 = arith.constant 0 : i32
    %c0_i8 = arith.constant 0 : i8
    %c4 = arith.constant 4 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x2x?x8xi8>
    %dim_0 = tensor.dim %arg1, %c2 : tensor<?x2x?x8xi8>
    %vscale = vector.vscale
    %c4_vscale = arith.muli %vscale, %c4 : index
    %0 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %arg2) -> (tensor<2x?x4x?xi32>) {
      %1 = scf.for %arg5 = %c0 to %dim step %c1 iter_args(%arg6 = %arg4) -> (tensor<2x?x4x?xi32>) {
        %2 = affine.apply #map()[%dim_0, %c4_vscale]
        %3 = scf.for %arg7 = %c0 to %2 step %c4_vscale iter_args(%arg8 = %arg6) -> (tensor<2x?x4x?xi32>) {
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %c4_vscale, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %5 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %c4_vscale, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %6 = vector.transfer_read %extracted_slice_3[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map1} : tensor<1x1x4x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %7 = vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %8 = vector.transfer_read %extracted_slice_5[%c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true, true, true]} : tensor<1x1x4x?xi32>, vector<1x1x4x[4]xi32>
            %9 = arith.extsi %6 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %10 = arith.extsi %7 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %11 = arith.muli %9, %10 : vector<1x1x1x4x[4]x8xi32>
            %12 = vector.multi_reduction <add>, %11, %8 [2, 5] : vector<1x1x1x4x[4]x8xi32> to vector<1x1x4x[4]xi32>
            %13 = vector.transfer_write %12, %extracted_slice_5[%c0, %c0, %c0, %c0] {in_bounds = [true, true, true, true]} : vector<1x1x4x[4]xi32>, tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %13 into %arg10[0, 0, 0, 0] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %5 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        %4 = scf.for %arg7 = %2 to %dim_0 step %c4_vscale iter_args(%arg8 = %3) -> (tensor<2x?x4x?xi32>) {
          %5 = affine.apply #map3(%arg7)[%dim_0]
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %5, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %6 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %5, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %7 = linalg.mmt4d ins(%extracted_slice_3, %extracted_slice_4 : tensor<1x1x4x8xi8>, tensor<1x1x?x8xi8>) outs(%extracted_slice_5 : tensor<1x1x4x?xi32>) -> tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %7 into %arg10[0, 0, 0, 0] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %6 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        scf.yield %4 : tensor<2x?x4x?xi32>
      }
      scf.yield %1 : tensor<2x?x4x?xi32>
    }
    return %0 : tensor<2x?x4x?xi32>
  }
}
Expand For A Full MLIR Reproducer
func.func @main(%lhs: tensor<2x2x4x8xi8>, %rhs:tensor<?x2x?x8xi8>, %acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
  %out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

  return %out : tensor<2x?x4x?xi32>
}

module @transforms attributes { transform.with_named_sequence } {
  transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
    %func = transform.structured.match ops{["func.func"]} in %module : (!transform.any_op) -> !transform.op<"func.func">
    %mmt4d = transform.structured.match ops{["linalg.mmt4d"]} in %module : (!transform.any_op) -> !transform.any_op

    // Step 1: Tile
    // Tile parallel dims
    %tiled_linalg_op_p, %loops:4 = transform.structured.tile_using_for %mmt4d tile_sizes [1, 1, 0, 4, [4], 0]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">)
    // Tile reduction dims
    %tiled_linalg_op_r, %loops2:2 = transform.structured.tile_using_for %tiled_linalg_op_p tile_sizes [0, 0, 1, 0, 0, 8]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">)

    // 2. Loop peeling (only the middle dimension)
    %main_loop, %remainder_loop = transform.loop.peel %loops#3 : (!transform.op<"scf.for">) -> (!transform.op<"scf.for">, !transform.op<"scf.for">)

    // Step 3: Vectorize
    %mmt4d_main = transform.structured.match ops{["linalg.mmt4d"]} in %main_loop : (!transform.op<"scf.for">) -> !transform.any_op
    transform.structured.vectorize %mmt4d_main vector_sizes [1, 1, 1, 4, [4], 8] : !transform.any_op

    transform.yield
  }
}

Other Questions

Question 1: Is it safe to assume that the input to linalg.mmt4d comes from linalg.pack + linalg.unpack?

  • AFAIK, linalg.mmt4d is never used as a standalone operation.
  • If this assumption is true, then we should document it.

Question 2: Wouldn't it be impractical to use sizes that are not multiples of the "native-vector-size" in linalg.mmt4d inputs?

  • How can we capture this? Via attributes?
  • While "impractical", strictly requiring multiples of the "native-vector-size" would be prohibitively restrictive.

Next Steps

While we do want to look into Alternative 1 and Alternative 2, we would like to leave them as TODOs for the very near future. In the meantime, adding a bool to disable masking (this could be specific to linalg.mmt4d) would unblock some other work for us.

CC @dcaballe
