[mlir] How to best avoid masking in this case? #143920

@banach-space

Description

High-level problem description

We want to compile linalg.mmt4d via "scalable" vectorisation:

%out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) 
                   outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

In our case, the "N" dimension (i.e. the 2nd parallel dimension) has been tiled using "scalable" tile sizes of 4 x vscale (always a multiple of the native vector size):

  %pad_val = arith.constant 123 : i8

  // Scalable tile size
  %vs = vector.vscale
  %c4 = arith.constant 4 : index
  %tile_size = arith.muli %c4, %vs : index

  // Initialise the output for linalg.pack
  %rhs_empty = tensor.empty(%dim, %tile_size) : tensor<?x2x?x8xi8>

  %rhs = linalg.pack %B
    padding_value(%pad_val : i8)
    inner_dims_pos = [1, 0]                    // Transposition!
    inner_tiles = [%tile_size, 8]
    into %rhs_empty : tensor<15x8xi8> -> tensor<?x2x?x8xi8>

Since packing includes padding, in this particular example we know that %rhs can be split evenly into vectors of size 4 x vscale. Specifically:

  • Masking should not be required (see the sketch below).
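
In IR terms (a sketch reusing names from the vectorizer output below), padding guarantees that %dim_0 is a multiple of %c4_vscale, so the per-iteration tile size always resolves to the full tile:

   // %arg7 only takes multiples of the step %c4_vscale, and %dim_0 is a
   // multiple of %c4_vscale (thanks to the padding in linalg.pack), so:
   %3 = affine.min affine_map<(d0)[s0, s1] -> (-d0 + s0, s1)>(%arg7)[%dim_0, %c4_vscale]
   // always evaluates to %c4_vscale - any mask built from %3 is all-true.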

QUESTION: How can we leverage this high-level information and avoid masking?

The Current Output From The Vectorizer (and what makes it tricky)

The current MLIR output from the vectorizer, shown below, is perfectly valid.

Expand To See MLIR Vectorization Output
#map = affine_map<(d0)[s0, s1] -> (-d0 + s0, s1)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, 0, d1, d2, 0, d3)>
#map2 = affine_map<(d0, d1, d2, d3) -> (0, d0, d1, 0, d2, d3)>
module {
  func.func @main(%arg0: tensor<2x2x4x8xi8>, %arg1: tensor<?x2x?x8xi8>, %arg2: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
    %c0_i32 = arith.constant 0 : i32
    %c0_i8 = arith.constant 0 : i8
    %c8 = arith.constant 8 : index
    %c4 = arith.constant 4 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x2x?x8xi8>
    %dim_0 = tensor.dim %arg1, %c2 : tensor<?x2x?x8xi8>
    %vscale = vector.vscale
    %c4_vscale = arith.muli %vscale, %c4 : index
    %0 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %arg2) -> (tensor<2x?x4x?xi32>) {
      %1 = scf.for %arg5 = %c0 to %dim step %c1 iter_args(%arg6 = %arg4) -> (tensor<2x?x4x?xi32>) {
        %2 = scf.for %arg7 = %c0 to %dim_0 step %c4_vscale iter_args(%arg8 = %arg6) -> (tensor<2x?x4x?xi32>) {
          %3 = affine.min #map(%arg7)[%dim_0, %c4_vscale]
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %3, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %4 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %3, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %5 = vector.transfer_read %extracted_slice_3[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map1} : tensor<1x1x4x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
            %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>
            %8 = vector.create_mask %c1, %c1, %c4, %3 : vector<1x1x4x[4]xi1>
            %9 = vector.mask %8 { vector.transfer_read %extracted_slice_5[%c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true, true, true]} : tensor<1x1x4x?xi32>, vector<1x1x4x[4]xi32> } : vector<1x1x4x[4]xi1> -> vector<1x1x4x[4]xi32>
            %10 = arith.extsi %5 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %11 = arith.extsi %7 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %12 = arith.muli %10, %11 : vector<1x1x1x4x[4]x8xi32>
            %13 = vector.create_mask %c1, %c1, %c1, %c4, %3, %c8 : vector<1x1x1x4x[4]x8xi1>
            %14 = vector.mask %13 { vector.multi_reduction <add>, %12, %9 [2, 5] : vector<1x1x1x4x[4]x8xi32> to vector<1x1x4x[4]xi32> } : vector<1x1x1x4x[4]x8xi1> -> vector<1x1x4x[4]xi32>
            %15 = vector.mask %8 { vector.transfer_write %14, %extracted_slice_5[%c0, %c0, %c0, %c0] {in_bounds = [true, true, true, true]} : vector<1x1x4x[4]xi32>, tensor<1x1x4x?xi32> } : vector<1x1x4x[4]xi1> -> tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %15 into %arg10[0, 0, 0, 0] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %4 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %3] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        scf.yield %2 : tensor<2x?x4x?xi32>
      }
      scf.yield %1 : tensor<2x?x4x?xi32>
    }
    return %0 : tensor<2x?x4x?xi32>
  }
}

However, the shapes in these Ops are very challenging:

   %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
   %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>

Note the non-trailing scalable dimension ([4] followed by a fixed dim of 8) - that's currently not supported.
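
For contrast, an illustration of the constraint (lowerings generally expect the scalable dimension to be trailing; %c4_vscale = 4 * vscale as elsewhere):

   // Supported today: the scalable dim is trailing.
   %m0 = vector.create_mask %c4, %c4_vscale : vector<4x[4]xi1>
   // Problematic: [4] is followed by a fixed dim of 8, as in %6 above.
   %m1 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>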

Expand For Full MLIR Reproducer
func.func @main(%lhs: tensor<2x2x4x8xi8>, %rhs:tensor<?x2x?x8xi8>, %acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
  %out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

  return %out : tensor<2x?x4x?xi32>
}

module @transforms attributes { transform.with_named_sequence } {
  transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
    %func = transform.structured.match ops{["func.func"]} in %module : (!transform.any_op) -> !transform.op<"func.func">
    %mmt4d = transform.structured.match ops{["linalg.mmt4d"]} in %module : (!transform.any_op) -> !transform.any_op

    // Step 1: Tile
    // Tile parallel dims
    %tiled_linalg_op_p, %loops:4 = transform.structured.tile_using_for %mmt4d tile_sizes [1, 1, 0, 4, [4], 0]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">)
    // Tile reduction dims
    %tiled_linalg_op_r, %loops2:2 = transform.structured.tile_using_for %tiled_linalg_op_p tile_sizes [0, 0, 1, 0, 0, 8]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">)

    // Step 2: Vectorize
    transform.structured.vectorize %tiled_linalg_op_r vector_sizes [1, 1, 1, 4, [4], 8] : !transform.any_op

    transform.yield
  }
}

NOTE: You will need to apply this diff to unblock "scalable" vectorization of linalg.mmt4d:

diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index ff28bd7c4834..3a34dfa06c67 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -2465,6 +2465,7 @@ vectorizeScalableVectorPrecondition(Operation *op,
   // presence of scalable vectors
   return success(isElementwise(linalgOp) || isa<linalg::MatmulOp>(op) ||
                  isa<linalg::MatmulTransposeAOp>(op) ||
+                 isa<linalg::Mmt4DOp>(op) ||
                  isa<linalg::DepthwiseConv1DNwcWcOp>(op) ||
                  isa<linalg::MatvecOp>(op) || hasReductionIterator(linalgOp));
 }

Is Code-Gen Easier Without Masking? Yes!

Without masks, we are left with this vector.transfer_read:

   %read = vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8>

Under the assumption that the input came from linalg.pack, we know that all accesses are "in-bounds" and can "flatten" this vector.transfer_read post-bufferisation (see: #143146).

Put differently, the assumptions that let us avoid masks also make this vector.transfer_read much easier to deal with.
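
As a rough sketch (assuming bufferisation yields a contiguous buffer; shapes are illustrative), the flattened form would collapse all dims and read a single 1-D scalable vector (4 x vscale rows of 8 bytes = 32 x vscale elements):

   %collapsed = memref.collapse_shape %buffer [[0, 1, 2, 3]]
       : memref<1x1x?x8xi8> into memref<?xi8>
   %flat = vector.transfer_read %collapsed[%c0], %c0_i8 {in_bounds = [true]}
       : memref<?xi8>, vector<[32]xi8>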

How To Avoid Masking?

There are ~2 options:

  • Option 1: Assume that the inputs to linalg.mmt4d are multiples of "native-vector-sizes".
    • With this knowledge (like in our case), it is safe to conclude that masks are not required.
    • To me this feels like a realistic assumption.
    • Note, we will not be able to verify this at compile-time.
  • Option 2: Allow disabling masking through a bool (see the sketch after this list).
    • This could look like: transform.structured.vectorize %mmt4d_main vector_sizes [1, 1, 1, 4, [4], 8] {disable_masking} : !transform.any_op
    • Ultimately, this is a "short-cut" and we should somehow restrict it to avoid it being exploited excessively.
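
To make Option 2 concrete, here it is in the context of the first reproducer (a sketch; disable_masking is hypothetical and not implemented today):

   // Hypothetical unit attribute - everything else matches the reproducer above.
   transform.structured.vectorize %tiled_linalg_op_r vector_sizes [1, 1, 1, 4, [4], 8] {disable_masking}
     : !transform.any_op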

What Are The Alternatives? (when masks are present)

Alternative 1

Flatten the mask so that vector.transfer_read can also be flattened:

   %6 = vector.create_mask %c1, %c1, %3, %c8 : vector<1x1x[4]x8xi1>
   %7 = vector.mask %6 { vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8> } : vector<1x1x[4]x8xi1> -> vector<1x1x1x4x[4]x8xi8>

This will require a bit of vector "packing" + "unpacking". In principle, the original 2D mask could look like this:

**** **XX
**** **XX
**** **XX
XXXX XXXX

The flattened 1D mask would look like this:

**** **XX **** **XX **** **XX XXXX XXXX

This should be achievable with vector.scalable.insert, but it will not be pretty if we want full generality :) Importantly, this would introduce a loop over the scalable dimension inside a "hot" kernel/loop - not great for performance.
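
For reference, vector.scalable.insert only takes a static position (a sketch; %chunk and %acc are hypothetical values), which is part of what makes the general, runtime-indexed case awkward:

   // Insert a fixed 8-element chunk at static offset 8 of a scalable vector;
   // a runtime offset would require unrolling or a different mechanism.
   %v = vector.scalable.insert %chunk, %acc[8] : vector<8xi1> into vector<[8]xi1>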

Also, in practice it should not be needed if we know that the underlying mask is all-true.
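
For that all-true case, existing folds should already help (a sketch; this relies on vector.create_mask folding when its bounds provably cover the whole vector):

   // In the packed case %3 == %c4_vscale, so the mask covers the full vector:
   %mask = vector.create_mask %c1, %c1, %c4_vscale, %c8 : vector<1x1x[4]x8xi1>
   // ... which should fold to an all-true vector.constant_mask, at which point
   // vector.mask { vector.transfer_read ... } folds to a plain vector.transfer_read.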

Alternative 2

Below is a modified example that peels the loop containing the scalable sizes. Note:

  • The main loop contains no masks (post-canonicalization). That's great and easy to handle!
  • The remainder loop, had it been vectorized, would contain masks. But we know that we will never "hit" that code (due to packing), so we could lower it to scalar code and skip vectorization there (see the sketch below).
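
The peeling arithmetic itself is simple (a sketch matching #map and %2 in the output below):

   // Split point: the largest multiple of the step that does not exceed the bound.
   %split = affine.apply affine_map<()[s0, s1] -> (s0 - s0 mod s1)>()[%dim_0, %c4_vscale]
   // Main loop:      [0, %split)       step %c4_vscale - full tiles only, no masks.
   // Remainder loop: [%split, %dim_0)  step %c4_vscale - at most one partial tile.
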
Expand To See MLIR Peeling + Vectorization Output
#map = affine_map<()[s0, s1] -> (s0 - s0 mod s1)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, 0, d1, d2, 0, d3)>
#map2 = affine_map<(d0, d1, d2, d3) -> (0, d0, d1, 0, d2, d3)>
#map3 = affine_map<(d0)[s0] -> (-d0 + s0)>
module {
  func.func @main(%arg0: tensor<2x2x4x8xi8>, %arg1: tensor<?x2x?x8xi8>, %arg2: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
    %c0_i32 = arith.constant 0 : i32
    %c0_i8 = arith.constant 0 : i8
    %c4 = arith.constant 4 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x2x?x8xi8>
    %dim_0 = tensor.dim %arg1, %c2 : tensor<?x2x?x8xi8>
    %vscale = vector.vscale
    %c4_vscale = arith.muli %vscale, %c4 : index
    %0 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %arg2) -> (tensor<2x?x4x?xi32>) {
      %1 = scf.for %arg5 = %c0 to %dim step %c1 iter_args(%arg6 = %arg4) -> (tensor<2x?x4x?xi32>) {
        %2 = affine.apply #map()[%dim_0, %c4_vscale]
        %3 = scf.for %arg7 = %c0 to %2 step %c4_vscale iter_args(%arg8 = %arg6) -> (tensor<2x?x4x?xi32>) {
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %c4_vscale, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %5 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %c4_vscale, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %6 = vector.transfer_read %extracted_slice_3[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map1} : tensor<1x1x4x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %7 = vector.transfer_read %extracted_slice_4[%c0, %c0, %c0, %c0], %c0_i8 {in_bounds = [true, true, true, true, true, true], permutation_map = #map2} : tensor<1x1x?x8xi8>, vector<1x1x1x4x[4]x8xi8>
            %8 = vector.transfer_read %extracted_slice_5[%c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true, true, true]} : tensor<1x1x4x?xi32>, vector<1x1x4x[4]xi32>
            %9 = arith.extsi %6 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %10 = arith.extsi %7 : vector<1x1x1x4x[4]x8xi8> to vector<1x1x1x4x[4]x8xi32>
            %11 = arith.muli %9, %10 : vector<1x1x1x4x[4]x8xi32>
            %12 = vector.multi_reduction <add>, %11, %8 [2, 5] : vector<1x1x1x4x[4]x8xi32> to vector<1x1x4x[4]xi32>
            %13 = vector.transfer_write %12, %extracted_slice_5[%c0, %c0, %c0, %c0] {in_bounds = [true, true, true, true]} : vector<1x1x4x[4]xi32>, tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %13 into %arg10[0, 0, 0, 0] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %5 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %c4_vscale] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        %4 = scf.for %arg7 = %2 to %dim_0 step %c4_vscale iter_args(%arg8 = %3) -> (tensor<2x?x4x?xi32>) {
          %5 = affine.apply #map3(%arg7)[%dim_0]
          %extracted_slice = tensor.extract_slice %arg0[%arg3, 0, 0, 0] [1, 2, 4, 8] [1, 1, 1, 1] : tensor<2x2x4x8xi8> to tensor<1x2x4x8xi8>
          %extracted_slice_1 = tensor.extract_slice %arg1[%arg5, 0, %arg7, 0] [1, 2, %5, 8] [1, 1, 1, 1] : tensor<?x2x?x8xi8> to tensor<1x2x?x8xi8>
          %extracted_slice_2 = tensor.extract_slice %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<2x?x4x?xi32> to tensor<1x1x4x?xi32>
          %6 = scf.for %arg9 = %c0 to %c2 step %c1 iter_args(%arg10 = %extracted_slice_2) -> (tensor<1x1x4x?xi32>) {
            %extracted_slice_3 = tensor.extract_slice %extracted_slice[0, %arg9, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x2x4x8xi8> to tensor<1x1x4x8xi8>
            %extracted_slice_4 = tensor.extract_slice %extracted_slice_1[0, %arg9, 0, 0] [1, 1, %5, 8] [1, 1, 1, 1] : tensor<1x2x?x8xi8> to tensor<1x1x?x8xi8>
            %extracted_slice_5 = tensor.extract_slice %arg10[0, 0, 0, 0] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> to tensor<1x1x4x?xi32>
            %7 = linalg.mmt4d ins(%extracted_slice_3, %extracted_slice_4 : tensor<1x1x4x8xi8>, tensor<1x1x?x8xi8>) outs(%extracted_slice_5 : tensor<1x1x4x?xi32>) -> tensor<1x1x4x?xi32>
            %inserted_slice_6 = tensor.insert_slice %7 into %arg10[0, 0, 0, 0] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<1x1x4x?xi32>
            scf.yield %inserted_slice_6 : tensor<1x1x4x?xi32>
          }
          %inserted_slice = tensor.insert_slice %6 into %arg8[%arg3, %arg5, 0, %arg7] [1, 1, 4, %5] [1, 1, 1, 1] : tensor<1x1x4x?xi32> into tensor<2x?x4x?xi32>
          scf.yield %inserted_slice : tensor<2x?x4x?xi32>
        }
        scf.yield %4 : tensor<2x?x4x?xi32>
      }
      scf.yield %1 : tensor<2x?x4x?xi32>
    }
    return %0 : tensor<2x?x4x?xi32>
  }
}
Expand For A Full MLIR Reproducer
func.func @main(%lhs: tensor<2x2x4x8xi8>, %rhs:tensor<?x2x?x8xi8>, %acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32> {
  %out = linalg.mmt4d ins(%lhs, %rhs: tensor<2x2x4x8xi8>, tensor<?x2x?x8xi8>) outs(%acc: tensor<2x?x4x?xi32>) -> tensor<2x?x4x?xi32>

  return %out : tensor<2x?x4x?xi32>
}

module @transforms attributes { transform.with_named_sequence } {
  transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
    %func = transform.structured.match ops{["func.func"]} in %module : (!transform.any_op) -> !transform.op<"func.func">
    %mmt4d = transform.structured.match ops{["linalg.mmt4d"]} in %module : (!transform.any_op) -> !transform.any_op

    // Step 1: Tile
    // Tile parallel dims
    %tiled_linalg_op_p, %loops:4 = transform.structured.tile_using_for %mmt4d tile_sizes [1, 1, 0, 4, [4], 0]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.op<"scf.for">)
    // Tile reduction dims
    %tiled_linalg_op_r, %loops2:2 = transform.structured.tile_using_for %tiled_linalg_op_p tile_sizes [0, 0, 1, 0, 0, 8]
     : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">)

    // 2. Loop peeling (only the middle dimension)
    %main_loop, %remainder_loop = transform.loop.peel %loops#3 : (!transform.op<"scf.for">) -> (!transform.op<"scf.for">, !transform.op<"scf.for">)

    // Step 3: Vectorize
    %mmt4d_main = transform.structured.match ops{["linalg.mmt4d"]} in %main_loop : (!transform.op<"scf.for">) -> !transform.any_op
    transform.structured.vectorize %mmt4d_main vector_sizes [1, 1, 1, 4, [4], 8] : !transform.any_op

    transform.yield
  }
}

Other Questions

Question 1: Is it safe to assume that the input to linalg.mmt4d comes from linalg.pack + linalg.unpack?

  • AFAIK, linalg.mmt4d is never used as a standalone operation.
  • If this assumption is true, then we should document it.

Question 2: Wouldn't it be impractical to use sizes that are not multiples of the "native-vector-size" in linalg.mmt4d inputs?

  • How can we capture this? Via attributes?
  • While "impractical", strictly requiring multiples of the "native-vector-size" would be prohibitively restrictive.

Next Steps

While we do want to look into Alternative 1 and Alternative 2, we would like to leave them as TODOs for the very near future. In the meantime, adding a bool to disable masking (this could be specific to linalg.mmt4d) would unblock some other work for us.

CC @dcaballe
