Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing support for semi-affine modulo operations in Affine dialect #9279

Closed
dcaballe opened this issue Jun 2, 2022 · 3 comments
Closed
Labels
bug 🐞 Something isn't working codegen/llvm LLVM code generation compiler backend codegen Shared code generation infrastructure and dialects help wanted Extra attention is needed

Comments

@dcaballe
Copy link
Contributor

dcaballe commented Jun 2, 2022

After fixing 9244, the peeling transformation leads to semi-affine modulo operations (non-constant RHS) that are not already supported in Affine and can't be lowered to the Arithmetic dialect.

Reproducer:

func.func @simple_mul_dispatch_0() {
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : memref<4xf32>
  memref.assume_alignment %0, 64 : memref<4xf32>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : memref<4xf32>
  memref.assume_alignment %1, 64 : memref<4xf32>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c0) alignment(64) : memref<4xf32>
  memref.assume_alignment %2, 64 : memref<4xf32>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_x]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_x]
  %5 = affine.apply affine_map<()[s0, s1] -> (-((s0 * -4 + 4) mod (s1 * 4)) + 4)>()[%workgroup_id_x, %workgroup_count_x]
  cf.br ^bb1(%3 : index)
^bb1(%6: index):  // 2 preds: ^bb0, ^bb2
  %7 = arith.cmpi slt, %6, %5 : index
  cf.cond_br %7, ^bb2, ^bb3(%5 : index)
^bb2:  // pred: ^bb1
  %8 = memref.subview %2[%6] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %9 = memref.subview %0[%6] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %10 = memref.subview %1[%6] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %11 = vector.load %9[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %12 = vector.load %10[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %13 = arith.mulf %11, %12 : vector<4xf32>
  vector.store %13, %8[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %14 = arith.addi %6, %4 : index
  cf.br ^bb1(%14 : index)
^bb3(%15: index):  // 2 preds: ^bb1, ^bb4
  %16 = arith.cmpi slt, %15, %c4 : index
  cf.cond_br %16, ^bb4, ^bb5
^bb4:  // pred: ^bb3
  %17 = memref.subview %2[%15] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %18 = memref.subview %0[%15] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %19 = memref.subview %1[%15] [4] [1] : memref<4xf32> to memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
  %20 = vector.load %18[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %21 = vector.load %19[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %22 = arith.mulf %20, %21 : vector<4xf32>
  vector.store %22, %17[%c0] : memref<4xf32, affine_map<(d0)[s0] -> (d0 + s0)>>, vector<4xf32>
  %23 = arith.addi %15, %4 : index
  cf.br ^bb3(%23 : index)
^bb5:  // pred: ^bb3
  return
}
iree-opt semi-affine-mod.mlir -lower-affine
semi-affine-mod.mlir:14:8: error: semi-affine expressions (modulo by non-const) are not supported                                                                                                                                         
  %5 = affine.apply affine_map<()[s0, s1] -> (-((s0 * -4 + 4) mod (s1 * 4)) + 4)>()[%workgroup_id_x, %workgroup_count_x

Same problematic affine map as in 9244. I guess we would need to add support for them but, based on the documentation (https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Affine/Utils/Utils.cpp#L65), I'm not sure what should happen for potential non-constant negative RHSs.

I also wonder how is that peeling is hitting this limitation now. Shouldn't any dynamic shape scenario lead to semi-affine modulo ops?

@matthias-springer, @nicolasvasilache, @hanhanW, @MaheshRavishankar

@dcaballe dcaballe added bug 🐞 Something isn't working help wanted Extra attention is needed codegen Shared code generation infrastructure and dialects codegen/llvm LLVM code generation compiler backend labels Jun 2, 2022
@hanhanW
Copy link
Contributor

hanhanW commented Jun 2, 2022

I feel that this is the issue about workgroup_count = 0 again... They are not compilation known for some cases (e.g., dynamic shapes, peeling, etc). In this case, they are set to zeros. If we try to simplify affine ops using zeros, it causes issues.

@dcaballe it looks like peeling on distributed loops needs more fixes. Would you like to try applying the transform on second level of tiling?

I'm curious why we want to peel on distributed loops. It makes more sense to me if we apply it in the second level of tiling. I feel that it also makes the work of each thread more balance. Is there a strong reason why we want to apply peeling on distributed loops?

@dcaballe
Copy link
Contributor Author

dcaballe commented Jun 2, 2022

We shouldn't be peeling distributed loops! Let me take a look at what is going on here but after talking to Nicolas, this seems to be a well-known limitation we will have to deal with for dynamic shapes at the very least.

@dcaballe
Copy link
Contributor Author

dcaballe commented Jun 3, 2022

I restricted the loops to peel a bit more (#9310). The problem happened because we may tile loops and then immediately remove them if they have a single iteration so the loop nest structure after tiling can't be inferred from the tile sizes itself. I hit a similar issue in the past and things can get very messy. IMO, we should be doing this kind of loop optimizations at the end and not in the middle of the strategy because we lose information when we remove loops. I hope the Transform dialect can give more flexibility here.

In any case, we can close this issue for now although we may need to fix it when we exercise dynamic shapes a bit more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working codegen/llvm LLVM code generation compiler backend codegen Shared code generation infrastructure and dialects help wanted Extra attention is needed
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants