[GPU] Add MLP test and linalg.fill lowering in 'linalg-to-xegpu' #220
Conversation
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
// CHECK: Unranked Memref base@{{(0x)?[-0-9a-fA-F]*}}
// CHECK-SAME: rank = 1 offset = 0 sizes = [32] strides = [4096] data =
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625]
Computing this with numpy gives a different result:
import numpy as np
# arg0: input activation; arg1/arg3: layer weights; arg2/arg4: biases
arg0 = np.ones(shape=(32, 4096), dtype="float16")
arg1 = np.ones(shape=(4096, 4096), dtype="float16")
arg2 = np.ones(shape=(32, 4096), dtype="float16")
arg3 = np.ones(shape=(4096, 4096), dtype="float16")
arg4 = np.ones(shape=(32, 4096), dtype="float16")
arg0[:] = 0.01
arg1[:] = 0.01
arg2[:] = 0.02
arg3[:] = 0.01
arg4[:] = 0.02
# first layer: matmul + bias + relu
p2 = np.dot(arg0, arg1)
p4 = arg2 + p2
p5 = np.zeros(shape=(32, 4096), dtype="float16")
p7 = np.maximum(p5, p4)
# second layer: matmul + bias
p10 = np.dot(p7, arg3)
p12 = arg4 + p10
print(p12)
# array([[17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# ...,
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62]], dtype=float16)
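A quick sanity check (ignoring f16 rounding) agrees with numpy: the first layer gives 4096 * 0.01 * 0.01 + 0.02 = 0.4296 per element, ReLU leaves that unchanged, and the second layer gives 4096 * 0.4296 * 0.01 + 0.02 ≈ 17.62.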
P.S. Running this test with this and this fixed (and with 2D tiling sizes, default-tile-size=matmul:{16,16}) makes the test produce a result equivalent to numpy's.
The temporary ref data is produced by gc-cpu-pipeline; I will replace it with accurate values. Currently the bf16/f16 matmul in gc-cpu-pipeline uses a naive lowering (it does not reduce in f32 and then cast to bf16/f16), so the value is incorrect.
Another problem I encountered: larger matmuls require k-tiling, but the current tiling of the reduction axis does not add a reduce op to compensate, which also leads to correctness issues. I will avoid k-tiling for now (only use 2D tiling sizes).
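Not from the original comment, but a minimal numpy sketch (shapes and fill values chosen arbitrarily, not taken from the pipeline) of why keeping the matmul reduction in f16 drifts from the reduce-in-f32-then-cast reference:

import numpy as np
a = np.full((4, 4096), 0.01, dtype="float16")
b = np.full((4096, 4), 0.01, dtype="float16")
# reference: reduce in f32, cast the final result to f16
ref = (a.astype("float32") @ b.astype("float32")).astype("float16")
# naive lowering: the running sum itself is kept in f16
acc = np.zeros((4, 4), dtype="float16")
for k in range(a.shape[1]):
    acc += a[:, k:k+1] * b[k:k+1, :]
print(ref[0, 0], acc[0, 0])  # the values diverge because each partial sum is rounded to f16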
%15 = linalg.max ins(%13, %12 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                 outs(%14 : tensor<32x4096xf16>) -> tensor<32x4096xf16>

return %15 : tensor<32x4096xf16>
The insert-gpu-allocs pass from IMEX seems to go crazy if we allocate & return a buffer in a function that contains gpu.launch (linalg_mlp in our case). For me it crashes the program at the end. It works fine, though, if we pass the final output buffer to linalg_mlp, e.g.:
func.func @main() {
  %out = tensor.empty() : <...>
  func.call @linalg_mlp(..., %out)
}
If you haven't figured out a fix for this yet, we may simply stick to the 'pass the output buffer to the func' model, since it should be enough for our testing. We won't use insert-gpu-allocs in our final pipeline anyway.
Thanks a lot! I encountered this problem and haven't figured out why. I will try to use your suggestion.
@@ -0,0 +1,75 @@
// RUN: gc-opt %s --pass-pipeline='builtin.module(func.func(iterative-tiling-and-fusion{use-cost-model=0 default-tile-size=matmul:{16,16}}),eliminate-empty-tensors,empty-tensor-to-alloc-tensor,one-shot-bufferize{bufferize-function-boundaries=1 function-boundary-type-conversion=identity-layout-map},drop-equivalent-buffer-results,func.func(finalizing-bufferize),canonicalize,cse,drop-equivalent-buffer-results,expand-realloc,canonicalize,ownership-based-buffer-deallocation,canonicalize,buffer-deallocation-simplification,bufferization-lower-deallocations,cse,canonicalize,convert-bufferization-to-memref,func.func(scf-forall-to-parallel),func.func(linalg-to-xegpu{stages=1 dpas-tile=8,16,16 k-tile=16}),xegpu-fold-alias-ops,func.func(convert-linalg-to-parallel-loops),func.func(gpu-map-parallel-loops),func.func(convert-parallel-loops-to-gpu),func.func(insert-gpu-allocs),gpu-kernel-outlining,canonicalize,set-spirv-capabilities{client-api=opencl},gpu.module(set-spirv-abi-attrs{client-api=opencl}),lower-affine,imex-vector-linearize,gpu.module(convert-xegpu-to-vc),reconcile-unrealized-casts,bf16-to-gpu,gpu.module(convert-func-to-spirv),gpu.module(convert-vector-to-spirv),imex-convert-gpu-to-spirv,spirv.module(spirv-lower-abi-attrs,spirv-update-vce),func.func(llvm-request-c-wrappers),serialize-spirv,convert-vector-to-scf,convert-gpu-to-gpux,convert-scf-to-cf,convert-cf-to-llvm,convert-vector-to-llvm,convert-index-to-llvm,convert-arith-to-llvm,convert-func-to-llvm,convert-math-to-llvm,convert-gpux-to-llvm,convert-index-to-llvm,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \ |
Does this test pass on your machine? For me it fails with the following error:
incorrect lowering for 'linalg.fill'?
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%4 = linalg.add ins(%arg2, %2 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%12 = linalg.add ins(%arg4, %10 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
If I remove all linalg.fill ops from the test, it then fails with another error caused by double deallocations added by the insert-gpu-allocs pass. That can be fixed with this patch to IMEX: Menooker/mlir-extensions#3 (have you applied this patch to your IMEX build? If so, we should probably merge it and update the IMEX version).
free() problem
0. Program arguments: /home/jovyan/graph-compiler/build/bin/gc-cpu-runner -e main --entry-point-result=void --shared-libs=/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_runner_utils.so,/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_c_runner_utils.so,/home/jovyan/graph-compiler/build/lib/libGcOpenclRuntime.so
#0 0x0000562ee351c2a0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b82a0)
#1 0x0000562ee35193af llvm::sys::RunSignalHandlers() (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b53af)
#2 0x0000562ee3519505 SignalHandler(int) Signals.cpp:0:0
#3 0x00007f2d1e2716ac (/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so+0x5436ac)
#4 0x00007f2d544cf520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#5 0x00007f2d545323fe __libc_free (/lib/x86_64-linux-gnu/libc.so.6+0xa53fe)
#6 0x00007f2d54a097aa
#7 0x00007f2d54a0a09b
#8 0x00007f2d54a0a441
#9 0x0000562ee3ad5a0c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#10 0x0000562ee3ad5ead compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#11 0x0000562ee3ad7473 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x973473)
#12 0x0000562ee34546c0 std::vector<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>, std::allocator<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>>>::~vector() /usr/include/c++/11/bits/stl_vector.h:680:15
#13 0x0000562ee34546c0 mlir::DialectRegistry::~DialectRegistry() /home/jovyan/llvm/llvm-gc-master-patches-install/include/mlir/IR/DialectRegistry.h:139:7
#14 0x0000562ee34546c0 main /home/jovyan/graph-compiler/src/gc-cpu-runner/gc-cpu-runner.cpp:46:1
#15 0x00007f2d544b6d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#16 0x00007f2d544b6e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#17 0x0000562ee3505195 _start (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3a1195)
After removing linalg.fill and applying the patch above to IMEX, the test passes for me.
%7 = linalg.max ins(%5, %4 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                outs(%6 : tensor<32x4096xf16>) -> tensor<32x4096xf16>

%8 = tensor.empty() : tensor<32x4096xf16>
Do you use it anywhere?
%8 = tensor.empty() : tensor<32x4096xf16>
fixed
%0 = tensor.generate {
^bb0(%i : index, %j : index):
  tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
Why not simply use a constant instead? Suggested change:
%0 = tensor.generate {
^bb0(%i : index, %j : index):
  tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
→
%0 = arith.constant dense<%cst0> : tensor<32x4096xf16>
It was used to generate more "random" data; I will check whether it is still needed.
There is a bug inside IMEX for storing a constant vector via store_nd: the generated code produces a 2D tile in the gpu kernel args but reuses the llvm.genx.raw.sends2.noresult. functions that have the same name as for the 1D tile, thus producing the operand type mismatch error.
.github/workflows/build-llvm.yml (outdated)
@@ -32,7 +32,7 @@ jobs:
      - uses: actions/checkout@v4
        with:
-         repository: Menooker/mlir-extensions
+         repository: LongshengDu/mlir-extensions
We need something like a staging branch in IMEX, since most of the changes should land in main anyway. cc @Garra1980
I think anyone (with our access rights) can create a branch in the IMEX repo. May we simply create a GC-dev branch in intel/mlir-extensions and that would be it?
that's exactly what I'm suggesting
I think anyone (with our access rights) can create a branch in the IMEX repo
Just tried to do so, and I think we don't have enough rights for this :)
Need some help from @Garra1980
Here's the branch https://github.com/intel/mlir-extensions/tree/gc-staging
cmake/imex.cmake (outdated)
-gc_fetch_content(imex "${IMEX_HASH}" https://github.com/Menooker/mlir-extensions
+gc_fetch_content(imex "${IMEX_HASH}" https://github.com/LongshengDu/mlir-extensions
May we merge the fix into https://github.com/Menooker/mlir-extensions/tree/dev to keep things consistent? We already have some of our patches there.
Sure, @Menooker, @LongshengDu, could you please apply the patch?
This PR aims to demonstrate more complex workloads on GPU. We want to incorporate multi-level tiling for larger matmuls (3 nested tiling loops), with pre-fusion (fill) and post-fusion (add/relu) to better exhibit real-life workloads.
Added linalg.fill lowering support to 'linalg-to-xegpu'. However, we do not want to keep expanding 'linalg-to-xegpu' in the future; other demos may need to check the supported dtypes/ops in 'linalg-to-xegpu'.
Depends on #201
Tracking #219