
[GPU] Add MLP test and linalg.fill lowering in 'linalg-to-xegpu' #220

Merged: 28 commits into main from longsheng/mlp-xegpu, Sep 11, 2024

Conversation

LongshengDu
Contributor

@LongshengDu LongshengDu commented Aug 7, 2024

This PR aims to demonstrate more complex workloads on GPU. We want to incorporate multi-tiling for larger matmuls (3 nested tiling loops), with pre-fusion (fill) and post-fusion (add/relu) to better exhibit real-life workloads.

Added linalg.fill lowering support to 'linalg-to-xegpu'. However, we do not want to keep expanding 'linalg-to-xegpu' in the future; other demos may need to check the supported dtypes/ops in 'linalg-to-xegpu'.
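
For illustration, here is a minimal numpy sketch of the loop structure we are aiming for (names and tile sizes are hypothetical, and dimensions are assumed to divide the tiles evenly; the real pipeline does this at the IR level via the tiling and fusion passes):

import numpy as np

def tiled_mlp_layer(A, B, bias, MT=16, NT=16, KT=16):
    M, K = A.shape
    _, N = B.shape
    C = np.empty((M, N), dtype=A.dtype)
    for m in range(0, M, MT):            # tiling loop 1: output rows
        for n in range(0, N, NT):        # tiling loop 2: output columns
            acc = np.zeros((MT, NT), dtype=np.float32)   # pre-fusion: fill
            for k in range(0, K, KT):    # tiling loop 3: reduction axis
                acc += A[m:m+MT, k:k+KT].astype(np.float32) @ \
                       B[k:k+KT, n:n+NT].astype(np.float32)
            # post-fusion: bias add + relu on the finished tile
            C[m:m+MT, n:n+NT] = np.maximum(acc + bias[m:m+MT, n:n+NT], 0)
    return C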

Depends on #201
Tracking #219

dchigarev and others added 14 commits July 31, 2024 15:34
@LongshengDu LongshengDu added WIP work in progress test labels Aug 7, 2024
Longsheng Du added 2 commits August 7, 2024 15:14

// CHECK: Unranked Memref base@{{(0x)?[-0-9a-fA-F]*}}
// CHECK-SAME: rank = 1 offset = 0 sizes = [32] strides = [4096] data =
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
Contributor

@dchigarev dchigarev Aug 7, 2024


Suggested change
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625]

Computing this with numpy gives a different result:

import numpy as np

# Inputs mirror the test's constant fills.
arg0 = np.ones(shape=(32, 4096), dtype="float16")    # activations
arg1 = np.ones(shape=(4096, 4096), dtype="float16")  # layer-1 weights
arg2 = np.ones(shape=(32, 4096), dtype="float16")    # layer-1 bias
arg3 = np.ones(shape=(4096, 4096), dtype="float16")  # layer-2 weights
arg4 = np.ones(shape=(32, 4096), dtype="float16")    # layer-2 bias
arg0[:] = 0.01
arg1[:] = 0.01
arg2[:] = 0.02
arg3[:] = 0.01
arg4[:] = 0.02
p2 = np.dot(arg0, arg1)                              # layer-1 matmul
p4 = arg2 + p2                                       # bias add
p5 = np.zeros(shape=(32, 4096), dtype="float16")
p7 = np.maximum(p5, p4)                              # relu
p10 = np.dot(p7, arg3)                               # layer-2 matmul
p12 = arg4 + p10                                     # bias add
print(p12)
# array([[17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
#        [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
#        [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
#        ...,
#        [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
#        [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
#        [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62]], dtype=float16)

P.S. Running this test with this and this fixed (and with 2D tiling sizes, default-tile-size=matmul:{16,16}) makes the test produce a result equivalent to numpy's.

Contributor Author

@LongshengDu LongshengDu Aug 8, 2024


The temporary reference data is produced by gc-cpu-pipeline; I will replace it with accurate values. Currently, bf16/f16 matmul in gc-cpu-pipeline uses a naive lowering for matmul (it does not reduce in f32 and then cast to bf16/f16), so the value is incorrect.
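
A small numpy sketch of the precision difference (the constants mirror the test's fills; this is an illustration, not the pipeline's code):

import numpy as np

k = 4096
a = np.full(k, 0.01, dtype=np.float16)
b = np.full(k, 0.01, dtype=np.float16)

# Naive lowering: keep the running sum in f16. Once the sum's spacing (ulp)
# exceeds twice the addend, further products round away and the sum stalls.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# Accurate lowering: reduce in f32, then cast the final result to f16.
acc32 = np.float16(np.dot(a.astype(np.float32), b.astype(np.float32)))

print(acc16, acc32)  # the f16 running sum stalls well below the f32 result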

Another problem I encountered: larger matmuls require k-tiling, but the current tiling of the reduction axis does not add a reduce op to compensate, which leads to correctness issues as well. I will avoid k-tiling and only use 2D tiling sizes for now.
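
A numpy sketch of that hazard (shapes are illustrative):

import numpy as np

A = np.full((32, 64), 0.01, dtype=np.float32)
B = np.full((64, 32), 0.01, dtype=np.float32)
KT = 16  # k-tile size

broken = np.zeros((32, 32), dtype=np.float32)
correct = np.zeros((32, 32), dtype=np.float32)
for k in range(0, 64, KT):
    partial = A[:, k:k+KT] @ B[k:k+KT, :]
    broken = partial    # overwrite: tiling the reduction without a reduce op
    correct += partial  # accumulate: what a compensating reduce op restores

print(broken[0, 0], correct[0, 0])  # ~0.0016 vs ~0.0064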

Comment on lines 30 to 33
%15 = linalg.max ins(%13, %12 : tensor<32x4096xf16>, tensor<32x4096xf16>)
outs(%14 : tensor<32x4096xf16>) -> tensor<32x4096xf16>

return %15 : tensor<32x4096xf16>
Contributor


The insert-gpu-allocs pass from IMEX seems to misbehave if we allocate and return a buffer in a function that contains gpu.launch (linalg_mlp in our case). For me it crashes the program at the end. It works fine, though, if we pass the final output buffer into linalg_mlp, e.g.:

func.func @main() {
    %out = tensor.empty() : <...>

    func.call @linalg_mlp(..., %out)
}

If you haven't figured out a fix for this yet, we may simply stick to the 'pass output buffer to the func' model as it should be enough for our testing. We won't use insert-gpu-allocs in our final pipeline anyway.

Contributor Author

@LongshengDu LongshengDu Aug 8, 2024


Thanks a lot! I encountered this problem and haven't figured out why. I will try to use your suggestion.

@LongshengDu LongshengDu added ready to review and removed WIP work in progress labels Aug 12, 2024
@@ -0,0 +1,75 @@
// RUN: gc-opt %s --pass-pipeline='builtin.module(func.func(iterative-tiling-and-fusion{use-cost-model=0 default-tile-size=matmul:{16,16}}),eliminate-empty-tensors,empty-tensor-to-alloc-tensor,one-shot-bufferize{bufferize-function-boundaries=1 function-boundary-type-conversion=identity-layout-map},drop-equivalent-buffer-results,func.func(finalizing-bufferize),canonicalize,cse,drop-equivalent-buffer-results,expand-realloc,canonicalize,ownership-based-buffer-deallocation,canonicalize,buffer-deallocation-simplification,bufferization-lower-deallocations,cse,canonicalize,convert-bufferization-to-memref,func.func(scf-forall-to-parallel),func.func(linalg-to-xegpu{stages=1 dpas-tile=8,16,16 k-tile=16}),xegpu-fold-alias-ops,func.func(convert-linalg-to-parallel-loops),func.func(gpu-map-parallel-loops),func.func(convert-parallel-loops-to-gpu),func.func(insert-gpu-allocs),gpu-kernel-outlining,canonicalize,set-spirv-capabilities{client-api=opencl},gpu.module(set-spirv-abi-attrs{client-api=opencl}),lower-affine,imex-vector-linearize,gpu.module(convert-xegpu-to-vc),reconcile-unrealized-casts,bf16-to-gpu,gpu.module(convert-func-to-spirv),gpu.module(convert-vector-to-spirv),imex-convert-gpu-to-spirv,spirv.module(spirv-lower-abi-attrs,spirv-update-vce),func.func(llvm-request-c-wrappers),serialize-spirv,convert-vector-to-scf,convert-gpu-to-gpux,convert-scf-to-cf,convert-cf-to-llvm,convert-vector-to-llvm,convert-index-to-llvm,convert-arith-to-llvm,convert-func-to-llvm,convert-math-to-llvm,convert-gpux-to-llvm,convert-index-to-llvm,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \
Contributor


Does this test pass on your machine? For me it fails with the following error:

incorrect lowering for 'linalg.fill'?
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
    %4 = linalg.add ins(%arg2, %2 : tensor<32x4096xf16>, tensor<32x4096xf16>) 
         ^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
    %12 = linalg.add ins(%arg4, %10 : tensor<32x4096xf16>, tensor<32x4096xf16>) 
          ^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()

If I remove all linalg.fill ops from the test, it then fails with another error caused by double deallocations added by the insert-gpu-allocs pass. This can be fixed with this patch to IMEX: Menooker/mlir-extensions#3 (have you applied this patch to your IMEX build? If so, we should probably merge it and update the IMEX version).

free() problem
0.      Program arguments: /home/jovyan/graph-compiler/build/bin/gc-cpu-runner -e main --entry-point-result=void --shared-libs=/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_runner_utils.so,/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_c_runner_utils.so,/home/jovyan/graph-compiler/build/lib/libGcOpenclRuntime.so
 #0 0x0000562ee351c2a0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b82a0)
 #1 0x0000562ee35193af llvm::sys::RunSignalHandlers() (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b53af)
 #2 0x0000562ee3519505 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f2d1e2716ac (/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so+0x5436ac)
 #4 0x00007f2d544cf520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f2d545323fe __libc_free (/lib/x86_64-linux-gnu/libc.so.6+0xa53fe)
 #6 0x00007f2d54a097aa 
 #7 0x00007f2d54a0a09b 
 #8 0x00007f2d54a0a441 
 #9 0x0000562ee3ad5a0c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#10 0x0000562ee3ad5ead compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#11 0x0000562ee3ad7473 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x973473)
#12 0x0000562ee34546c0 std::vector<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>, std::allocator<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>>>::~vector() /usr/include/c++/11/bits/stl_vector.h:680:15
#13 0x0000562ee34546c0 mlir::DialectRegistry::~DialectRegistry() /home/jovyan/llvm/llvm-gc-master-patches-install/include/mlir/IR/DialectRegistry.h:139:7
#14 0x0000562ee34546c0 main /home/jovyan/graph-compiler/src/gc-cpu-runner/gc-cpu-runner.cpp:46:1
#15 0x00007f2d544b6d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#16 0x00007f2d544b6e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#17 0x0000562ee3505195 _start (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3a1195)

After removing linalg.fill and applying the patch above to IMEX, the test passes for me.

@LongshengDu LongshengDu linked an issue Aug 12, 2024 that may be closed by this pull request
%7 = linalg.max ins(%5, %4 : tensor<32x4096xf16>, tensor<32x4096xf16>)
outs(%6 : tensor<32x4096xf16>) -> tensor<32x4096xf16>

%8 = tensor.empty() : tensor<32x4096xf16>
Contributor


Do you use it anywhere?

Suggested change
%8 = tensor.empty() : tensor<32x4096xf16>

Contributor Author


Fixed.

Comment on lines 44 to 47
%0 = tensor.generate {
^bb0(%i : index, %j : index):
tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
Contributor


Why not?

Suggested change
%0 = tensor.generate {
^bb0(%i : index, %j : index):
tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
%0 = arith.constant dense<%cst0> : tensor<32x4096xf16>

Contributor Author


Used to generate more "random" data; I will check whether it is still needed.

@LongshengDu LongshengDu added WIP work in progress and removed ready to review labels Aug 12, 2024
@lmontigny lmontigny linked an issue Aug 20, 2024 that may be closed by this pull request
@LongshengDu
Contributor Author

There is a bug inside IMEX for store_nd of a constant vector. The code produces a 2D tile in the gpu kernel args and reuses the llvm.genx.raw.sends2.noresult. functions that have the same name as the 1D-tile ones, thus producing an argument type mismatch error.
However, the IMEX fix requires dealing with how the gpu kernel is generated (since vector.shape_cast is illegal in the convert-xegpu-to-vc pass, the cast cannot be added there). After spending a long time on fixes within IMEX, I opted for the simpler option of just adding a vector.shape_cast when lowering linalg.fill.
Note that store_nd for constants is still error-prone inside IMEX, but we do not need to spend too much time on IMEX.

@LongshengDu LongshengDu added ready to review and removed WIP work in progress labels Sep 6, 2024
@LongshengDu LongshengDu requested a review from Menooker September 6, 2024 17:40
@@ -32,7 +32,7 @@ jobs:

- uses: actions/checkout@v4
with:
repository: Menooker/mlir-extensions
repository: LongshengDu/mlir-extensions
Contributor


We need something like a staging branch in IMEX, since most of the changes should land in main anyway. cc @Garra1980

Contributor


I think anyone (with our access rights) can create a branch in the IMEX repo. May we simply create a GC-dev branch in intel/mlir-extensions and that would be it?

Contributor


That's exactly what I'm suggesting.

Contributor

@dchigarev dchigarev Sep 10, 2024


I think anyone (with our access rights) can create a branch in the IMEX repo

Just tried to do so, and I think we don't have enough rights for this :)
Need some help from @Garra1980


cmake/imex.cmake Outdated
gc_fetch_content(imex "${IMEX_HASH}" https://github.com/Menooker/mlir-extensions
gc_fetch_content(imex "${IMEX_HASH}" https://github.com/LongshengDu/mlir-extensions
Contributor


May we merge the fix to https://github.com/Menooker/mlir-extensions/tree/dev to keep things consistent? We already have some of our patches there.

Contributor


Sure, @Menooker, @LongshengDu, could you please apply the patch?

@LongshengDu LongshengDu merged commit 748b698 into main Sep 11, 2024
8 checks passed
@LongshengDu LongshengDu deleted the longsheng/mlp-xegpu branch September 11, 2024 08:39

Successfully merging this pull request may close these issues:

Add tests for more complex MLP workload on GPU
Make linalg->xegpu->gpu_exe pipeline working