Support for half-precision versions of ISAAC BLAS functions #16
Comments
Very good, thanks a lot. I've left a couple of comments to improve this patch, but it already looks pretty good! :)
So it seems like the half version of value_scalar is never used in practice. Trying to use it would probably lead to undefined behavior because of all the casts and the lack of an emulation layer for printing and arithmetic operations (the operators just return the arguments). I think it would be better to only add support for half on arrays. It would considerably reduce the size of the pull request and also avoid all this undefined behavior due to the lack of an emulation layer -- one that wouldn't be used in BLAS anyway. There are also no tests for clBlasHgemm. This header-only implementation of half-precision floats could be used: http://half.sourceforge.net/. I will work on this over the weekend! PS: I have very little time to maintain ISAAC-1.0 as I'm working on ISAAC-2.0 internally; sorry for being so picky about the maintainability of the added code. There are already a lot of unused features in ISAAC (I initially wanted it to have a full expression-tree engine but slowly shifted the focus towards BLAS only), so I'd like to keep pull requests as small as possible. Once all your pull requests get through, I will officially release ISAAC-1.1; it's time for a new release :)
Sorry for the late response; I was on leave for Chinese New Year. I am working on refining the code to remove the half version of value_scalar and keep half support only on arrays, and will push the new PR soon. We have a half BLAS test suite internally for testing; it also uses http://half.sourceforge.net/ for the CPU reference.
I hope you had a great Chinese New Year! I also celebrated with hotpot (火锅) and mala xiang guo (麻辣香锅), my favorite! Thanks for the changes :)
Just a small remark: CLBlast already supports half-precision and has half-precision tests; perhaps it can be helpful for you.
@listenlink What Intel iGPUs (if any) support half-precision? On Broadwell the performance of hGEMM and hGEMV is quite bad (15% slower than sGEMM and sGEMV). It is particularly surprising for the IO-bound GEMV. I don't really see what more we could do on the code-generation side to speed things up; I'll just assume it's HW or compiler limitations (i.e., FP16-32 conversions) until I have more details. @CNugteren That's good to know :) I'll re-tune ISAAC for FP16 on AMD and NVIDIA hardware to see if the poor performance persists.
@ptillet there may be something strange on the second run of the tests [rest of comment truncated]
@ptillet From BDW onward, Intel Gen graphics fully support FP16, and ideally HFLOPS should be double the FLOPS. The IO bandwidth should remain the same. In practice, we see about 1.5x performance with the FP16 data format compared to FP32 for some compute-bound kernels such as convolution. For many GEMM cases, we should see a similar performance gain. We are working on an optimized FP16 GEMM kernel now; I will say more about FP16 GEMM performance after we finish that work.
The issue was that kernel names were colliding with each other in the cache. Since the kernel names were based on the date and time, kernels were being compiled so fast that a subsequent kernel would end up with the same name as the previous one and would therefore overwrite it in the cache. The test suite runs the same test multiple times, and subsequent runs would end up using the wrong kernel because of the collisions. It is fixed by appending a randomly generated alphanumeric string to keep the kernel names unique.
When running [convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924) Triton ends up computing a rank of a matrix with 0 columns during linear layout lowering, which trips up f2reduce, and causes undefined behavior, detectable through [UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html). Fix this by returning the rank (0) early in these cases, without calling f2reduce. <details><summary>Stack trace</summary> <p> ``` third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long' #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9 #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3 #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7 #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41 #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) 
third_party/triton/lib/Tools/LinearLayout.cpp:654:51 #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14 #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8 #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19 #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24 #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7 #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12 #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12 #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5 #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9 #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10 #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> 
third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12 #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16 #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5 #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5 #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3 #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26 #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7 #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7 #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5 #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() 
third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22 ... UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 ``` </p> </details>
This PR fixes triton-lang#1176. IGC detects the call of `__devicelib_assert_fail` and replaces it with a 'safe' implementation. However, the SYCL library contains a 'fallback' implementation of the assertion, which does not work in our setup. If we mark the function with `InternalLinkage`, the fallback implementation is inlined and IGC cannot replace it with the safe implementation. By declaring `__devicelib_assert_fail` as an external function in the SYCL library, IGC can correctly insert its implementation. The diff between the old and new `libsycl-spir64-unknown-unknown.ll` is as follows: ```diff @@ -5424,149 +5424,7 @@ declare extern_weak dso_local spir_func noundef i32 @_Z18__spirv_AtomicLoadPU3AS declare void @llvm.memcpy.p4.p1.i64(ptr addrspace(4) noalias nocapture writeonly, ptr addrspace(1) noalias nocapture readonly, i64, i1 immarg) triton-lang#16 ; Function Attrs: convergent mustprogress norecurse nounwind -define weak dso_local spir_func void @__devicelib_assert_fail(ptr addrspace(4) noundef %0, ptr addrspace(4) noundef %1, i32 noundef %2, ptr addrspace(4) noundef %3, i64 noundef %4, i64 noundef %5, i64 noundef %6, i64 noundef %7, i64 noundef %8, i64 noundef %9) local_unnamed_addr triton-lang#14 !srcloc !720 { - %11 = tail call spir_func noundef i32 @_Z29__spirv_AtomicCompareExchangePU3AS1iN5__spv5Scope4FlagENS1_19MemorySemanticsMask4FlagES5_ii(ptr addrspace(1) noundef @SPIR_AssertHappenedMem, i32 noundef 1, i32 noundef 16, i32 noundef 16, i32 noundef 1, i32 noundef 0) triton-lang#54 - %12 = icmp eq i32 %11, 0 - br i1 %12, label %13, label %92 - -13: ; preds = %10 - store i32 %2, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 4), align 8, !tbaa !721 - store i64 %4, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 5), align 8, !tbaa !722 - store i64 %5, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr
addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 6), align 8, !tbaa !723 - store i64 %6, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 7), align 8, !tbaa !724 - store i64 %7, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 8), align 8, !tbaa !725 - store i64 %8, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 9), align 8, !tbaa !726 - store i64 %9, ptr addrspace(1) getelementptr inbounds (%struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 10), align 8, !tbaa !727 - %14 = icmp eq ptr addrspace(4) %0, null - br i1 %14, label %23, label %15 - -15: ; preds = %20, %13 - %16 = phi i32 [ %22, %20 ], [ 0, %13 ] - %17 = phi ptr addrspace(4) [ %21, %20 ], [ %0, %13 ] - %18 = load i8, ptr addrspace(4) %17, align 1, !tbaa !718 - %19 = icmp eq i8 %18, 0 - br i1 %19, label %23, label %20 - -20: ; preds = %15 - %21 = getelementptr inbounds i8, ptr addrspace(4) %17, i64 1 - %22 = add nuw nsw i32 %16, 1 - br label %15, !llvm.loop !728 - -23: ; preds = %15, %13 - %24 = phi i32 [ 0, %13 ], [ %16, %15 ] - %25 = icmp eq ptr addrspace(4) %1, null - br i1 %25, label %34, label %26 - -26: ; preds = %31, %23 - %27 = phi i32 [ %33, %31 ], [ 0, %23 ] - %28 = phi ptr addrspace(4) [ %32, %31 ], [ %1, %23 ] - %29 = load i8, ptr addrspace(4) %28, align 1, !tbaa !718 - %30 = icmp eq i8 %29, 0 - br i1 %30, label %34, label %31 - -31: ; preds = %26 - %32 = getelementptr inbounds i8, ptr addrspace(4) %28, i64 1 - %33 = add nuw nsw i32 %27, 1 - br label %26, !llvm.loop !729 - -34: ; preds = %26, %23 - %35 = phi i32 [ 0, %23 ], [ %27, %26 ] - %36 = icmp eq ptr addrspace(4) %3, null - br i1 %36, label %37, label %40 - -37: ; preds = %34 - %38 = tail call i32 @llvm.umin.i32(i32 %24, i32 256) - %39 = tail call i32 @llvm.umin.i32(i32 %35, i32 256) - br label %52 - -40: ; 
preds = %45, %34 - %41 = phi i32 [ %47, %45 ], [ 0, %34 ] - %42 = phi ptr addrspace(4) [ %46, %45 ], [ %3, %34 ] - %43 = load i8, ptr addrspace(4) %42, align 1, !tbaa !718 - %44 = icmp eq i8 %43, 0 - br i1 %44, label %48, label %45 - -45: ; preds = %40 - %46 = getelementptr inbounds i8, ptr addrspace(4) %42, i64 1 - %47 = add i32 %41, 1 - br label %40, !llvm.loop !730 - -48: ; preds = %40 - %49 = tail call i32 @llvm.umin.i32(i32 %24, i32 256) - %50 = tail call i32 @llvm.umin.i32(i32 %35, i32 256) - %51 = tail call i32 @llvm.umin.i32(i32 %41, i32 128) - br label %52 - -52: ; preds = %48, %37 - %53 = phi i32 [ %39, %37 ], [ %50, %48 ] - %54 = phi i32 [ %38, %37 ], [ %49, %48 ] - %55 = phi i32 [ 0, %37 ], [ %51, %48 ] - br label %56 - -56: ; preds = %62, %52 - %57 = phi i32 [ 0, %52 ], [ %67, %62 ] - %58 = icmp ult i32 %57, %54 - br i1 %58, label %62, label %59 - -59: ; preds = %56 - %60 = zext nneg i32 %54 to i64 - %61 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 1, i64 %60 - store i8 0, ptr addrspace(1) %61, align 1, !tbaa !718 - br label %68 - -62: ; preds = %56 - %63 = sext i32 %57 to i64 - %64 = getelementptr inbounds i8, ptr addrspace(4) %0, i64 %63 - %65 = load i8, ptr addrspace(4) %64, align 1, !tbaa !718 - %66 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 1, i64 %63 - store i8 %65, ptr addrspace(1) %66, align 1, !tbaa !718 - %67 = add nuw nsw i32 %57, 1 - br label %56, !llvm.loop !731 - -68: ; preds = %74, %59 - %69 = phi i32 [ 0, %59 ], [ %79, %74 ] - %70 = icmp ult i32 %69, %53 - br i1 %70, label %74, label %71 - -71: ; preds = %68 - %72 = zext nneg i32 %53 to i64 - %73 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 2, i64 %72 - store i8 0, ptr addrspace(1) %73, align 1, !tbaa !718 - br label %80 - -74: ; preds = %68 - %75 = sext i32 %69 to i64 - %76 = getelementptr inbounds i8, ptr 
addrspace(4) %1, i64 %75 - %77 = load i8, ptr addrspace(4) %76, align 1, !tbaa !718 - %78 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 2, i64 %75 - store i8 %77, ptr addrspace(1) %78, align 1, !tbaa !718 - %79 = add nuw nsw i32 %69, 1 - br label %68, !llvm.loop !732 - -80: ; preds = %86, %71 - %81 = phi i32 [ 0, %71 ], [ %91, %86 ] - %82 = icmp ult i32 %81, %55 - br i1 %82, label %86, label %83 - -83: ; preds = %80 - %84 = sext i32 %55 to i64 - %85 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 3, i64 %84 - store i8 0, ptr addrspace(1) %85, align 1, !tbaa !718 - tail call spir_func void @_Z19__spirv_AtomicStorePU3AS1iN5__spv5Scope4FlagENS1_19MemorySemanticsMask4FlagEi(ptr addrspace(1) noundef @SPIR_AssertHappenedMem, i32 noundef 1, i32 noundef 16, i32 noundef 2) triton-lang#54 - br label %92 - -86: ; preds = %80 - %87 = sext i32 %81 to i64 - %88 = getelementptr inbounds i8, ptr addrspace(4) %3, i64 %87 - %89 = load i8, ptr addrspace(4) %88, align 1, !tbaa !718 - %90 = getelementptr inbounds %struct.AssertHappened, ptr addrspace(1) @SPIR_AssertHappenedMem, i64 0, i32 3, i64 %87 - store i8 %89, ptr addrspace(1) %90, align 1, !tbaa !718 - %91 = add nuw nsw i32 %81, 1 - br label %80, !llvm.loop !733 - -92: ; preds = %83, %10 - ret void -} +declare extern_weak dso_local spir_func void @__devicelib_assert_fail(ptr addrspace(4) noundef %0, ptr addrspace(4) noundef %1, i32 noundef %2, ptr addrspace(4) noundef %3, i64 noundef %4, i64 noundef %5, i64 noundef %6, i64 noundef %7, i64 noundef %8, i64 noundef %9) local_unnamed_addr triton-lang#14 ; Function Attrs: convergent nounwind declare extern_weak dso_local spir_func noundef i32 @_Z29__spirv_AtomicCompareExchangePU3AS1iN5__spv5Scope4FlagENS1_19MemorySemanticsMask4FlagES5_ii(ptr addrspace(1) noundef, i32 noundef, i32 noundef, i32 noundef, i32 noundef, i32 noundef) local_unnamed_addr triton-lang#15 ```
…ng#16) Signed-off-by: Gregory Shimansky <gshimansky@gmail.com>
Hi @ptillet ,
FP16 versions of the ISAAC BLAS functions have been added and are under review at intel#5.
Please feel free to share your opinion on the patch :) Thank you!