FPU operator #101
Conversation
Looks good! Just added some minor comments.
That's strange. Do you take into account that if the A and B matrices are Float16, they take up only half the space in global and shared memory, and hence that you can double the tile sizes when comparing Float16 and Float32?
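A quick back-of-the-envelope check of that point (the tile dimensions below are illustrative only, not the configurations benchmarked in this PR):

```julia
# Shared-memory footprint of an M×K tile with element type T.
tile_bytes(M, K, T) = M * K * sizeof(T)

tile_bytes(128, 32, Float32)  # 16384 bytes
tile_bytes(128, 32, Float16)  #  8192 bytes: half the footprint
tile_bytes(256, 32, Float16)  # 16384 bytes: doubled M, same budget
```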
I should experiment further with this, but should it not already be faster without doubling the tile sizes, since twice the number of elements can be loaded with SIMD? EDIT: I actually found a tiny improvement in the meantime. A (1024, 1024, 1024) GEMM takes 310 µs with …
In addition, I added support for operations other than fma, and included tropical mma as an example. The elements in shared memory are now accessed in column-major order for consistency (there was no difference in performance).
I fixed the 2D indexing of the fragments and changed the accesses to column-major order, just to be consistent; there is no performance difference. I also added support for other FPU operators with different arithmetic, and included the example for tropical arithmetic as previously discussed in #51. To fix the failing test, I reduced the tests with large operator sizes. I think Julia 1.6 ran into some …
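For reference, tropical (max-plus) mma replaces the usual fma(a, b, c) = a*b + c with max(a + b, c). A minimal CPU sketch of that arithmetic, independent of the package's actual operator API:

```julia
# Tropical "multiply-accumulate": + plays the role of *, max the role of +.
tropical_mma(a, b, c) = max(a + b, c)

# Reference tropical GEMM: D[i, j] = max over k of (A[i, k] + B[k, j]).
function tropical_gemm(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where T
    M, K = size(A)
    K2, N = size(B)
    @assert K == K2
    D = fill(typemin(T), M, N)  # the tropical "zero" is typemin (-Inf for floats)
    for j in 1:N, k in 1:K, i in 1:M
        D[i, j] = tropical_mma(A[i, k], B[k, j], D[i, j])
    end
    return D
end
```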
Better to rebase instead of merging, just do …
This PR introduces a new operator, `FPUOp`, in an attempt to make GemmKernels.jl function on any CUDA-enabled GPU, i.e. also on older GPUs that do not support WMMA.

The `FPUOp{M, N, K, DT, CT}` operator type has five parameters. (M, N, K) denotes the shape of the operator at the warp level. The base shape of this operator is (4, 8, 1), and every other shape must consist of multiples of these base sizes: for example, (8, 8, 1) and (16, 8, 4) are allowed, but (2, 16, 1) is not. DT is the data type of the C and D matrices of the GEMM, and CT is the compute type of the GEMM, i.e. the data type of the A and B matrices.
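A minimal sketch of that shape rule in plain Julia (the `valid_shape` helper is illustrative only, not part of the package):

```julia
# The base operator shape is (4, 8, 1); every FPUOp shape must be an
# element-wise multiple of it. K's base size is 1, so any K works.
valid_shape(M, N, K) = M % 4 == 0 && N % 8 == 0 && K % 1 == 0

valid_shape(8, 8, 1)   # true
valid_shape(16, 8, 4)  # true
valid_shape(2, 16, 1)  # false: M = 2 is not a multiple of 4
```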
Evaluation

An operator shape of (8, 8, 1) was used for all GemmKernels.jl benchmarks. Other parameters, such as block sizes and grid sizes, were tuned manually until a decent result was found; faster configurations may exist.
[Plot: GeForce RTX 2080 Ti benchmark]
[Plot: Tesla V100 benchmark]
Limitations

- `Float16 x Float16 + Float32` is possible, but I have not yet found a configuration where it is faster than `Float32 x Float32 + Float32`.
- `Float16 x Float16 + Float16` is really imprecise and fails tests, which is why it is not included in the `matmul.jl` test file.
- Block shapes are constrained for `FPUOp` because the operator size is much smaller than for WMMA. Each LocalArray can hold at most 16 elements, so the block shapes should adhere to the following rule (see the sketch after this list): `(blockshape.M / operatorshape.M) * (blockshape.N / operatorshape.N) <= 16`