FPU operator #101

Merged: 32 commits, Jun 8, 2023
Conversation

wardvermeulen
Collaborator

This PR introduces a new operator, FPUOp, so that GemmKernels.jl can run on any CUDA-enabled GPU, including older GPUs that do not support WMMA.

The FPUOp{M, N, K, DT, CT} operator type has five parameters. (M, N, K) denotes the shape of the operator at the warp level. The base shape of this operator is (4, 8, 1), and every other shape must consist of multiples of these base sizes: for example, (8, 8, 1) and (16, 8, 4) are allowed, but (2, 16, 1) is not. DT is the data type of the C and D matrices of the GEMM, and CT is the compute type of the GEMM, i.e. the data type of the A and B matrices.
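For concreteness, here is a minimal sketch of how such an operator could be parameterized; the `GemmKernels.Operator` module path is an assumption based on the `src/operator.jl` changes in this PR, and only the type parameters follow from the description above:

```julia
using GemmKernels

# Warp-level operator shape (M, N, K) = (8, 8, 1): a multiple of the (4, 8, 1)
# base shape, as required above. DT = Float32 is the element type of C and D;
# CT = Float32 is the compute type, i.e. the element type of A and B.
const op = GemmKernels.Operator.FPUOp{8, 8, 1, Float32, Float32}
```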

Evaluation

An operator shape of (8, 8, 1) was used for all GemmKernels.jl benchmarks. Other parameters, such as block sizes and grid sizes, were tuned manually until a decent result was found; faster configurations may exist.

GeForce RTX 2080 Ti benchmark

[benchmark plot: gemm]

Tesla V100 benchmark

[benchmark plot: gemmV100]

Limitations

  • Mixed typing is supported and functions correctly: `Float16 x Float16 + Float32` is possible, but I have not found a configuration yet where that is faster than `Float32 x Float32 + Float32`.
  • `Float16 x Float16 + Float16` is very imprecise and fails the tests, which is why it is not included in the matmul.jl test file.
  • Issue #99 (apply_iterate of large LocalArray eltypes runs into compiler heuristics) is more prevalent when using FPUOp, because the operator size is much smaller than for WMMA. Each LocalArray can hold at most 16 elements, so the block shapes should adhere to the following rule (a small illustrative check is sketched after the rule):

(blockshape.M / operatorshape.M) * (blockshape.N / operatorshape.N) <= 16
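For illustration only, the rule can be encoded as a small helper; this is not part of the package, and the named-tuple representation of the shapes is an assumption:

```julia
# Illustrative check of the LocalArray rule above; `block` and `op` hold the
# block and operator shapes as named tuples.
fits_localarray(block, op) = (block.M ÷ op.M) * (block.N ÷ op.N) <= 16

# Example: a (64, 16) block tile with the (8, 8, 1) operator gives 8 * 2 = 16,
# which just satisfies the bound.
fits_localarray((M = 64, N = 16, K = 32), (M = 8, N = 8, K = 1))  # true
```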

Member

@thomasfaingnaert left a comment


Looks good! Just added some minor comments.

src/operator.jl (outdated; resolved)
src/operator.jl (outdated; resolved)
benchmarks/fpu/benchmark.jl (outdated; resolved)
benchmarks/fpu/plot.jl (outdated; resolved)
@thomasfaingnaert
Member

> `Float16 x Float16 + Float32` is possible, but I have not found a configuration yet where that is faster than `Float32 x Float32 + Float32`.

That's strange. Do you take into account that if the A and B matrices are Float16, they take up only half the space in global and shared memory, and hence that you can double the tile sizes when comparing Float16 and Float32?

@wardvermeulen
Collaborator Author

wardvermeulen commented May 16, 2023

> That's strange. Do you take into account that if the A and B matrices are Float16, they take up only half the space in global and shared memory, and hence that you can double the tile sizes when comparing Float16 and Float32?

I should experiment further with this, but should it not already be faster without doubling the tile sizes, since twice as many elements can be loaded with SIMD?

EDIT: I actually found a tiny improvement in the meantime. A (1024, 1024, 1024) GEMM takes 310 µs with Float32 and a block shape of (128, 128, 32), and takes 298 µs with Float16 and a block shape of (128, 128, 64). However, if the block shape stays the same, i.e. at (128, 128, 32), performance actually becomes worse for Float16, at 321 µs.

In addition, I added support for operations other than fma, and included tropical mma as an example.
The elements in shared memory are now accessed in column-major order for consistency (there was no difference in performance).
@wardvermeulen
Collaborator Author

wardvermeulen commented Jun 3, 2023

I fixed the 2D indexing of the fragments and changed the accesses to be in column-major order; to be clear, there is no performance difference. I also added support for other FPU operators with different arithmetic, and included the example for tropical arithmetic as previously discussed in #51.
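For context, tropical (max-plus) GEMM replaces the usual multiply-accumulate with an add followed by a max; the element-wise update rules are sketched below with illustrative function names that are not taken from this PR:

```julia
# Standard GEMM inner update: d ← a * b + d
fpu_update(a, b, d) = fma(a, b, d)

# Tropical (max-plus) update: d ← max(d, a + b)
tropical_update(a, b, d) = max(d, a + b)
```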

To fix the failing tests, I reduced the tests with large operator sizes; I think Julia 1.6 ran into some apply_iterate issues. The operator shape tests still cover everything they need to, though I am not sure whether this is the right approach. There is also one WMMA test that fails but did not fail on master, and I am not sure how that can be resolved.

src/operator.jl (outdated; resolved)
src/operator.jl (outdated; resolved)
@maleadt
Member

maleadt commented Jun 6, 2023

Better to rebase instead of merging; just do `git pull --rebase` by default.

@wardvermeulen mentioned this pull request on Jun 7, 2023