FPU operator #101

Merged: 32 commits, Jun 8, 2023
Conversation

wardvermeulen
Collaborator

This PR introduces a new operator, FPUOp, so that GemmKernels.jl can run on any CUDA-enabled GPU, including older GPUs that do not support WMMA.

The FPUOp{M, N, K, DT, CT} operator type has five parameters. (M, N, K) denotes the shape of the operator at the warp level. The base shape of this operator is (4, 8, 1), and every other shape must consist of multiples of these base sizes: for example, (8, 8, 1) and (16, 8, 4) are allowed, but (2, 16, 1) is not. DT is the data type of the C and D matrices of the GEMM, and CT is the compute type of the GEMM, i.e. the data type of the A and B matrices.
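For concreteness, here is a minimal sketch of how such an operator could be parameterized; the `GemmKernels.Operator` module path is an assumption based on the `src/operator.jl` changes in this PR, and only the type parameters follow from the description above:

```julia
using GemmKernels

# Warp-level operator shape (M, N, K) = (8, 8, 1): a multiple of the (4, 8, 1)
# base shape, as required above. DT = Float32 is the element type of C and D;
# CT = Float32 is the compute type, i.e. the element type of A and B.
const op = GemmKernels.Operator.FPUOp{8, 8, 1, Float32, Float32}
```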

Evaluation

An operator shape of (8, 8, 1) was used for all GemmKernels.jl benchmarks. Other parameters, such as block sizes and grid sizes, were tuned manually until a decent result was found; faster configurations may exist.

GeForce RTX 2080 Ti benchmark

[benchmark plot: gemm]

Tesla V100 benchmark

[benchmark plot: gemmV100]

Limitations

  • Mixed typing is supported and functions correctly: `Float16 x Float16 + Float32` is possible, but I have not found a configuration yet where that is faster than `Float32 x Float32 + Float32`.
  • `Float16 x Float16 + Float16` is very imprecise and fails the tests, which is why it is not included in the matmul.jl test file.
  • Issue #99 (apply_iterate of large LocalArray eltypes runs into compiler heuristics) is more prevalent when using FPUOp, because the operator size is much smaller than for WMMA. Each LocalArray can hold at most 16 elements, so the block shapes should adhere to the following rule (a small illustrative check is sketched after the rule):

(blockshape.M / operatorshape.M) * (blockshape.N / operatorshape.N) <= 16
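For illustration only, the rule can be encoded as a small helper; this is not part of the package, and the named-tuple representation of the shapes is an assumption:

```julia
# Illustrative check of the LocalArray rule above; `block` and `op` hold the
# block and operator shapes as named tuples.
fits_localarray(block, op) = (block.M ÷ op.M) * (block.N ÷ op.N) <= 16

# Example: a (64, 16) block tile with the (8, 8, 1) operator gives 8 * 2 = 16,
# which just satisfies the bound.
fits_localarray((M = 64, N = 16, K = 32), (M = 8, N = 8, K = 1))  # true
```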

Member

@thomasfaingnaert left a comment


Looks good! Just added some minor comments.

src/operator.jl (outdated; resolved)
src/operator.jl (outdated; resolved)
benchmarks/fpu/benchmark.jl (outdated; resolved)
benchmarks/fpu/plot.jl (outdated; resolved)
@thomasfaingnaert
Member

> `Float16 x Float16 + Float32` is possible, but I have not found a configuration yet where that is faster than `Float32 x Float32 + Float32`.

That's strange. Do you take into account that if the A and B matrices are Float16, they take up only half the space in global and shared memory, and hence that you can double the tile sizes when comparing Float16 and Float32?

@wardvermeulen
Collaborator Author

wardvermeulen commented May 16, 2023

> That's strange. Do you take into account that if the A and B matrices are Float16, they take up only half the space in global and shared memory, and hence that you can double the tile sizes when comparing Float16 and Float32?

I should experiment further with this, but should it not already be faster without doubling the tile sizes, since twice as many elements can be loaded with SIMD?

EDIT: I actually found a tiny improvement in the meantime. A (1024, 1024, 1024) GEMM takes 310 µs with Float32 and a block shape of (128, 128, 32), and takes 298 µs with Float16 and a block shape of (128, 128, 64). However, if the block shape stays the same, i.e. at (128, 128, 32), performance actually becomes worse for Float16, at 321 µs.

In addition, I added support for operations other than fma, and included tropical mma as an example.
The elements in shared memory are now accessed in column-major order for consistency (there was no difference in performance).
@wardvermeulen
Collaborator Author

wardvermeulen commented Jun 3, 2023

I fixed the 2D indexing of the fragments and changed the accesses to be in column-major order; to be clear, there is no performance difference. I also added support for other FPU operators with different arithmetic, and included the example for tropical arithmetic as previously discussed in #51.
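For context, tropical (max-plus) GEMM replaces the usual multiply-accumulate with an add followed by a max; the element-wise update rules are sketched below with illustrative function names that are not taken from this PR:

```julia
# Standard GEMM inner update: d ← a * b + d
fpu_update(a, b, d) = fma(a, b, d)

# Tropical (max-plus) update: d ← max(d, a + b)
tropical_update(a, b, d) = max(d, a + b)
```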

To fix the failing tests, I reduced the tests with large operator sizes; I think Julia 1.6 ran into some apply_iterate issues. The operator shape tests still cover everything they need to, though I am not sure whether this is the right approach. There is also one WMMA test that fails but did not fail on master, and I am not sure how that can be resolved.

src/operator.jl (outdated; resolved)
src/operator.jl (outdated; resolved)
@maleadt
Member

maleadt commented Jun 6, 2023

Better to rebase instead of merging; just do `git pull --rebase` by default.

@wardvermeulen mentioned this pull request on Jun 7, 2023