
Global math mode for easy use of lower-precision functionality #424

Merged · 2 commits merged into master on Sep 16, 2020

Conversation

@maleadt (Member) commented Sep 14, 2020

Fixes #354:

  • Three possible math modes: pedantic (like a CPU), default (use tensor cores), and fast (tensor cores plus lower-precision computations). This means the default now changes to using tensor cores.
  • Per-task value, inherited when creating new tasks.
  • CUDA-level API that configures all submodules:
julia> CUDA.math_mode!(CUDA.FAST_MATH; precision=:Float16)
I! cuBLAS (v11.0) function cublasStatus_t cublasSetMathMode(cublasHandle_t, cublasMath_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xb87b3d0)
i!  mode: type=cublasMath_t; val=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION(17)
i! Time: 2020-09-14T13:54:41 elapsed from start 2.783333 minutes or 167.000000 seconds
i!Process=107610; Thread=139770769617472; GPU=0; Handle=POINTER (IN HEX:0x0xb87b3d0); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_DEFAULT_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)

julia> mul!(CuArray(zeros(Float32,2,2)), CuArray(rand(Float32,2,2)), CuArray(rand(Float32,2,2)))
I! cuBLAS (v11.0) function cublasStatus_t cublasGemmEx(cublasHandle_t, cublasOperation_t, cublasOperation_t, int, int, int, const void*, const void*, cudaDataType_t, int, const void*, cudaDataType_t, int, const void*, void*, cudaDataType_t, int, cublasComputeType_t, cublasGemmAlgo_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xb87b3d0)
i!  transa: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  transb: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  m: type=int; val=2
i!  n: type=int; val=2
i!  k: type=int; val=2
i!  alpha: type=void; val=POINTER (IN HEX:0x0x7f1ea78a7370)
i!  A: type=void; val=POINTER (IN HEX:0x0x7f1dc6c00200)
i!  Atype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  lda: type=int; val=2
i!  B: type=void; val=POINTER (IN HEX:0x0x7f1dc6c20e00)
i!  Btype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldb: type=int; val=2
i!  beta: type=void; val=POINTER (IN HEX:0x0x7f1ea78a7380)
i!  C: type=void; val=POINTER (IN HEX:0x0x7f1dc6c42400)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldc: type=int; val=2
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F_FAST_16F(74)
i!  algo: type=SOME TYPE; val=CUBLAS_GEMM_DEFAULT(-1)
i! Time: 2020-09-14T13:54:45 elapsed from start 2.850000 minutes or 171.000000 seconds
i!Process=107610; Thread=139770769617472; GPU=0; Handle=POINTER (IN HEX:0x0xb87b3d0); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
2×2 CuArray{Float32,2}:
 0.175258  0.226159
 0.511893  0.331351

Note the CUBLAS_COMPUTE_32F_FAST_16F compute type in the cublasGemmEx call: the Float32 GEMM is now allowed to use lower-precision (Float16) arithmetic internally.
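As an aside, the `(17)` printed for the math mode in the first log is just the bitwise OR of two `cublasMath_t` flags; a minimal check, assuming the flag values from `cublas_api.h` (CUDA 11):

```julia
# cublasMath_t flag values (cublas_api.h, CUDA 11)
const CUBLAS_TENSOR_OP_MATH = 1                              # allow tensor cores
const CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION = 16  # keep full-precision reductions

# combined mode as passed to cublasSetMathMode in the log above
mode = CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
println(mode)  # 17
```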

TODO: same treatment for CUDNN
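The per-task behaviour described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual CUDA.jl implementation: plain Julia task-local storage is not inherited by child tasks, so this sketch falls back to a process-wide default rather than implementing true inheritance.

```julia
# Hypothetical sketch of a per-task math mode (names are illustrative).

@enum MathMode PEDANTIC_MATH DEFAULT_MATH FAST_MATH

const GLOBAL_MODE = Ref(DEFAULT_MATH)   # process-wide fallback

# current task's mode, falling back to the global default
math_mode() = get(task_local_storage(), :math_mode, GLOBAL_MODE[])::MathMode

# set the mode for the current task only
function math_mode!(mode::MathMode)
    task_local_storage(:math_mode, mode)
    return mode
end
```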

@maleadt added labels cuda libraries (Stuff about CUDA library wrappers.) and performance (How fast can we go?) on Sep 14, 2020
@maleadt (Member, Author) commented Sep 14, 2020

@denizyuret How is the CUDNN rework progressing? Do you have a PR somewhere? I'm holding off on changing the wrappers to avoid conflicts, but this PR would require some changes there (notably, using the latest v8 descriptor constructors and passing in a math mode).

+function math_type()
+    math_mode = CUDA.math_mode()
+    if math_mode == CUDA.PEDANTIC_MATH
+        CUDNN_DEFAULT_MATH
+    elseif math_mode == CUDA.DEFAULT_MATH
+        CUDNN_TENSOR_OP_MATH
+    elseif math_mode == CUDA.FAST_MATH
+        CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
+    end
+end

That's for implicit use of tensor cores; for explicit use I had the CUBLAS changes in #417, and CUDNN probably needs to be adapted too in order to support explicit (B)Float16 inputs. Are you taking care of those as part of your rework?

@codecov bot commented Sep 14, 2020

Codecov Report

Merging #424 into master will decrease coverage by 0.15%.
The diff coverage is 51.02%.


@@            Coverage Diff             @@
##           master     #424      +/-   ##
==========================================
- Coverage   79.75%   79.59%   -0.16%     
==========================================
  Files         170      170              
  Lines        9051     9088      +37     
==========================================
+ Hits         7219     7234      +15     
- Misses       1832     1854      +22     
Impacted Files Coverage Δ
src/state.jl 85.82% <26.66%> (-7.46%) ⬇️
lib/cublas/wrappers.jl 91.20% <56.25%> (-0.78%) ⬇️
lib/cublas/CUBLAS.jl 79.03% <66.66%> (-3.95%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d52264c...399be99.

@denizyuret (Contributor) commented Sep 15, 2020 via email

@maleadt maleadt merged commit ab19dda into master Sep 16, 2020
@maleadt maleadt deleted the tb/math_mode branch September 16, 2020 20:35
Merging this pull request may close: API for fast math-like mode