
Global math mode for easy use of lower-precision functionality #424

Merged · 2 commits merged into master on Sep 16, 2020

Conversation

@maleadt (Member) commented Sep 14, 2020

Fixes #354:

  • Three possible math modes: pedantic (like a CPU), default (use tensor cores), and fast (tensor cores plus lower-precision computations). This means the default now changes to using tensor cores.
  • Per-task value, inherited when creating new tasks.
  • CUDA-level API that configures all submodules:
julia> CUDA.math_mode!(CUDA.FAST_MATH; precision=:Float16)
I! cuBLAS (v11.0) function cublasStatus_t cublasSetMathMode(cublasHandle_t, cublasMath_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xb87b3d0)
i!  mode: type=cublasMath_t; val=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION(17)
i! Time: 2020-09-14T13:54:41 elapsed from start 2.783333 minutes or 167.000000 seconds
i!Process=107610; Thread=139770769617472; GPU=0; Handle=POINTER (IN HEX:0x0xb87b3d0); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_DEFAULT_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)

julia> mul!(CuArray(zeros(Float32,2,2)), CuArray(rand(Float32,2,2)), CuArray(rand(Float32,2,2)))
I! cuBLAS (v11.0) function cublasStatus_t cublasGemmEx(cublasHandle_t, cublasOperation_t, cublasOperation_t, int, int, int, const void*, const void*, cudaDataType_t, int, const void*, cudaDataType_t, int, const void*, void*, cudaDataType_t, int, cublasComputeType_t, cublasGemmAlgo_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xb87b3d0)
i!  transa: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  transb: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  m: type=int; val=2
i!  n: type=int; val=2
i!  k: type=int; val=2
i!  alpha: type=void; val=POINTER (IN HEX:0x0x7f1ea78a7370)
i!  A: type=void; val=POINTER (IN HEX:0x0x7f1dc6c00200)
i!  Atype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  lda: type=int; val=2
i!  B: type=void; val=POINTER (IN HEX:0x0x7f1dc6c20e00)
i!  Btype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldb: type=int; val=2
i!  beta: type=void; val=POINTER (IN HEX:0x0x7f1ea78a7380)
i!  C: type=void; val=POINTER (IN HEX:0x0x7f1dc6c42400)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldc: type=int; val=2
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F_FAST_16F(74)
i!  algo: type=SOME TYPE; val=CUBLAS_GEMM_DEFAULT(-1)
i! Time: 2020-09-14T13:54:45 elapsed from start 2.850000 minutes or 171.000000 seconds
i!Process=107610; Thread=139770769617472; GPU=0; Handle=POINTER (IN HEX:0x0xb87b3d0); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
2×2 CuArray{Float32,2}:
 0.175258  0.226159
 0.511893  0.331351

Note the CUBLAS_COMPUTE_32F_FAST_16F compute type in the cublasGemmEx call: the Float32 GEMM is now allowed to use lower-precision (Float16) arithmetic internally.
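As an aside, the `(17)` printed for the math mode in the first log is just the bitwise OR of two `cublasMath_t` flags; a minimal check, assuming the flag values from `cublas_api.h` (CUDA 11):

```julia
# cublasMath_t flag values (cublas_api.h, CUDA 11)
const CUBLAS_TENSOR_OP_MATH = 1                              # allow tensor cores
const CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION = 16  # keep full-precision reductions

# combined mode as passed to cublasSetMathMode in the log above
mode = CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
println(mode)  # 17
```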

TODO: same treatment for CUDNN
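The per-task behaviour described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual CUDA.jl implementation: plain Julia task-local storage is not inherited by child tasks, so this sketch falls back to a process-wide default rather than implementing true inheritance.

```julia
# Hypothetical sketch of a per-task math mode (names are illustrative).

@enum MathMode PEDANTIC_MATH DEFAULT_MATH FAST_MATH

const GLOBAL_MODE = Ref(DEFAULT_MATH)   # process-wide fallback

# current task's mode, falling back to the global default
math_mode() = get(task_local_storage(), :math_mode, GLOBAL_MODE[])::MathMode

# set the mode for the current task only
function math_mode!(mode::MathMode)
    task_local_storage(:math_mode, mode)
    return mode
end
```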

@maleadt added labels cuda libraries (Stuff about CUDA library wrappers.) and performance (How fast can we go?) on Sep 14, 2020
@maleadt (Member, Author) commented Sep 14, 2020

@denizyuret How is the CUDNN rework progressing? Do you have a PR somewhere? I'm holding off on changing the wrappers to avoid conflicts, but this PR would require some changes there (notably, using the latest v8 descriptor constructors and passing in a math mode).

+function math_type()
+    math_mode = CUDA.math_mode()
+    if math_mode == CUDA.PEDANTIC_MATH
+        CUDNN_DEFAULT_MATH
+    elseif math_mode == CUDA.DEFAULT_MATH
+        CUDNN_TENSOR_OP_MATH
+    elseif math_mode == CUDA.FAST_MATH
+        CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
+    end
+end

That's for implicit use of tensor cores; for explicit use I had the CUBLAS changes in #417, and CUDNN probably needs to be adapted too in order to support explicit (B)Float16 inputs. Are you taking care of those as part of your rework?

@codecov bot commented Sep 14, 2020

Codecov Report

Merging #424 into master will decrease coverage by 0.15%.
The diff coverage is 51.02%.


@@            Coverage Diff             @@
##           master     #424      +/-   ##
==========================================
- Coverage   79.75%   79.59%   -0.16%     
==========================================
  Files         170      170              
  Lines        9051     9088      +37     
==========================================
+ Hits         7219     7234      +15     
- Misses       1832     1854      +22     
Impacted Files Coverage Δ
src/state.jl 85.82% <26.66%> (-7.46%) ⬇️
lib/cublas/wrappers.jl 91.20% <56.25%> (-0.78%) ⬇️
lib/cublas/CUBLAS.jl 79.03% <66.66%> (-3.95%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d52264c...399be99.

@denizyuret (Contributor) commented Sep 15, 2020 via email

@maleadt maleadt merged commit ab19dda into master Sep 16, 2020
@maleadt maleadt deleted the tb/math_mode branch September 16, 2020 20:35
Merging this pull request may close: API for fast math-like mode