Fix and optimize handling of vectorized memory accesses #17767
Conversation
MSHADOW_XINLINE vectorized_storage() {}
MSHADOW_XINLINE ~vectorized_storage() {}
Tangential question related to MSHADOW_XINLINE use in this file: do we still need inline __attribute__((always_inline)), or is nvcc smart enough to inline? If so, can we stop using the mshadow macros and just use __device__ __host__?
MSHADOW_XINLINE is useful because it is empty if you are not running with NVCC (and GCC does not understand __host__ __device__).
True. MSHADOW_XINLINE further sets __attribute__((always_inline)). Do we need that for nvcc? With respect to __host__ __device__, this file is wrapped in MXNET_USE_CUDA && __CUDACC__, so gcc won't see it.
Frankly, I'm not sure. Generally speaking, inline should be enough, but it is only a hint and the compiler is allowed not to inline for whatever reason.
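For reference, this is roughly how mshadow defines the macro (paraphrased from mshadow/base.h; the exact guards may differ between versions). Note that on a non-NVCC compile it is not literally empty: it still forces inlining, just without the CUDA qualifiers.

```cpp
// Paraphrase of the mshadow/base.h definition (exact guards may differ
// between versions).
#ifdef _MSC_VER
  #define MSHADOW_FORCE_INLINE __forceinline
#else
  #define MSHADOW_FORCE_INLINE inline __attribute__((always_inline))
#endif

#ifdef __CUDACC__
  // nvcc: callable from both host and device code.
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE __device__ __host__
#else
  // gcc/clang/MSVC never see __device__/__host__.
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE
#endif
```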
@haojin2 I'm hitting the OoM on the Windows build again, like you did before. I started looking at the numpy versions of those functions: you have way more templates there (some of which are actually not compiled on Windows), and I started thinking that we should probably just switch to runtime compilation of those kernels - there are just too many variants here. What do you think about this (also @eric-haibin-lin @szha @leezu for comments)? Also, I don't see elementwise ops in the numpy python package, just broadcast ops. This is pretty bad, because the knowledge that the shapes are the same is important for optimizations - for example, pointwise fusion would not really work for such operators.
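For context on the runtime-compilation idea floated above, here is a minimal NVRTC sketch (not MXNet's implementation; the helper name and architecture option are assumptions, and error handling is omitted for brevity). The point is that only the kernel variants actually requested at runtime pay a compilation cost, instead of the compiler instantiating every template combination at build time:

```cpp
// Compile a kernel from a source string at runtime with NVRTC, then
// load the resulting PTX through the CUDA driver API.
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

CUfunction CompileKernel(const std::string& src, const char* name) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src.c_str(), "op.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};  // assumed target
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);
  CUmodule module;
  CUfunction kernel;
  cuModuleLoadData(&module, ptx.data());
  cuModuleGetFunction(&kernel, module, name);
  return kernel;  // launch later with cuLaunchKernel; cache by (op, dtype)
}
```

In practice compiled kernels would be cached keyed on the operator and dtypes, so each variant is compiled at most once per process.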
For the OoM, could updating the Windows toolchain help (#17808)? cc @vexilligera @josephevans
@mxnet-bot run ci [windows-cpu]
Jenkins CI successfully triggered: [windows-cpu]
One thing I like to see, in the spirit of test-driven development, is a first commit that includes a test demonstrating the problem being fixed, if possible. Since this was not done, I verified that both sub-tests of test_elementwise_ops_on_misaligned_input cause that test to fail on a GPU (i.e. as invoked via test_operator_gpu.py) on a checkout that doesn't include this PR. The other two supplied tests do not fail on this prior checkout, as they were designed to validate the broadcast vectorization that will be part of the follow-on PR. All three tests pass when run on the CPU, since this is a GPU issue being addressed here, but it is good to keep the test coverage consistent.
LGTM.
* Vectorized loads for binary elemwise kernel
* More generalization
* Add backwardusenone
* Remove the unused _backward_add op
* Add vectorized backwardusein
* Extending vectorization to more binary ops, binary ops with scalar and unary ops
* Handling ElementwiseSum
* Get rid of half2 in mshadow
* Remove backward_elemwiseaddex
* Revert "Remove the unused _backward_add op" This reverts commit f86da86.
* Revert "Remove backward_elemwiseaddex" This reverts commit 7729114.
* Add back the backward_add since C++ test relies on it
* Test bcast implementations
* First version of vectorized bcast
* Adding single side vectorized bcast kernel
* Removing debug prints
* Actually run the single side kernel
* Move the default implementation of bcast to the vectorized one
* Limit the new implementation to GPU only
* Enabling vectorization when broadcast does not actually do broadcast
* Cleaning
* Cleaning part 2
* Fix for numpy ops using stuff from broadcast
* Fix
* Fix lint
* Try to debug pinv numpy test
* Fix
* Fix the vectorized broadcast implementation for misaligned input pointers
* Added tests
* Added docs to cuda_vectorization.cuh
* Another fix for broadcast and fix INT64 compilation
* Optimize for aligned=true
* 1 more addition to test
* Reverting the change to Numpy op test
* Trying mcmodel=medium to fix the failure in CMake static build
* Revert "Trying mcmodel=medium to fix the failure in CMake static build" This reverts commit 1af684c.
* Limiting the PR to just elementwise ops
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Vectorized loads for binary elemwise kernel * More generalization * Add backwardusenone * Remove the unused _backward_add op * Add vectorized backwardusein * Extending vectorization to more binary ops, binary ops with scalar and unary ops * Handling ElementwiseSum * Get rid of half2 in mshadow * Remove backward_elemwiseaddex * Revert "Remove the unused _backward_add op" This reverts commit f86da86. * Revert "Remove backward_elemwiseaddex" This reverts commit 7729114. * Add back the backward_add since C++ test relies on it * Test bcast implementations * First version of vecotrized bcast * Adding single side vectorized bcast kernel * Removing debug prints * Actually run the single side kernel * Move the default implementation of bcast to the vectorized one * Limit the new implementation to GPU only * Enabling vectorization when broadcast does not actually do broadcast * Cleaning * Cleaning part 2 * Fix for numpy ops using stuff from broadcast * Fix * Fix lint * Try to debug pinv numpy test * Fix * Fix the vectorized broadcast implementation for misaligned input pointers * Added tests * Added docs to cuda_vectorization.cuh * Another fix for broadcast and fix INT64 compilation * Optimize for aligned=true * 1 more addition to test * Reverting the change to Numpy op test * Trying mcmodel=medium to fix the failure in CMake static build * Revert "Trying mcmodel=medium to fix the failure in CMake static build" This reverts commit 1af684c. * Limiting the PR to just elementwise ops
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
…apache#18309) * Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Reapplying PR #17767
* Making RTC required
* Move cuda utils to src/common/cuda and refactor RTC part
* Unary ops via RTC
* Support binary_scalar forward Remove elemwise_scatter_op.* Fix BinaryScalar usage in NumPy
* Backward of binary scalar
* Binary forward
* Fix for binary_scalar
* Moving all binary forward to RTC Reorganization
* Backward of binary ops
* Support broadcast Add RTC to NumPy ops
* RTC for elementwise sum Fixes
* RTC for backward usenone of broadcast
* RTC for broadcast bwd usein
* Remove non-RTC vectorization support
* Remove template from ReduceWorkspaceSize
* Fixes from rebase
* Guarding RTC usage behind MXNET_USE_CUDA
* More guards
* C++17 for CUDA code
* MixedUnaryBackwardInOut as RTC
* Removing unused variable
* Revert "C++17 for CUDA code" This reverts commit b09090c.
* Get rid of CI tests without RTC Get rid of if constexpr as CUDA 10 does not support it
* Fix lint
* Change a few more elemwise functions Fix for too long value
* Fix large tensor build
* Another try with DBL_MAX
* Fix Windows compilation
* Fix the large int test
* Add the printing of error code value to CUDA_DRIVER_CALL
* Fix
* Fix binary scalar
* Get more information when cuLaunchKernel fails
* Going easy on Windows compiler
* Fix lint
* Reorganization to split strings due to Windows compilation problems
* Fix error with uninitialized value
* Fix handling of different types for backward of binary scalar
* Decreasing RTC overhead
* Fix lint and remove rest of mentions of ENABLE_RTC
* Jetson with RTC
* Fix the aws s3 command
* Debugging Windows failure
* More debugging of Windows failure
* Debug
* Fix the issue on Windows (long -> long long for 8B)
* libcuda.so for Jetson
* Enable debug information for RTC kernels and cleaning debug ptx dump
* Fix lint
* Try without linking the stub of libcuda.so to different place in Jetson
* Add docstring
* Answering review comments
* Unifying vectorization
* Fix
* Fixes for reduce ops
* Fix M=1 case
* Fixes from rebase Fixes for mixed type gradient functions Set the launch bounds on RTC kernels
* Fix
* Fix tests
* Adding tutorial for RTC
* Fixes after merge
* Fixes from review
* Change env var doc and undo the change to toctree
Description
For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads, as this ensures the bandwidth is fully utilized.
Currently, MXNet uses vectorized loads and stores only for the half_t type, and only in a few operators (some elementwise binary operators and elementwise_sum). Unfortunately, the way it was done makes assumptions about MXNet's NDArrays which do not hold true in all cases: running an elementwise op on misaligned slices of NDArrays results in silent data corruption, where the last element of the input a is overwritten even though it should not have been changed.
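To illustrate the failure mode, here is a sketch of the naive vectorized pattern (illustrative only, not the kernel from this PR). A slice like a[1:] offsets an fp16 data pointer by 2 bytes, so it is no longer aligned to the 4-byte boundary a half2 access requires; the vector load is then undefined behavior, and the paired store can clobber memory just past the logical end of the output, matching the corrupted-last-element symptom:

```cpp
#include <cuda_fp16.h>

// Naive half2 elementwise add: each thread handles one pair of elements.
__global__ void add_half2_naive(const __half* a, const __half* b,
                                __half* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (2 * i + 1 < n) {
    // Unsafe unless a, b and out are all 4-byte aligned and n is even.
    __half2 va = reinterpret_cast<const __half2*>(a)[i];
    __half2 vb = reinterpret_cast<const __half2*>(b)[i];
    reinterpret_cast<__half2*>(out)[i] = __hadd2(va, vb);
  }
}
```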
It was not noticed before because a + b on NDArrays launches broadcast_add instead of elemwise_add (and broadcast_add is not vectorized), whereas in symbolic execution slices produce new allocations, which do not exhibit those issues.

This PR:
* fixes the handling of vectorized memory accesses so that misaligned inputs are processed correctly
* extends vectorization to more elementwise operators and to all data types (not just half_t)

@eric-haibin-lin @sxjscience @haojin2
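Since the fix keys on alignment, here is a hedged sketch of the dispatch idea (illustrative only; the PR's actual helpers live in cuda_vectorization.cuh and also handle leading/trailing elements inside the kernel). IsAligned, add_half_scalar and LaunchAdd are hypothetical names introduced for this sketch; add_half2_naive is the kernel sketched above:

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Scalar fallback: one element per thread, no alignment requirements.
__global__ void add_half_scalar(const __half* a, const __half* b,
                                __half* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __hadd(a[i], b[i]);
}

inline bool IsAligned(const void* ptr, size_t alignment) {
  return reinterpret_cast<uintptr_t>(ptr) % alignment == 0;
}

void LaunchAdd(const __half* a, const __half* b, __half* out, int n,
               cudaStream_t stream) {
  const int threads = 256;
  // Take the vectorized path only when every pointer is aligned to the
  // vector width and no partial pair remains at the end.
  const bool vec_ok = IsAligned(a, sizeof(__half2)) &&
                      IsAligned(b, sizeof(__half2)) &&
                      IsAligned(out, sizeof(__half2)) && n % 2 == 0;
  if (vec_ok) {
    const int pairs = n / 2;
    add_half2_naive<<<(pairs + threads - 1) / threads, threads, 0, stream>>>(
        a, b, out, n);
  } else {
    add_half_scalar<<<(n + threads - 1) / threads, threads, 0, stream>>>(
        a, b, out, n);
  }
}
```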