
Fix and optimize handling of vectorized memory accesses #17767

Merged
merged 41 commits into apache:master on Apr 17, 2020

Conversation

ptrendx (Member) commented Mar 5, 2020

Description

For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads, as this ensures that the available bandwidth is fully utilized.

Currently MXNet uses vectorized loads and stores only for the half_t type, and only in a few operators (some elementwise binary operators and elementwise_sum). Unfortunately, the way this was done makes assumptions about MXNet's NDArrays that do not hold in all cases.
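
The point of vectorization is bandwidth utilization: a kernel that loads one 2-byte half_t per thread per instruction issues many narrow transactions, while a kernel that loads a full 8- or 16-byte vector per thread moves the same data with far fewer instructions. As a rough illustration of the general technique (a sketch, not the PR's actual helpers), a CUDA elementwise kernel can widen its accesses by reinterpreting the pointers as a built-in vector type:

// Sketch only: widened elementwise add, assuming n is a multiple of 4 and
// all three pointers are 16-byte aligned. One float4 load replaces four
// separate float loads, quadrupling the bytes moved per instruction.
__global__ void add_vec4(const float* a, const float* b, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;  // one vector per thread
  if (i < n / 4) {
    const float4 va = reinterpret_cast<const float4*>(a)[i];
    const float4 vb = reinterpret_cast<const float4*>(b)[i];
    float4 vo;
    vo.x = va.x + vb.x;
    vo.y = va.y + vb.y;
    vo.z = va.z + vb.z;
    vo.w = va.w + vb.w;
    reinterpret_cast<float4*>(out)[i] = vo;
  }
}

The two failures below stem precisely from those assumptions: the pointers must be aligned to the vector width, and a partial final vector must not be read or written as a whole vector.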

  • Failure 1:
import mxnet as mx
ctx=mx.gpu()
a = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
b = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)

c = a[1:3]
d = b[1:3]
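# these views start at element 1; with 2-byte half_t elements their
# pointers are misaligned for the wider vectorized loads and stores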
mx.nd.elemwise_add(c, d, out=c)

Results in error:

Check failed: e == cudaSuccess: CUDA: misaligned address
  • Failure 2:
import mxnet as mx
ctx=mx.gpu()
a = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
b = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)

print(a)
c = a[0:3]
d = b[0:3]
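# c and d are 3-element views sharing memory with a and b; a vectorized
# kernel that writes whole vectors also touches a[3], outside the view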
mx.nd.elemwise_add(c, d, out=c)
mx.nd.waitall()
print(c)
print(a)

gives:

[1. 2. 3. 4.]
<NDArray 4 @gpu(0)>

[2. 4. 6.]
<NDArray 3 @gpu(0)>

[2. 4. 6. 8.]
<NDArray 4 @gpu(0)>

which is silent data corruption (the last element of a should not have been changed).

This was not noticed before because a + b on NDArrays dispatches to broadcast_add (which is not vectorized) rather than elemwise_add, while in symbolic execution slices produce new allocations, which do not exhibit these issues.
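
The fix follows the usual pattern: before dispatching to the vectorized kernel, check the alignment of every pointer involved, and make the kernel handle any trailing partial vector element by element. A minimal sketch of such a dispatch check (illustrative names, assuming an 8-byte vector of four half_t values; not the PR's actual helper API):

#include <cstdint>

// Assumed vector width for this sketch: 8 bytes, i.e. four 2-byte half_t.
constexpr std::uintptr_t kVectorBytes = 8;

template <typename DType>
bool CanUseVectorizedKernel(const DType* a, const DType* b, const DType* out) {
  // All pointers must be aligned to the vector width; a view like a[1:3]
  // starts at a 2-byte offset and fails this check (Failure 1). Separately,
  // the kernel must process a trailing partial vector per element rather
  // than storing a full vector, which is what corrupted a[3] in Failure 2.
  return reinterpret_cast<std::uintptr_t>(a) % kVectorBytes == 0 &&
         reinterpret_cast<std::uintptr_t>(b) % kVectorBytes == 0 &&
         reinterpret_cast<std::uintptr_t>(out) % kVectorBytes == 0;
}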

This PR:

  • fixes those issues
  • introduces helpers for handling vectorization (for all types, not only half_t)
  • increases performance of vectorized kernels
  • introduces vectorization for all binary/unary/binary with scalar ops
  • (WIP) introduces vectorization for broadcast ops

@eric-haibin-lin @sxjscience @haojin2

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Properly handle vectorized loads and stores in elementwise kernels
  • Handle vectorized loads and stores in elementwise broadcast kernels

Comment on lines +46 to +47
MSHADOW_XINLINE vectorized_storage() {}
MSHADOW_XINLINE ~vectorized_storage() {}

Contributor

Tangential question related to MSHADOW_XINLINE use in this file: Do we still need inline __attribute__((always_inline)) or is nvcc smart enough to inline? If so, can we stop using the mshadow macros and just use __device__ __host__?

Member Author

MSHADOW_XINLINE is useful because it is empty if you are not running with NVCC (and GCC does not understand __host__ __device__).

Contributor

True. MSHADOW_XINLINE further sets __attribute__((always_inline)). Do we need that for nvcc?

With respect to __host__ __device__, this file is wrapped into MXNET_USE_CUDA && __CUDACC__, so gcc won't see it.

Member Author

Frankly, not sure - generally speaking inline should be enough, but it is only a hint and the compiler is allowed to not inline for whatever reason.
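
For context, mshadow defines the macro under discussion along these lines (a paraphrased sketch; see mshadow/base.h for the exact definition):

// Force inlining in a compiler-specific way, then add the CUDA function
// qualifiers only when compiling with nvcc, so that plain gcc/clang/MSVC
// (which reject __device__/__host__) can still parse the headers.
#ifdef _MSC_VER
  #define MSHADOW_FORCE_INLINE __forceinline
#else
  #define MSHADOW_FORCE_INLINE inline __attribute__((always_inline))
#endif

#ifdef __CUDACC__
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE __device__ __host__
#else
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE
#endif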

ptrendx (Member Author) commented Mar 21, 2020

@haojin2 I'm hitting the OoM on the Windows build again, like you did before. I started looking at the numpy versions of those functions and I see that you have many more templates there (some of which are actually not compiled on Windows), and I started thinking that we should probably just switch to runtime compilation of those kernels - there are simply too many variants here. What do you think about this (also @eric-haibin-lin @szha @leezu for comments)?

Also - I don't see elementwise ops in the numpy python package, just broadcast ops. This is pretty bad, because the knowledge that the shapes are the same is quite important for optimizations - for example, pointwise fusion would not really work for such operators.
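
For reference, runtime compilation sidesteps the template explosion: instead of instantiating every dtype/op combination ahead of time, the kernel source is assembled as a string, compiled with NVRTC only for the combinations actually used, and launched through the CUDA driver API. A minimal self-contained sketch of that flow (illustrative only, not MXNet's RTC machinery; error handling omitted):

#include <cuda.h>
#include <nvrtc.h>
#include <vector>

const char* kSource = R"(
extern "C" __global__ void scale(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
})";

int main() {
  // 1. Compile the CUDA source string to PTX at runtime.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX and launch the kernel via the driver API.
  cuInit(0);
  CUdevice dev;  cuDeviceGet(&dev, 0);
  CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
  CUmodule mod;  cuModuleLoadData(&mod, ptx.data());
  CUfunction fn; cuModuleGetFunction(&fn, mod, "scale");

  int n = 1024;
  CUdeviceptr data;
  cuMemAlloc(&data, n * sizeof(float));
  float factor = 2.0f;
  void* args[] = {&data, &factor, &n};
  cuLaunchKernel(fn, (n + 255) / 256, 1, 1,  // grid
                 256, 1, 1,                  // block
                 0, nullptr, args, nullptr);
  cuCtxSynchronize();

  cuMemFree(data);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}

The trade-off is that the first call to each new kernel variant pays a one-time compilation cost, which is why runtime-compiled kernels are typically cached.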

leezu (Contributor) commented Mar 21, 2020

For the OoM, could updating the Windows toolchain help (#17808)? cc @vexilligera @josephevans

leezu (Contributor) commented Apr 17, 2020

@mxnet-bot run ci [windows-cpu]

mxnet-bot

Jenkins CI successfully triggered: [windows-cpu]

DickJC123 (Contributor) left a comment

One thing I like to see, in the spirit of test-driven development, is a first commit that includes a test demonstrating the problem being fixed, where possible. Since that was not done here, I verified that both sub-tests of test_elementwise_ops_on_misaligned_input cause that test to fail on a GPU (i.e. as invoked via test_operator_gpu.py) on a checkout that doesn't include this PR. The other two supplied tests do not fail on this prior checkout, as they were designed to validate the broadcast vectorization that will be part of the follow-on PR. All three tests pass when run on the CPU, since this is a GPU issue being addressed here, but it is good to keep the test coverage consistent.

LGTM.

DickJC123 merged commit 5542d03 into apache:master on Apr 17, 2020
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Apr 18, 2020
* Vectorized loads for binary elemwise kernel

* More generalization

* Add backwardusenone

* Remove the unused _backward_add op

* Add vectorized backwardusein

* Extending vectorization to more binary ops, binary ops with scalar and unary ops

* Handling ElementwiseSum

* Get rid of half2 in mshadow

* Remove backward_elemwiseaddex

* Revert "Remove the unused _backward_add op"

This reverts commit f86da86.

* Revert "Remove backward_elemwiseaddex"

This reverts commit 7729114.

* Add back the backward_add since C++ test relies on it

* Test bcast implementations

* First version of vectorized bcast

* Adding single side vectorized bcast kernel

* Removing debug prints

* Actually run the single side kernel

* Move the default implementation of bcast to the vectorized one

* Limit the new implementation to GPU only

* Enabling vectorization when broadcast does not actually do broadcast

* Cleaning

* Cleaning part 2

* Fix for numpy ops using stuff from broadcast

* Fix

* Fix lint

* Try to debug pinv numpy test

* Fix

* Fix the vectorized broadcast implementation for misaligned input pointers

* Added tests

* Added docs to cuda_vectorization.cuh

* Another fix for broadcast and fix INT64 compilation

* Optimize for aligned=true

* 1 more addition to test

* Reverting the change to Numpy op test

* Trying mcmodel=medium to fix the failure in CMake static build

* Revert "Trying mcmodel=medium to fix the failure in CMake static build"

This reverts commit 1af684c.

* Limiting the PR to just elementwise ops
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Apr 18, 2020
ptrendx added a commit that referenced this pull request Apr 18, 2020
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Apr 20, 2020
ptrendx added a commit that referenced this pull request Apr 21, 2020
rondogency added a commit to rondogency/incubator-mxnet that referenced this pull request May 12, 2020
leezu pushed a commit that referenced this pull request May 13, 2020
* Revert "Fix and optimize handling of vectorized memory accesses (#17767)"

This reverts commit 5542d03.

* add license to reverted file
rondogency added a commit to rondogency/incubator-mxnet that referenced this pull request May 14, 2020
rondogency added a commit to rondogency/incubator-mxnet that referenced this pull request May 14, 2020
ptrendx pushed a commit that referenced this pull request May 27, 2020
eric-haibin-lin pushed a commit that referenced this pull request May 29, 2020
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Jun 25, 2020
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Jun 25, 2020
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020
szha pushed a commit that referenced this pull request Aug 20, 2020
* Reapplying PR #17767

* Making RTC required

* Move cuda utils to src/common/cuda and refactor RTC part

* Unary ops via RTC

* Support binary_scalar forward

Remove elemwise_scatter_op.*

Fix BinaryScalar usage in NumPy

* Backward of binary scalar

* Binary forward

* Fix for binary_scalar

* Moving all binary forward to RTC

Reorganization

* Backward of binary ops

* Support broadcast

Add RTC to NumPy ops

* RTC for elementwise sum

Fixes

* RTC for backward usenone of broadcast

* RTC for broadcast bwd usein

* Remove non-RTC vectorization support

* Remove template from ReduceWorkspaceSize

* Fixes from rebase

* Guarding RTC usage behind MXNET_USE_CUDA

* More guards

* C++17 for CUDA code

* MixedUnaryBackwardInOut as RTC

* Removing unused variable

* Revert "C++17 for CUDA code"

This reverts commit b09090c.

* Get rid of CI tests without RTC
Get rid of if constexpr as CUDA 10 does not support it

* Fix lint

* Change a few more elemwise functions
Fix for too long value

* Fix large tensor build

* Another try with DBL_MAX

* Fix Windows compilation

* Fix the large int test

* Add the printing of error code value to CUDA_DRIVER_CALL

* Fix

* Fix binary scalar

* Get more information when cuLaunchKernel fails

* Going easy on Windows compiler

* Fix lint

* Reorganization to split strings due to Windows compilation problems

* Fix error with uninitialized value

* Fix handling of different types for backward of binary scalar

* Decreasing RTC overhead

* Fix lint and remove rest of mentions of ENABLE_RTC

* Jetson with RTC

* Fix the aws s3 command

* Debugging Windows failure

* More debugging of Windows failure

* Debug

* Fix the issue on Windows (long -> long long for 8B)

* libcuda.so for Jetson

* Enable debug information for RTC kernels and cleaning debug ptx dump

* Fix lint

* Try without linking the stub of libcuda.so to different place in Jetson

* Add docstring

* Answering review comments

* Unifying vectorization

* Fix

* Fixes for reduce ops

* Fix M=1 case

* Fixes from rebase
Fixes for mixed type gradient functions
Set the launch bounds on RTC kernels

* Fix

* Fix tests

* Adding tutorial for RTC

* Fixes after merge

* Fixes from review

* Change env var doc and undo the change to toctree