Fix and optimize handling of vectorized memory accesses #17767
Conversation
MSHADOW_XINLINE vectorized_storage() {}
MSHADOW_XINLINE ~vectorized_storage() {}
Tangential question related to MSHADOW_XINLINE use in this file: do we still need inline __attribute__((always_inline)), or is nvcc smart enough to inline? If so, can we stop using the mshadow macros and just use __device__ __host__?
MSHADOW_XINLINE is useful because it is empty if you are not running with NVCC (and GCC does not understand __host__ __device__).
True. MSHADOW_XINLINE further sets __attribute__((always_inline)). Do we need that for nvcc? With respect to __host__ __device__, this file is wrapped in MXNET_USE_CUDA && __CUDACC__, so gcc won't see it.
Frankly, I'm not sure. Generally speaking, inline should be enough, but it is only a hint and the compiler is allowed not to inline for whatever reason.
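For reference, this is roughly how mshadow defines the macro (paraphrased from mshadow/base.h; the exact guards may differ between versions). Note that on a non-NVCC compile it is not literally empty: it still forces inlining, just without the CUDA qualifiers.

```cpp
// Paraphrase of the mshadow/base.h definition (exact guards may differ
// between versions).
#ifdef _MSC_VER
  #define MSHADOW_FORCE_INLINE __forceinline
#else
  #define MSHADOW_FORCE_INLINE inline __attribute__((always_inline))
#endif

#ifdef __CUDACC__
  // nvcc: callable from both host and device code.
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE __device__ __host__
#else
  // gcc/clang/MSVC never see __device__/__host__.
  #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE
#endif
```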
@haojin2 I'm hitting the OoM on the Windows build again, like you did before. I started looking at the numpy versions of those functions: you have way more templates there (some of which are actually not compiled on Windows), and I started thinking that we should probably just switch to runtime compilation of those kernels - there are just too many variants here. What do you think about this (also @eric-haibin-lin @szha @leezu for comments)? Also, I don't see elementwise ops in the numpy python package, just broadcast ops. This is pretty bad, because the knowledge that the shapes are the same is important for optimizations - for example, pointwise fusion would not really work for such operators.
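For context on the runtime-compilation idea floated above, here is a minimal NVRTC sketch (not MXNet's implementation; the helper name and architecture option are assumptions, and error handling is omitted for brevity). The point is that only the kernel variants actually requested at runtime pay a compilation cost, instead of the compiler instantiating every template combination at build time:

```cpp
// Compile a kernel from a source string at runtime with NVRTC, then
// load the resulting PTX through the CUDA driver API.
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

CUfunction CompileKernel(const std::string& src, const char* name) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src.c_str(), "op.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};  // assumed target
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);
  CUmodule module;
  CUfunction kernel;
  cuModuleLoadData(&module, ptx.data());
  cuModuleGetFunction(&kernel, module, name);
  return kernel;  // launch later with cuLaunchKernel; cache by (op, dtype)
}
```

In practice compiled kernels would be cached keyed on the operator and dtypes, so each variant is compiled at most once per process.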
For the OoM, could updating the Windows toolchain help (#17808)? cc @vexilligera @josephevans
@mxnet-bot run ci [windows-cpu]
Jenkins CI successfully triggered: [windows-cpu]
One thing I like to see, in the spirit of test-driven development, is a first commit that includes a test demonstrating the problem being fixed, if possible. Since this was not done, I verified that both sub-tests of test_elementwise_ops_on_misaligned_input cause that test to fail on a GPU (i.e. as invoked via test_operator_gpu.py) on a checkout that doesn't include this PR. The other two supplied tests do not fail on this prior checkout, as they were designed to validate the broadcast vectorization that will be part of the follow-on PR. All three tests pass when run on the CPU, since this is a GPU issue being addressed here, but it is good to keep the test coverage consistent.
LGTM.
* Vectorized loads for binary elemwise kernel
* More generalization
* Add backwardusenone
* Remove the unused _backward_add op
* Add vectorized backwardusein
* Extending vectorization to more binary ops, binary ops with scalar and unary ops
* Handling ElementwiseSum
* Get rid of half2 in mshadow
* Remove backward_elemwiseaddex
* Revert "Remove the unused _backward_add op" This reverts commit f86da86.
* Revert "Remove backward_elemwiseaddex" This reverts commit 7729114.
* Add back the backward_add since C++ test relies on it
* Test bcast implementations
* First version of vectorized bcast
* Adding single side vectorized bcast kernel
* Removing debug prints
* Actually run the single side kernel
* Move the default implementation of bcast to the vectorized one
* Limit the new implementation to GPU only
* Enabling vectorization when broadcast does not actually do broadcast
* Cleaning
* Cleaning part 2
* Fix for numpy ops using stuff from broadcast
* Fix
* Fix lint
* Try to debug pinv numpy test
* Fix
* Fix the vectorized broadcast implementation for misaligned input pointers
* Added tests
* Added docs to cuda_vectorization.cuh
* Another fix for broadcast and fix INT64 compilation
* Optimize for aligned=true
* 1 more addition to test
* Reverting the change to Numpy op test
* Trying mcmodel=medium to fix the failure in CMake static build
* Revert "Trying mcmodel=medium to fix the failure in CMake static build" This reverts commit 1af684c.
* Limiting the PR to just elementwise ops
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Vectorized loads for binary elemwise kernel * More generalization * Add backwardusenone * Remove the unused _backward_add op * Add vectorized backwardusein * Extending vectorization to more binary ops, binary ops with scalar and unary ops * Handling ElementwiseSum * Get rid of half2 in mshadow * Remove backward_elemwiseaddex * Revert "Remove the unused _backward_add op" This reverts commit f86da86. * Revert "Remove backward_elemwiseaddex" This reverts commit 7729114. * Add back the backward_add since C++ test relies on it * Test bcast implementations * First version of vecotrized bcast * Adding single side vectorized bcast kernel * Removing debug prints * Actually run the single side kernel * Move the default implementation of bcast to the vectorized one * Limit the new implementation to GPU only * Enabling vectorization when broadcast does not actually do broadcast * Cleaning * Cleaning part 2 * Fix for numpy ops using stuff from broadcast * Fix * Fix lint * Try to debug pinv numpy test * Fix * Fix the vectorized broadcast implementation for misaligned input pointers * Added tests * Added docs to cuda_vectorization.cuh * Another fix for broadcast and fix INT64 compilation * Optimize for aligned=true * 1 more addition to test * Reverting the change to Numpy op test * Trying mcmodel=medium to fix the failure in CMake static build * Revert "Trying mcmodel=medium to fix the failure in CMake static build" This reverts commit 1af684c. * Limiting the PR to just elementwise ops
* Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
…apache#18309) * Revert "Fix and optimize handling of vectorized memory accesses (apache#17767)" This reverts commit 5542d03. * add license to reverted file
* Reapplying PR #17767
* Making RTC required
* Move cuda utils to src/common/cuda and refactor RTC part
* Unary ops via RTC
* Support binary_scalar forward Remove elemwise_scatter_op.* Fix BinaryScalar usage in NumPy
* Backward of binary scalar
* Binary forward
* Fix for binary_scalar
* Moving all binary forward to RTC Reorganization
* Backward of binary ops
* Support broadcast Add RTC to NumPy ops
* RTC for elementwise sum Fixes
* RTC for backward usenone of broadcast
* RTC for broadcast bwd usein
* Remove non-RTC vectorization support
* Remove template from ReduceWorkspaceSize
* Fixes from rebase
* Guarding RTC usage behind MXNET_USE_CUDA
* More guards
* C++17 for CUDA code
* MixedUnaryBackwardInOut as RTC
* Removing unused variable
* Revert "C++17 for CUDA code" This reverts commit b09090c.
* Get rid of CI tests without RTC Get rid of if constexpr as CUDA 10 does not support it
* Fix lint
* Change a few more elemwise functions Fix for too long value
* Fix large tensor build
* Another try with DBL_MAX
* Fix Windows compilation
* Fix the large int test
* Add the printing of error code value to CUDA_DRIVER_CALL
* Fix
* Fix binary scalar
* Get more information when cuLaunchKernel fails
* Going easy on Windows compiler
* Fix lint
* Reorganization to split strings due to Windows compilation problems
* Fix error with uninitialized value
* Fix handling of different types for backward of binary scalar
* Decreasing RTC overhead
* Fix lint and remove rest of mentions of ENABLE_RTC
* Jetson with RTC
* Fix the aws s3 command
* Debugging Windows failure
* More debugging of Windows failure
* Debug
* Fix the issue on Windows (long -> long long for 8B)
* libcuda.so for Jetson
* Enable debug information for RTC kernels and cleaning debug ptx dump
* Fix lint
* Try without linking the stub of libcuda.so to different place in Jetson
* Add docstring
* Answering review comments
* Unifying vectorization
* Fix
* Fixes for reduce ops
* Fix M=1 case
* Fixes from rebase Fixes for mixed type gradient functions Set the launch bounds on RTC kernels
* Fix
* Fix tests
* Adding tutorial for RTC
* Fixes after merge
* Fixes from review
* Change env var doc and undo the change to toctree
Description
For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads, as this ensures the bandwidth is fully utilized.
Currently, MXNet uses vectorized loads and stores only for the half_t type, and only in a few operators (some elementwise binary operators and elementwise_sum). Unfortunately, the way it was done makes assumptions about MXNet's NDArrays which do not hold true in all cases: running an elementwise op on misaligned slices of NDArrays results in silent data corruption, where the last element of the input a is overwritten even though it should not have been changed.
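To illustrate the failure mode, here is a sketch of the naive vectorized pattern (illustrative only, not the kernel from this PR). A slice like a[1:] offsets an fp16 data pointer by 2 bytes, so it is no longer aligned to the 4-byte boundary a half2 access requires; the vector load is then undefined behavior, and the paired store can clobber memory just past the logical end of the output, matching the corrupted-last-element symptom:

```cpp
#include <cuda_fp16.h>

// Naive half2 elementwise add: each thread handles one pair of elements.
__global__ void add_half2_naive(const __half* a, const __half* b,
                                __half* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (2 * i + 1 < n) {
    // Unsafe unless a, b and out are all 4-byte aligned and n is even.
    __half2 va = reinterpret_cast<const __half2*>(a)[i];
    __half2 vb = reinterpret_cast<const __half2*>(b)[i];
    reinterpret_cast<__half2*>(out)[i] = __hadd2(va, vb);
  }
}
```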
It was not noticed before because a + b on NDArrays launches broadcast_add instead of elemwise_add (and broadcast_add is not vectorized), whereas in symbolic execution slices produce new allocations, which do not exhibit those issues.

This PR:
* fixes the handling of vectorized memory accesses so that misaligned inputs are processed correctly
* extends vectorization to more elementwise operators and to all data types (not just half_t)

@eric-haibin-lin @sxjscience @haojin2
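Since the fix keys on alignment, here is a hedged sketch of the dispatch idea (illustrative only; the PR's actual helpers live in cuda_vectorization.cuh and also handle leading/trailing elements inside the kernel). IsAligned, add_half_scalar and LaunchAdd are hypothetical names introduced for this sketch; add_half2_naive is the kernel sketched above:

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Scalar fallback: one element per thread, no alignment requirements.
__global__ void add_half_scalar(const __half* a, const __half* b,
                                __half* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __hadd(a[i], b[i]);
}

inline bool IsAligned(const void* ptr, size_t alignment) {
  return reinterpret_cast<uintptr_t>(ptr) % alignment == 0;
}

void LaunchAdd(const __half* a, const __half* b, __half* out, int n,
               cudaStream_t stream) {
  const int threads = 256;
  // Take the vectorized path only when every pointer is aligned to the
  // vector width and no partial pair remains at the end.
  const bool vec_ok = IsAligned(a, sizeof(__half2)) &&
                      IsAligned(b, sizeof(__half2)) &&
                      IsAligned(out, sizeof(__half2)) && n % 2 == 0;
  if (vec_ok) {
    const int pairs = n / 2;
    add_half2_naive<<<(pairs + threads - 1) / threads, threads, 0, stream>>>(
        a, b, out, n);
  } else {
    add_half_scalar<<<(n + threads - 1) / threads, threads, 0, stream>>>(
        a, b, out, n);
  }
}
```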