Conversation
@mxnet-label-bot add [Operator, pr-awaiting-review]
src/operator/nn/softmax-inl.h (Outdated)
```cpp
const int softmax_threads_per_block = 512;

template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
```
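For context, the body of such a warp reduction is typically built on CUDA shuffle intrinsics. A minimal sketch of what it can look like (illustrative only, assuming all 32 lanes of the warp participate; not necessarily this PR's exact implementation):

```cpp
// Illustrative warp-wide reduction using shuffle intrinsics.
// Each step halves the distance between lanes; lane 0 ends up
// holding the reduction of all 32 values.
template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
  for (int offset = 16; offset > 0; offset /= 2) {
    value = redfun(value, __shfl_down_sync(0xffffffff, value, offset));
  }
  return value;
}
```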
This looks like a generic function that can be used elsewhere. Is there a better place to put this function?
Yup, will look into putting it in some better place.
It looks like common CUDA utilities are put here: https://github.com/apache/incubator-mxnet/blob/master/src/common/cuda_utils.h
In fact, there are other common functions spread across other files, e.g., the wrappers of the warp-level primitives in https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/layer_norm.cu#L32-L50
src/operator/nn/softmax-inl.h (Outdated)
```cpp
// the division by zero warning generated for such invalid cases.
const int row_length = entries_per_load > 0 ? M / entries_per_load : 0;

const LType * in_aligned = reinterpret_cast<const LType *>(in);
```
nit: `LType*` instead of `LType *`, as per the Google Style Guide. Same for all other pointer declarations.
Do we need to check the alignment? Since CUDA uses force-alignment, it will potentially raise an error if the address of `in` is not aligned with `LType`. For example, `DType` can be float32 and `LType` can be double.
@sxjscience That is why the code that launches this kernel chooses `LType` based on the array dimensions: if the leading dimension is odd, it will not choose an `LType` larger than `DType`.
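For illustration, such a launcher-side selection could look like the sketch below. The type-flag constants are mshadow's, but the selection logic here is assumed from the discussion, not copied from the PR:

```cpp
#include <mshadow/base.h>

// Hypothetical sketch: pick the widest load type whose size divides the
// row size in bytes, so that reinterpret_cast<LType*>(in) stays aligned.
// (A pointer returned by cudaMalloc is already aligned to >= 256 bytes,
// so only the per-row size matters here.)
inline int get_load_type(size_t row_size_bytes) {
  if (row_size_bytes % 8 == 0) return mshadow::kFloat64;  // 8-byte loads
  if (row_size_bytes % 4 == 0) return mshadow::kFloat32;  // 4-byte loads
  if (row_size_bytes % 2 == 0) return mshadow::kFloat16;  // 2-byte loads
  return mshadow::kUint8;                                 // scalar bytes
}
```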
I'm not sure about the answer to this question, but my concern is that we need to make sure the following holds: `ASSERT(static_cast<size_t>(in) % sizeof(LType) == 0)`, mainly due to the force-alignment constraint in CUDA (https://stackoverflow.com/questions/37323053/misaligned-address-in-cuda).
A pointer given by `cudaMalloc` is guaranteed to be aligned to something like 256B or more.
Thanks for correcting me. Another question: would it be possible to handle the cases when the `N` in `get_load_type(size_t N)` cannot be divided by 8? Could we load the majority of the elements using the vectorizing trick and just handle the remainders? (This is just a question; there is no need to address it in this PR, because I think it looks great.)
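For reference, the remainder handling asked about above usually takes the shape of a vectorized main loop plus a scalar tail. Below is a self-contained illustrative copy kernel in that style (not code from this PR):

```cpp
// Illustrative kernel: move most elements through a wide LType, then
// finish the N % entries_per_load leftovers one DType at a time.
template <typename DType, typename LType>
__global__ void vectorized_copy(const DType* in, DType* out, size_t N) {
  static_assert(sizeof(LType) % sizeof(DType) == 0,
                "LType must be a whole multiple of DType");
  constexpr size_t entries_per_load = sizeof(LType) / sizeof(DType);
  const size_t vec_len = N / entries_per_load;
  const LType* in_vec = reinterpret_cast<const LType*>(in);
  LType* out_vec = reinterpret_cast<LType*>(out);
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = gridDim.x * blockDim.x;
  // Main body: each iteration moves entries_per_load elements at once.
  for (size_t i = tid; i < vec_len; i += stride) {
    out_vec[i] = in_vec[i];
  }
  // Scalar tail: the remaining N % entries_per_load elements.
  for (size_t i = vec_len * entries_per_load + tid; i < N; i += stride) {
    out[i] = in[i];
  }
}
```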
@ptrendx Do you have any data on how much of a performance boost this change brings on applicable example workloads?
Great! I'm considering using this kind of warp-level primitive + vectorized load to accelerate our reduce function: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh
@haojin2 For the perf improvement on end-to-end training, I did not measure yet; I measured just the kernel speedup for now. BTW, how important do you think the safe accumulation option is (as in, having an option to NOT do it, instead of just always doing safe accumulation)? Personally I don't see value in not using it in softmax, as it would most probably affect accuracy, and having 3 TYPE_SWITCHes makes the compilation time quite long. You did not put the ability to skip it in
@haojin2 I'm not sure how to make progress with this PR. It seems that the Windows CI instances do not have enough RAM to process all the templates here. The problem exists even with CPU compilation, where I just fixed your omission of the MXNET_SAFE_ACCUMULATION=0 case in softmax with length.
@sxjscience Totally agree. This would provide a lot of benefit across the framework (for example, the layernorm op). @ptrendx
@marcoabreu Could you give some advice on that Windows CI problem? Are the Windows builder instances much different from the Unix ones?
Yeah, they are. But the heap-space error is the same old problem: we just have too many macros, I think. Basically, we can't add any more operators because the file has grown too large. Try compiling locally on Unix with optimizations disabled and intermediary output enabled; you will see some intermediary files that are multiple gigabytes in size.
@haojin2 had the same problem btw.
But the Windows instances are c5.18xlarge; they have plenty of RAM. We're literally running into the limitations of the compilers.
Would it help if we split the forward and backward passes into different files? Those limitations are per file, right?
I'd rather split operators into different files; the forward and backward passes kinda belong together, right? This could also lay the base for dynamic loading of operators, where we could ship operators selectively.
Thanks for the info, Marco; appreciate you helping with this. I remember the MSVC team was very slow in moving their compiler process to 64-bit. To be clear, they supported compiling 64-bit programs very early on, but the compiler process itself didn't need much RAM, so it stayed 32-bit. Apparently they now do have a 64-bit compiler process, but 32-bit is still the default. Is there any way we could check one of the Windows hosts and see whether the compiler process they're using (CL.exe) is running in 32-bit mode in the task manager? Are any of the compilation processes actually using more than 4GB of RAM?
I remember that I checked it on Unix and saw high RAM usage there. I'm currently not able to get data from Windows, but operator compilation has always been a bottleneck.
There really is a pattern: these kinds of errors always come up when people try to add new operators.
(or newer)
OK, it seems that splitting softmax.cc into 3 files, 1 for each operator (softmax, softmin, and log_softmax), fortunately did the trick.
```cpp
// By default temperature is 1.0.
// Adding a branch here to save the CPU 'divide-by-1' computation at runtime
DType final_result;
if (temperature == 1.0) {
```
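An illustrative completion of that branch (operand names such as `row_max` are placeholders; the PR's actual expression may differ):

```cpp
// Skip the division entirely in the common temperature == 1.0 case.
if (temperature == 1.0) {
  final_result = std::exp(in[i] - row_max);
} else {
  final_result = std::exp((in[i] - row_max) / temperature);
}
```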
Is this micro-opt really making things better?
Yes, I did a performance comparison earlier. This check speeds up the operator by 30%.
I did not touch the CPU code besides merging it from 2 functions into 1, so as not to introduce slowdowns in that path.
```cpp
@@ -301,7 +282,7 @@ __global__ void softmax_compute_kernel(DType *in, OType *out, index_t M, int axi
  red::sum::SetInitValue(smem[x]);
```
Does having multiple max values affect numerical accuracy? Or are they reduced at some other point to a final max?
Let me rephrase: when running Reduce1D, are all the max values reduced into `smem[0]` / `smax`? As I understand it, `x_max` should be the max of `{x_i}`; it actually matters for numerical issues.
Yes, those intermediate max values are then reduced across threads to the final maximum.
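In sketch form, the two-phase pattern being confirmed here (per-thread running max, then a cross-thread reduce; the `Reduce1D` signature below is a guess):

```cpp
// Phase 1: each thread keeps a running maximum over its strided slice.
DType my_max_value;
red::maximum::SetInitValue(my_max_value);
for (index_t i = threadIdx.x; i < M; i += blockDim.x) {
  my_max_value = in[i] > my_max_value ? in[i] : my_max_value;
}
// Phase 2: reduce the per-thread maxima into the single row maximum, so
// every thread subtracts the same smax before exponentiation; this keeps
// the result numerically equivalent to a single sequential max.
DType smax = Reduce1D<red::maximum>(my_max_value);  // hypothetical signature
```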
```cpp
      in, out, M, axis, sshape, stride, temperature);
  MSHADOW_CUDA_POST_KERNEL_CHECK(softmax_compute_kernel);
}

DType my_max_value;
```
Can we add a comment or maybe a more descriptive name? Is this the max of the stride?
What would you suggest? This is the maximum value that this thread sees.
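(For illustration, a comment along these lines would address the request:)

```cpp
DType my_max_value;  // running maximum of the values this thread has seen
```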
Thanks for refactoring the Softmax functions to merge them into one.
LGTM
Description
This PR optimizes the Softmax implementation for cases where the stride is 1 and the leading dimension is small (up to 20 kB of data in that dimension).
There are 2 optimizations in this kernel compared to the previous one:
- vectorized loads and stores through a wider load type (`LType`) chosen based on the size of the leading dimension, and
- warp-level reductions using shuffle primitives in place of purely shared-memory reductions.
Compared to the previous implementation, the new kernel is up to 4x faster on fp16 I/O.
@eric-haibin-lin