Conversation
@mxnet-label-bot add [Operator, pr-awaiting-review]
src/operator/nn/softmax-inl.h (Outdated)
```cpp
const int softmax_threads_per_block = 512;

template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
```
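For context, the body of such a warp reduction is typically built on CUDA shuffle intrinsics. A minimal sketch of what it can look like (illustrative only, assuming all 32 lanes of the warp participate; not necessarily this PR's exact implementation):

```cpp
// Illustrative warp-wide reduction using shuffle intrinsics.
// Each step halves the distance between lanes; lane 0 ends up
// holding the reduction of all 32 values.
template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
  for (int offset = 16; offset > 0; offset /= 2) {
    value = redfun(value, __shfl_down_sync(0xffffffff, value, offset));
  }
  return value;
}
```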
This looks like a generic function that can be used elsewhere. Is there a better place to put this function?
Yup, will look into putting it in some better place.
It looks like common CUDA utilities are put here: https://github.com/apache/incubator-mxnet/blob/master/src/common/cuda_utils.h
In fact, there are other common functions spread across other files, e.g., the wrappers of the warp-level primitives in https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/layer_norm.cu#L32-L50
src/operator/nn/softmax-inl.h (Outdated)
```cpp
// the division by zero warning generated for such invalid cases.
const int row_length = entries_per_load > 0 ? M / entries_per_load : 0;

const LType * in_aligned = reinterpret_cast<const LType *>(in);
```
nit: `LType*` instead of `LType *`, as per the Google Style Guide. Same for all other pointer declarations.
Do we need to check the alignment? Since CUDA uses force-alignment, it will potentially raise an error if the address of `in` is not aligned with `LType`. For example, `DType` can be float32 and `LType` can be double.
@sxjscience That is why the code that launches this kernel chooses `LType` based on the array dimensions: if the leading dimension is odd, it will not choose an `LType` larger than `DType`.
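For illustration, such a launcher-side selection could look like the sketch below. The type-flag constants are mshadow's, but the selection logic here is assumed from the discussion, not copied from the PR:

```cpp
#include <mshadow/base.h>

// Hypothetical sketch: pick the widest load type whose size divides the
// row size in bytes, so that reinterpret_cast<LType*>(in) stays aligned.
// (A pointer returned by cudaMalloc is already aligned to >= 256 bytes,
// so only the per-row size matters here.)
inline int get_load_type(size_t row_size_bytes) {
  if (row_size_bytes % 8 == 0) return mshadow::kFloat64;  // 8-byte loads
  if (row_size_bytes % 4 == 0) return mshadow::kFloat32;  // 4-byte loads
  if (row_size_bytes % 2 == 0) return mshadow::kFloat16;  // 2-byte loads
  return mshadow::kUint8;                                 // scalar bytes
}
```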
I'm not sure about the answer to this question, but my concern is that we need to make sure the following holds: `ASSERT(static_cast<size_t>(in) % sizeof(LType) == 0)`, mainly due to the force-alignment constraint in CUDA (https://stackoverflow.com/questions/37323053/misaligned-address-in-cuda).
A pointer given by `cudaMalloc` is guaranteed to be aligned to something like 256B or more.
Thanks for correcting me. Another question: would it be possible to handle the cases when the `N` in `get_load_type(size_t N)` cannot be divided by 8? Could we load the majority of the elements using the vectorizing trick and just handle the remainders? (This is just a question; there is no need to address it in this PR, because I think it looks great.)
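For reference, the remainder handling asked about above usually takes the shape of a vectorized main loop plus a scalar tail. Below is a self-contained illustrative copy kernel in that style (not code from this PR):

```cpp
// Illustrative kernel: move most elements through a wide LType, then
// finish the N % entries_per_load leftovers one DType at a time.
template <typename DType, typename LType>
__global__ void vectorized_copy(const DType* in, DType* out, size_t N) {
  static_assert(sizeof(LType) % sizeof(DType) == 0,
                "LType must be a whole multiple of DType");
  constexpr size_t entries_per_load = sizeof(LType) / sizeof(DType);
  const size_t vec_len = N / entries_per_load;
  const LType* in_vec = reinterpret_cast<const LType*>(in);
  LType* out_vec = reinterpret_cast<LType*>(out);
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = gridDim.x * blockDim.x;
  // Main body: each iteration moves entries_per_load elements at once.
  for (size_t i = tid; i < vec_len; i += stride) {
    out_vec[i] = in_vec[i];
  }
  // Scalar tail: the remaining N % entries_per_load elements.
  for (size_t i = vec_len * entries_per_load + tid; i < N; i += stride) {
    out[i] = in[i];
  }
}
```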
@ptrendx Do you have any data on how much of a performance boost this change brings on applicable example workloads?
Great! I'm considering using this kind of warp-level primitive + vectorized load to accelerate our reduce function: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh
@haojin2 For the perf improvement on end-to-end training, I did not measure yet; I measured just the kernel speedup for now. BTW, how important do you think the safe accumulation option is (as in, having an option to NOT do it, instead of just always doing safe accumulation)? Personally I don't see value in not using it in softmax, as it would most probably affect accuracy, and having 3 TYPE_SWITCHes makes the compilation time quite long. You did not put the ability to skip it in
@haojin2 I'm not sure how to make progress with this PR. It seems that the Windows CI instances do not have enough RAM to process all the templates here. The problem exists even with CPU compilation, where I just fixed your omission of the MXNET_SAFE_ACCUMULATION=0 case in softmax with length.
@sxjscience Totally agree. This would provide a lot of benefit across the framework (for example, the layernorm op). @ptrendx
@marcoabreu Could you give some advice on that Windows CI problem? Are the Windows builder instances much different from the Unix ones?
Yeah, they are. But the heap-space error is the same old problem: we just have too many macros, I think. Basically, we can't add any more operators because the file has grown too large. Try compiling locally on Unix with optimizations disabled and intermediary output enabled; you will see some intermediary files that are multiple gigabytes in size.
@haojin2 had the same problem btw.
But the Windows instances are c5.18xlarge; they have plenty of RAM. We're literally running into the limitations of the compilers.
Would it help if we split the forward and backward passes into different files? Those limitations are per file, right?
I'd rather split operators into different files; the forward and backward passes kinda belong together, right? This could also lay the base for dynamic loading of operators, where we could ship operators selectively.
Thanks for the info, Marco; appreciate you helping with this. I remember the MSVC team was very slow in moving their compiler process to 64-bit. To be clear, they supported compiling 64-bit programs very early on, but the compiler process itself didn't need much RAM, so it stayed 32-bit. Apparently they now do have a 64-bit compiler process, but 32-bit is still the default. Is there any way we could check one of the Windows hosts and see whether the compiler process they're using (CL.exe) is running in 32-bit mode in the task manager? Are any of the compilation processes actually using more than 4GB of RAM?
I remember that I checked it on Unix and saw high RAM usage there. I'm currently not able to get data from Windows, but operator compilation has always been a bottleneck.
There really is a pattern: these kinds of errors always come up when people try to add new operators.
(or newer)
OK, it seems that splitting softmax.cc into 3 files, 1 for each operator (softmax, softmin, and log_softmax), fortunately did the trick.
```cpp
// By default temperature is 1.0.
// Adding a branch here to save the CPU 'divide-by-1' computation at runtime
DType final_result;
if (temperature == 1.0) {
```
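An illustrative completion of that branch (operand names such as `row_max` are placeholders; the PR's actual expression may differ):

```cpp
// Skip the division entirely in the common temperature == 1.0 case.
if (temperature == 1.0) {
  final_result = std::exp(in[i] - row_max);
} else {
  final_result = std::exp((in[i] - row_max) / temperature);
}
```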
Is this micro-opt really making things better?
Yes, I did a performance comparison earlier. This check speeds up the operator by 30%.
I did not touch the CPU code besides merging it from 2 functions into 1, so as not to introduce slowdowns in that path.
```cpp
@@ -301,7 +282,7 @@ __global__ void softmax_compute_kernel(DType *in, OType *out, index_t M, int axi
  red::sum::SetInitValue(smem[x]);
```
Does having multiple max values affect numerical accuracy? Or are they reduced at some other point to a final max?
Let me rephrase: when running Reduce1D, are all the max values reduced into `smem[0]` / `smax`? As I understand it, `x_max` should be the max of `{x_i}`; it actually matters for numerical issues.
Yes, those intermediate max values are then reduced across threads to the final maximum.
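In sketch form, the two-phase pattern being confirmed here (per-thread running max, then a cross-thread reduce; the `Reduce1D` signature below is a guess):

```cpp
// Phase 1: each thread keeps a running maximum over its strided slice.
DType my_max_value;
red::maximum::SetInitValue(my_max_value);
for (index_t i = threadIdx.x; i < M; i += blockDim.x) {
  my_max_value = in[i] > my_max_value ? in[i] : my_max_value;
}
// Phase 2: reduce the per-thread maxima into the single row maximum, so
// every thread subtracts the same smax before exponentiation; this keeps
// the result numerically equivalent to a single sequential max.
DType smax = Reduce1D<red::maximum>(my_max_value);  // hypothetical signature
```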
```cpp
      in, out, M, axis, sshape, stride, temperature);
  MSHADOW_CUDA_POST_KERNEL_CHECK(softmax_compute_kernel);
}

DType my_max_value;
```
Can we add a comment or maybe a more descriptive name? Is this the max of the stride?
What would you suggest? This is the maximum value that this thread sees.
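(For illustration, a comment along these lines would address the request:)

```cpp
DType my_max_value;  // running maximum of the values this thread has seen
```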
Thanks for refactoring the Softmax functions to merge them into one.
LGTM
Description
This PR optimizes the Softmax implementation for cases where the stride is 1 and the leading dimension is small (up to 20 kB of data in that dimension).
There are 2 optimizations in this kernel compared to the previous one:
- vectorized loads and stores through a wider load type (`LType`) chosen based on the size of the leading dimension, and
- warp-level reductions using shuffle primitives in place of purely shared-memory reductions.
Compared to the previous implementation, the new kernel is up to 4x faster on fp16 I/O.
@eric-haibin-lin