Fix a computational problem of scaledSoftmax. #1096

Merged: 5 commits into NVIDIA:master on Mar 8, 2021
Conversation

@yuanzexi (Contributor) commented Mar 4, 2021

Problem: The original implementation produces a wrong softmax sum, so the outputs of BERT models (128 < seq_len < 384 and seq_len > 384) become very large or even NaN, especially in FP16 mode.
Solution: This change fixes the computation so that the results of BERT models (128 < seq_len < 384 and seq_len > 384) are correct.

Signed-off-by: yuanzexi <hiyuanzexi@outlook.com>
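
For context, a minimal host-side sketch of what the kernel is meant to compute per attention row (the function name scaledSoftmaxRef and the std::vector interface are illustrative, not part of this PR): a softmax over the first lastValid scores of a row of length ld, with the scores scaled by rsqrtHeadSize and stabilized by subtracting the row maximum. Subtracting the maximum keeps every exponent non-positive, which is exactly what prevents the overflow/NaN behaviour described above, especially in FP16:

#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Illustrative reference only; the plugin does this on the GPU with block reductions.
std::vector<float> scaledSoftmaxRef(const std::vector<float>& row, int ld, int lastValid, float rsqrtHeadSize)
{
    std::vector<float> out(ld, 0.0f);   // positions >= lastValid are masked and stay 0
    float fMax = -FLT_MAX;              // identity element of max
    for (int i = 0; i < lastValid; ++i)
    {
        fMax = std::max(fMax, row[i]);
    }
    float z = 0.0f;                     // identity element of sum
    for (int i = 0; i < lastValid; ++i)
    {
        z += std::exp((row[i] - fMax) * rsqrtHeadSize);   // exponent <= 0, so exp() cannot overflow
    }
    for (int i = 0; i < lastValid; ++i)
    {
        out[i] = std::exp((row[i] - fMax) * rsqrtHeadSize) / z;
    }
    return out;
}

The device kernel computes the same quantity with two cub block reductions (a max, then a sum), which is where the issues discussed below come in.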

cub::Sum sum;
float threadData(-FLT_MAX);

if (lastValid >= blockDim.x)
Collaborator

Could we keep this check here in case the kernel is used in a scenario where TPB > ld? Thanks!

Contributor Author

Yeah, you can keep it. Actually, this check confused me a lot. From my point of view, with this check the initial value of threadData is 0 in the scenario TPB <= lastValid but -FLT_MAX in the scenario TPB > lastValid. However, threadData will definitely be updated by threadData = max(static_cast<float>(input[idx]), threadData); in the next loop, so is this check even necessary for the threads in the block? Or could we directly initialize threadData to 0 instead of -FLT_MAX?

Collaborator

threadData = max(static_cast<float>(input[idx]), threadData) is only executed when i < lastValid, so if TPB > lastValid some threads never update it and would carry -FLT_MAX into the next block-reduced sum.
If we initialize threadData to 0 instead, threadData = max(static_cast<float>(input[idx]), threadData) gives the wrong maximum when the inputs are negative.

Contributor Author

From my point of view, maybe we only need to get the fMax here for the BlockReduce max? I'll set threadData to 0 before the BlockReduce sum.

Collaborator

I mean line 347
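
To make the point above concrete, here is a hedged sketch of the per-row reduction pattern after the fix, assuming the kernel shape visible in this file (TPB threads cooperate on one row starting at offset); rowSoftmaxSketch and its parameter list are illustrative, not the exact TensorRT source. The accumulator starts at -FLT_MAX (the identity of max) unless every thread is guaranteed to read at least one element, and it must be re-initialized to 0 (the identity of sum) before the second block reduction, otherwise idle threads would feed -FLT_MAX into the sum:

#include <cfloat>
#include <cub/cub.cuh>

template <typename T, unsigned TPB>
__device__ inline void rowSoftmaxSketch(
    const int ld, const int lastValid, const float w, const T* input, T* output, const int offset)
{
    using BlockReduce = cub::BlockReduce<float, TPB>;
    __shared__ typename BlockReduce::TempStorage tmpStorage;
    __shared__ float fMax;
    __shared__ float rZ;

    // -FLT_MAX is the identity of max; starting from 0 is only safe when every
    // thread reads at least one element (lastValid >= TPB).
    float threadData = (lastValid >= blockDim.x) ? 0.f : -FLT_MAX;
    for (int i = threadIdx.x; i < lastValid; i += TPB)
    {
        threadData = fmaxf(static_cast<float>(input[offset + i]), threadData);
    }
    const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, cub::Max());
    if (threadIdx.x == 0)
    {
        fMax = maxElem;
    }
    __syncthreads();

    // 0 is the identity of sum; without this reset, threads that saw no valid
    // element would carry -FLT_MAX into the sum and corrupt it.
    threadData = 0.f;
    for (int i = threadIdx.x; i < lastValid; i += TPB)
    {
        threadData += expf((static_cast<float>(input[offset + i]) - fMax) * w);
    }
    const float Z = BlockReduce(tmpStorage).Reduce(threadData, cub::Sum());
    if (threadIdx.x == 0)
    {
        rZ = 1.f / Z;
    }
    __syncthreads();

    for (int i = threadIdx.x; i < ld; i += TPB)
    {
        const float val = (i < lastValid) ? expf((static_cast<float>(input[offset + i]) - fMax) * w) * rZ : 0.f;
        output[offset + i] = T(val);
    }
}

In the merged change the same effect comes from keeping the existing lastValid >= blockDim.x guard and adding the threadData = 0 reset after __syncthreads(), as the diff context later in this thread shows.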

@ttyio (Collaborator) commented Mar 5, 2021

Adding @rajeevsrao for visibility.

Thanks @yuanzexi , good catch!

So you fixed 2 issues in this PR, right?

  • The fMax is wrong when TPB < ld and there are multiple blocks
  • The - fMax is missing in the final softmax calculation

Overall it looks good to me. Could you follow https://github.com/NVIDIA/TensorRT/blob/master/CONTRIBUTING.md to reformat the code? Thanks!

@yuanzexi (Contributor Author) commented Mar 5, 2021

> Adding @rajeevsrao for visibility.
>
> Thanks @yuanzexi, good catch!
>
> So you fixed 2 issues in this PR, right?
>
>   • The fMax is wrong when TPB < ld and there are multiple blocks
>   • The - fMax is missing in the final softmax calculation
>
> Overall it looks good to me. Could you follow https://github.com/NVIDIA/TensorRT/blob/master/CONTRIBUTING.md to reformat the code? Thanks!

Sorry about the code format, I'll format the code later.

}
__syncthreads();

threadData = 0;
Contributor Author

I'll set threadData to 0 here for all threads.

threadData = max(static_cast<float>(input[idx]), threadData);
}

const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, cub::Max());
Collaborator

Some threadData values are still -FLT_MAX when maxElem is computed here, in the case TPB > ld.

Contributor Author

Oh! I see~ Thanks for answering!

yuanzexi and others added 3 commits March 5, 2021 15:18
The original implementation produces a wrong softmax sum, so the results of BERT models (128 < seq_len < 384 and seq_len > 384) are very large or even NaN.
This change fixes the computation so that the results of BERT models (128 < seq_len < 384 and seq_len > 384) become correct.

Signed-off-by: yuanzexi <hiyuanzexi@outlook.com>
Signed-off-by: yuanzexi <percyyuan@tencent.com>
@yuanzexi (Contributor Author) commented Mar 5, 2021

I have formatted the code, and the check we discussed has been added. Looking forward to your code review.

@@ -35,7 +35,6 @@ __device__ inline T rsqrt(const T& x);
template <typename T>
__device__ inline T exp(const T& x);


Collaborator

Could you also revert the blank-line changes? That helps us with the integration between the public repo and the internal repo. Thanks!

Contributor Author

No problem! I have reverted them. Looking forward to your review.

Signed-off-by: yuanzexi <percyyuan@tencent.com>
@@ -326,7 +326,7 @@ template <typename T, unsigned TPB>
__device__ inline void scaledSoftmax(
const int ld, const int lastValid, const float rsqrtHeadSize, const T* input, T* output)
{

Collaborator

Could you also remove the blanks on this line? Thanks!

Contributor Author

removed.

@@ -343,10 +343,11 @@ __device__ inline void scaledSoftmax(
{
threadData = 0;
}

Collaborator

blank line

Contributor Author

removed.

Signed-off-by: yuanzexi <percyyuan@tencent.com>
@ttyio (Collaborator) commented Mar 5, 2021

LGTM, thanks @yuanzexi

Assigning to @rajeevsrao, thanks!

@rajeevsrao (Collaborator)

Thanks for the fix @yuanzexi - will test internally and integrate.

@rajeevsrao added the labels bug and Plugins (Issues when using TensorRT plugins) on Mar 5, 2021
@rajeevsrao merged commit 8c10371 into NVIDIA:master on Mar 8, 2021