
optimize softmax with cross entropy hard label #32290

Merged · 4 commits · May 21, 2021

Conversation

@xingfeng01 (Contributor) commented Apr 15, 2021

PR types

Performance optimization

PR changes

Ops

Describe

Optimization of op softmax_with_cross_entropy for the hard-label case. This PR includes:

  • hard label forward kernel: uses the same idea as the softmax implementation: SoftmaxWithCrossEntropyHardLabel
  • hard label backward kernel: merges multiple kernels into one: SoftmaxWithCrossEntropyGradHardLabel
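As a rough illustration of what the fused hard-label forward computes, here is a hypothetical host-side reference (the actual SoftmaxWithCrossEntropyHardLabel is a warp-level CUDA kernel; names below are illustrative, not Paddle's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Reference semantics of the fused hard-label forward:
// loss[i] = -log(softmax(logits[i])[label[i]]), or 0 if label[i] == ignore_index.
std::vector<double> HardLabelForwardRef(
    const std::vector<std::vector<double>>& logits,
    const std::vector<int64_t>& labels, int64_t ignore_index,
    std::vector<std::vector<double>>* softmax) {
  std::vector<double> loss(logits.size(), 0.0);
  softmax->assign(logits.size(), {});
  for (size_t i = 0; i < logits.size(); ++i) {
    // Numerically stable softmax: subtract the row max before exponentiating.
    double max_v = logits[i][0];
    for (double v : logits[i]) max_v = std::max(max_v, v);
    double sum = 0.0;
    for (double v : logits[i]) sum += std::exp(v - max_v);
    for (double v : logits[i]) (*softmax)[i].push_back(std::exp(v - max_v) / sum);
    if (labels[i] == ignore_index) {
      loss[i] = 0.0;  // ignored sample contributes no loss
    } else {
      // log_softmax computed directly, avoiding exp-then-log precision loss.
      loss[i] = -(logits[i][labels[i]] - max_v - std::log(sum));
    }
  }
  return loss;
}
```

The fused kernel writes both the softmax output and the per-sample loss in one pass, which is what makes merging profitable.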

@paddle-bot-old:

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@xingfeng01 (Contributor, Author):

Performance optimized, computation time reduced:

  • softmax_with_cross_entropy_1 (forward) : -78.9%
  • softmax_with_cross_entropy_1 (backward): -89.2%

if (ignore_index != tmp) {
int64_t idx = idx_n * d + tmp * remain + idx_remain;
logit_grad[idx] -= static_cast<T>(1.);
__device__ __forceinline__ T logT(T x) {
Contributor:

logT -> Log. Also, shouldn't this function return AccT?

In addition, the original implementation had a TolerableValue wrapper to handle infinite values; please keep compatibility with that.

@xingfeng01 (Author) replied Apr 19, 2021:

Fixed. Returning T is sufficient.

return static_cast<T>(std::log(static_cast<AccT>(x)));
}
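The TolerableValue handling mentioned above clamps non-finite results so that log(0) does not propagate -inf through the loss. A minimal sketch of that idea (hypothetical helper and clamp threshold, not Paddle's exact code):

```cpp
#include <cmath>

// Sketch of a TolerableValue-style guard: clamp +/-inf (and NaN) to a large
// finite value so that log(0) does not poison downstream reductions.
double TolerableLog(double x) {
  double y = std::log(x);
  constexpr double kApproxInf = 1e20;  // hypothetical clamp threshold
  if (std::isinf(y)) return y > 0 ? kApproxInf : -kApproxInf;
  if (std::isnan(y)) return kApproxInf;
  return y;
}
```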

static inline int log2_ceil(int value) {
Contributor:

log2_ceil -> Log2Ceil. Also, this function comes from softmax_cudnn_op.cu; it is recommended to move it into a common header file.

Author:

Fixed. Suggest consolidating it together with softmax later.
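For reference, the Log2Ceil helper discussed here computes the smallest n with 2^n >= value; a minimal standalone version:

```cpp
// Smallest integer n such that (1 << n) >= value, for value >= 1.
// Used to round a dimension up to the next power of two for warp kernels.
static inline int Log2Ceil(int value) {
  int log2_value = 0;
  while ((1 << log2_value) < value) ++log2_value;
  return log2_value;
}
```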

}
}
}

/*
Core function of softmax with cross entropy forward hard label.
Idea is similar
Contributor:

With this comment, the reader cannot tell what it is similar to.

Author:

Fixed.


int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize;
int local_batches = batch_size - first_batch;
if (local_batches > kBatchSize) {
Contributor:

Can this case actually occur? Note that softmax_cudnn_op.cu does not have this if.

Author:

softmax and softmax_with_cross_entropy have been unified; WarpSoftmaxForward (L112) is controlled by a template parameter.
The previous softmax_cudnn implementation handled this differently, and this case could occur there; after unification the issue no longer exists.

}

// read data from global memory
VecT srcdata[kBatchSize][kIterationsV];
Contributor:

Note that softmax_cudnn_op.cu stores this array as AccT, so there may be a slight difference in precision.

Author:

softmax and softmax_with_cross_entropy have been unified; WarpSoftmaxForward (L112) is controlled by a template parameter. After unification this issue no longer exists.

auto mode = axis == rank - 1 ? CUDNN_SOFTMAX_MODE_INSTANCE
: CUDNN_SOFTMAX_MODE_CHANNEL;
PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::cudnnSoftmaxForward(
handle, CUDNN_SOFTMAX_ACCURATE, mode,
Contributor:

If cudnn is called, it is better to use the log mode first: fusing softmax and cross_entropy is not only about speed, it also avoids the numerical precision loss of computing exp and then log.

Author:

Fixed. The log is computed first.
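The precision concern raised here can be seen numerically: computing log(softmax(x)) via exp-then-log underflows for small probabilities, while the fused log_softmax form (x - max) - log(sum(exp(x - max))) stays accurate. A small float demonstration (illustrative values):

```cpp
#include <cmath>

// Naive path: softmax first, then log. When exp(x - max) underflows to 0
// in float, the log becomes -inf.
float NaiveLogProb(float x, float max_v, float sum) {
  return std::log(std::exp(x - max_v) / sum);
}

// Fused path: log_softmax computed directly, no underflow.
float FusedLogProb(float x, float max_v, float sum) {
  return (x - max_v) - std::log(sum);
}
```

With two logits 110 apart, exp(-110.f) flushes to 0 in float, so the naive path yields -inf while the fused path returns the correct value near -110.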

SwitchWarpSoftmaxForwardHardLabel<T>(
blocks, threads, ctx.cuda_device_context().stream(), loss_data,
softmax_data, logits_data, labels_data, N, dim, dim, kDimLog2,
ignore_index);
Contributor:

softmax, log_softmax, and softmax_with_cross_entropy should ideally be launched through the same CUDA kernel, so the launch code is not repeated multiple times.

@xingfeng01 (Author) replied Apr 19, 2021:

Unified; WarpSoftmaxForward (L112) is controlled by a template parameter. Later it can fully replace the softmax_cudnn implementation and be moved to a header file.

#ifdef PADDLE_WITH_HIP
miopenTensorDescriptor_t desc_ = desc.descriptor<T>(layout, tensor_dims);
#else
cudnnTensorDescriptor_t desc_ = desc.descriptor<T>(layout, tensor_dims);
Contributor:

Since it is not a class member, the variable name does not need a trailing underscore.

Author:

Fixed.

}
}
}

namespace {
Contributor:

Why add a separate anonymous namespace here?

Author:

Fixed. This existed before; Git did not show it as a change.

@@ -886,9 +903,9 @@ class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel<T> {
} else {
auto* logits_data = logits->data<T>();
auto* labels_data = labels->data<int64_t>();
Contributor:

What does the numeric_stable_mode attribute represent? The previous branch calls cudnn's softmax.

Author:

It is an input parameter of the API; this PR does not modify the code in this branch.

#ifdef PADDLE_WITH_HIP
auto mode = axis == rank - 1 ? MIOPEN_SOFTMAX_MODE_INSTANCE
: MIOPEN_SOFTMAX_MODE_CHANNEL;
PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::miopenSoftmaxForward(
Contributor:

miopenSoftmaxForward -> miopenSoftmaxForward_V2:

PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::miopenSoftmaxForward_V2(
    handle, platform::CudnnDataType<T>::kOne(), desc_, logits_data,
    platform::CudnnDataType<T>::kZero(), desc_, softmax_data,
    MIOPEN_SOFTMAX_ACCURATE, mode));

https://rocmsoftwareplatform.github.io/MIOpen/doc/html/softmax.html?highlight=miopensoftmaxforward#miopensoftmaxforward-v2

Author:

Fixed.

@xingfeng01 xingfeng01 requested a review from Xreki April 26, 2021 02:38
@luotao1 (Contributor) left a comment:

The file mode does not need to change from 644 to 755.

@xingfeng01 (Author) commented Apr 26, 2021:

> The file mode does not need to change from 644 to 755.

Fixed. @luotao1

@xingfeng01 xingfeng01 requested a review from luotao1 April 26, 2021 03:30
@Xreki (Contributor) left a comment:

Some code-level optimization suggestions that can be addressed in follow-up PRs. This PR mainly confirms the computation logic; please also have the op owner review it.

@@ -15,44 +15,446 @@ limitations under the License. */
#include <hipcub/hipcub.hpp>
namespace cub = hipcub;
#endif
Contributor:

Do not change the file permission to 755.

Author:

Fixed.

if (labels[ids] == ignore_idx) {
loss[ids] = static_cast<T>(0.0);
} else {
loss[ids] = -Log(softmax[idx]);
Contributor:

// It also would ignore labels not in range(class_num).
if (idx_axis != labels_[idx_lbl]) {
} else {

Does this check in the original implementation still have a counterpart?

Author:

This code has the same logic as the original and handles ignore_index; it just merges the original two kernels into one.

Contributor:

This kernel merges the original HardLabelCrossEntropyFunctor (L400) and HardLabelCrossEntropyFunctorWithIgnoreIdx (L434), right? The difference between those two kernels is whether label == ignore_index is checked.

Their conditions were `if (idx_axis != labels_[idx_lbl])` (L419) and `if (idx_axis == labels_[idx_lbl] && idx_axis != ignore_idx_)` (L455); this kernel seems to be missing one of the conditions.

Author:

The original code launches n * dim * remain threads in total (L642), while the new code launches n * remain (L73), so one of the conditions is not needed.

@xingfeng01 (Author) replied May 6, 2021:

Changed; the updated code now matches the original logic.

Contributor:

> The original code launches n * dim * remain threads in total (L642), while the new code launches n * remain (L73), so one of the conditions is not needed.

So that is why the formula has no such condition. Adding some explanation in the code, or a more detailed PR description, would make it easier to understand and review.

shuffle api to compute max (sum) in one warp.
*/
template <typename T, typename VecT, typename AccT, int Log2Elements,
int CPMODE>
Contributor:

What is CP short for? Suggest defining an enum class instead:

enum class SoftmaxMode {
  kSoftmax,
  kLogSoftmax,
  kCrossEntropy,
};

Author:

Fixed.
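The suggested enum makes the compile-time mode dispatch self-documenting; a minimal sketch of how a kernel templated on the mode might branch (hypothetical, simplified to host code):

```cpp
#include <cmath>

enum class SoftmaxMode { kSoftmax, kLogSoftmax, kCrossEntropy };

// Templated on the mode: the branch is resolved at compile time, so one
// function body can serve softmax, log_softmax, and the fused loss without
// runtime cost.
template <SoftmaxMode mode>
double Finalize(double shifted_logit, double log_sum_exp) {
  if (mode == SoftmaxMode::kSoftmax) {
    return std::exp(shifted_logit - log_sum_exp);
  } else {
    // kLogSoftmax and kCrossEntropy both need the log form.
    return shifted_logit - log_sum_exp;
  }
}
```

Compared with an integer CPMODE, the enum names document the intent at every call site (`mode != SoftmaxMode::kSoftmax` instead of `CPMODE == 1 || CPMODE == 2`).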

constexpr int kIterations = kDimCeil / kWarpSize;
constexpr int kIterationsV =
(kIterations >= kVSize) ? (kIterations / kVSize) : 1;
constexpr int kBatchSize = (kDimCeil <= 128) ? 2 : 1;
Contributor:

Noting that this batch_size computation differs from the one in softmax.

Author:

This will be handled together when unifying with softmax.

#pragma unroll
for (int i = 0; i < kBatchSize; ++i) {
// it = 0
if (CPMODE == 1 || CPMODE == 2) {
Contributor:

mode != kSoftmax

Author:

Fixed.
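The warp max/sum reduction mentioned in the kernel's comment uses the CUDA shuffle API in a butterfly pattern; a host-side emulation of that pattern over a 32-element "warp" (illustrative only, not device code):

```cpp
#include <algorithm>
#include <vector>

// Emulates the butterfly reduction a CUDA warp performs with
// __shfl_xor_sync: after log2(32) = 5 steps, every lane holds the max.
std::vector<float> WarpReduceMaxEmulated(std::vector<float> lanes) {
  const int kWarpSize = 32;
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    std::vector<float> next = lanes;
    for (int lane = 0; lane < kWarpSize; ++lane) {
      // Each lane combines with its XOR partner, as the shuffle would.
      next[lane] = std::max(lanes[lane], lanes[lane ^ offset]);
    }
    lanes = next;
  }
  return lanes;
}
```

Replacing `std::max` with `+` gives the corresponding warp sum; on the device the exchange happens in registers, with no shared memory.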

auto stream = ctx.cuda_device_context().stream();

constexpr int max_dim = 320;
constexpr int warps_per_block = 4;
Contributor:

L382 can be deleted.

Author:

Deleted.

}
}
}

/*
Hard label cross entropy with exp.
Contributor:

Add a comment documenting the function's inputs and outputs. Here softmax serves as both input and output, and the input data is actually log_softmax; without a comment, this is only clear after reading the call site.

Author:

Comment added.

// thread ids compute loss[ids] using softmax[idx]
if (idx < n * dim * d) {
if (idx_dim == labels[ids]) {
if (labels[ids] == ignore_idx) {
Contributor:

I see the original kernels all check the label value range; should this kernel add that check as well?

Author:

This code has the same logic as the original and handles ignore_index; it just merges the original two kernels into one.

Contributor:

PADDLE_ENFORCE(labels_[idx_lbl] >= 0 && labels_[idx_lbl] < d_ ||
                   labels_[idx_lbl] == ignore_idx_,
               "The value of label[%ld] expected >= 0 and < %ld, or == %d,"
               "but got %ld. Please check input value.",
               idx_lbl, d_, ignore_idx_, labels_[idx_lbl]);

As in the code above, the check should be performed here as well.

@xingfeng01 (Author) replied Apr 26, 2021:

I suggest not adding this error check: it consumes compute resources, and if the input is wrong, returning some output is still reasonable.
The original code uses labels with similar logic in roughly 10 places, and only 1 of them has PADDLE_ENFORCE.

Contributor:

> I suggest not adding this error check: it consumes compute resources, and if the input is wrong, returning some output is still reasonable. The original code uses labels with similar logic in roughly 10 places, and only 1 of them has PADDLE_ENFORCE.

The main consideration should be whether this function needs the check, not whether the original code has it.
For ops like this, if user data is bad and never checked, the cost is that the user trains for a long time, finds the model does not converge, and then spends a long time tracing the problem back to the data. Adding the check has little overhead.

Author:

OK, I will handle this in a follow-up PR; this PR keeps the behavior consistent with the original code.
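The range check discussed above reduces to a simple predicate; a host-side sketch of the validity condition (the kernel itself would enforce it via PADDLE_ENFORCE, as quoted earlier):

```cpp
#include <cstdint>

// Returns true when the label is valid: either in [0, class_num)
// or exactly equal to ignore_index.
bool LabelInRange(int64_t label, int64_t class_num, int64_t ignore_index) {
  return (label >= 0 && label < class_num) || label == ignore_index;
}
```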

CrossEntropyHardLabel<
T><<<blocks, threads, 0, context.cuda_device_context().stream()>>>(
loss_data, logits_data, labels_data, n, axis_dim, d / axis_dim,
ignore_index);
Contributor:

This branch is not covered by the op benchmark; it would be best to add a configuration to test its performance.

Author:

OK, will add that later.

*/
template <typename T>
static void SoftmaxWithCrossEntropyHardLabel(
const framework::ExecutionContext& ctx, int rank, int axis,
Contributor:

Since the inputs and outputs are not fetched from the context, it is better to pass platform::CUDADeviceContext directly.

Author:

Fixed.

@xingfeng01 (Author):

> Some code-level optimization suggestions that can be addressed in follow-up PRs. This PR mainly confirms the computation logic; please also have the op owner review it.

The op owner has already reviewed and replied OK offline.

@xingfeng01 xingfeng01 requested a review from Xreki May 6, 2021 01:28
@paddle-bot-old commented May 8, 2021:

Sorry to inform you that 0029c65's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@Xreki (Contributor) left a comment:

This kind of op is precision-sensitive; it is recommended to run a model to verify convergence.


@xingfeng01 (Author):

Verified locally: the deeplabv3 and amp models pass.

@xingfeng01 xingfeng01 requested a review from Xreki May 21, 2021 05:39
@Xreki (Contributor) left a comment:

LGTM

@Xreki Xreki merged commit 7be6191 into PaddlePaddle:develop May 21, 2021
Xreki added a commit that referenced this pull request Jun 4, 2021
Xreki added a commit that referenced this pull request Jun 5, 2021