optimize softmax with cross entropy hard label #32290
Conversation
Thanks for your contribution!
Performance optimized, computation time reduced:
if (ignore_index != tmp) {
  int64_t idx = idx_n * d + tmp * remain + idx_remain;
  logit_grad[idx] -= static_cast<T>(1.);
__device__ __forceinline__ T logT(T x) {
logT -> Log. Also, shouldn't this function return AccT? Additionally, the original implementation wrapped this in TolerableValue to handle infinite values; please keep that behavior for compatibility.
Fixed. Returning T is sufficient.
return static_cast<T>(std::log(static_cast<AccT>(x)));
}
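For reference, the TolerableValue handling mentioned above can be sketched on the host as follows (a minimal sketch, not Paddle's actual helper; the name TolerableLog and the 1e20 cap are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Sketch of a TolerableValue-style log: clamp +/-inf and NaN so that
// -log(0) does not propagate infinities through the loss.
template <typename T>
T TolerableLog(T x) {
  const T kApproInf = static_cast<T>(1e20);  // illustrative cap
  T y = std::log(x);
  if (std::isnan(y)) return kApproInf;               // log of a negative input
  if (std::isinf(y)) return y > 0 ? kApproInf : -kApproInf;  // log(0) -> -inf
  return y;
}
```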
static inline int log2_ceil(int value) {
log2_ceil -> Log2Ceil. Also, this function comes from softmax_cudnn_op.cu as well; suggest moving it into a common header file.
Fixed. Suggest consolidating with softmax in a later PR.
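Since the helper under discussion is tiny, a sketch of the renamed version that could live in a common header (consistent with the snippet above; naming follows the review suggestion):

```cpp
#include <cassert>

// Smallest n such that (1 << n) >= value; used to round the softmax
// dimension up to a power of two for the warp kernels.
static inline int Log2Ceil(int value) {
  int log2_value = 0;
  while ((1 << log2_value) < value) ++log2_value;
  return log2_value;
}
```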
}
}
}
/*
Core function of softmax with cross entropy forward hard label.
Idea is similar
With this comment alone, readers cannot tell what it is similar to.
Fixed.
int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize;
int local_batches = batch_size - first_batch;
if (local_batches > kBatchSize) {
Can this case actually occur? Note that softmax_cudnn_op.cu does not have this if.
softmax and softmax_with_cross_entropy have been merged; WarpSoftmaxForward (L112) is controlled by a template parameter. The previous softmax_cudnn implementation used a different approach, so this case could occur there; after the merge the issue no longer arises.
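The per-warp batch bookkeeping being discussed can be sketched on the host (a simplified model; WarpWork and AssignBatches are hypothetical names, and the real kernel derives the warp id from blockDim/blockIdx/threadIdx):

```cpp
#include <algorithm>
#include <cassert>

// Each warp handles kBatchSize rows starting at first_batch; the last
// warp may get fewer rows, which is what the clamp below models.
struct WarpWork {
  int first_batch;
  int local_batches;
};

WarpWork AssignBatches(int batch_size, int warp_id, int kBatchSize) {
  WarpWork w;
  w.first_batch = warp_id * kBatchSize;
  // Clamp so the tail warp does not read past batch_size.
  w.local_batches = std::min(batch_size - w.first_batch, kBatchSize);
  return w;
}
```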
}
// read data from global memory
VecT srcdata[kBatchSize][kIterationsV];
Note that softmax_cudnn_op.cu stores the array as AccT; there may be a slight difference in precision.
softmax and softmax_with_cross_entropy have been merged; WarpSoftmaxForward (L112) is controlled by a template parameter. After the merge the issue no longer arises.
auto mode = axis == rank - 1 ? CUDNN_SOFTMAX_MODE_INSTANCE
                             : CUDNN_SOFTMAX_MODE_CHANNEL;
PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::cudnnSoftmaxForward(
    handle, CUDNN_SOFTMAX_ACCURATE, mode,
If cuDNN is called, it is still better to use the log mode first: fusing softmax + cross_entropy is not only for speed, but also to avoid the numerical precision loss of computing exp first and then log.
Fixed. The log is computed first.
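The numerical argument above can be illustrated with a small host-side sketch: computing log-softmax via log-sum-exp stays finite where a naive log(softmax(x)) would overflow to inf and produce -inf losses:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Stable log-softmax: subtract the max, then use log-sum-exp.
// Computing softmax with exp() first and taking log() afterwards can
// overflow/underflow (e.g. exp(1000) -> inf), losing precision.
std::vector<double> LogSoftmax(const std::vector<double>& x) {
  double mx = x[0];
  for (double v : x) mx = std::max(mx, v);
  double sum = 0.0;
  for (double v : x) sum += std::exp(v - mx);
  double lse = mx + std::log(sum);
  std::vector<double> out;
  for (double v : x) out.push_back(v - lse);
  return out;
}
```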
SwitchWarpSoftmaxForwardHardLabel<T>(
    blocks, threads, ctx.cuda_device_context().stream(), loss_data,
    softmax_data, logits_data, labels_data, N, dim, dim, kDimLog2,
    ignore_index);
softmax, log_softmax, and softmax_with_cross_entropy should ideally launch the same CUDA kernel, so the launch code is not duplicated several times.
Merged; WarpSoftmaxForward (L112) is controlled by a template parameter. Later it can fully replace the implementation in softmax_cudnn and be moved into a header file.
#ifdef PADDLE_WITH_HIP
miopenTensorDescriptor_t desc_ = desc.descriptor<T>(layout, tensor_dims);
#else
cudnnTensorDescriptor_t desc_ = desc.descriptor<T>(layout, tensor_dims);
Since this is not a class member, the variable name does not need a trailing underscore.
Fixed.
}
}
}
namespace {
Why add a separate anonymous namespace here?
Fixed. This existed before; the Git diff just did not surface it.
@@ -886,9 +903,9 @@ class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel<T> {
} else {
  auto* logits_data = logits->data<T>();
  auto* labels_data = labels->data<int64_t>();
What does the numeric_stable_mode attribute represent? The previous branch calls cuDNN's softmax.
It is an input parameter of the API; this PR does not modify the code of that branch.
#ifdef PADDLE_WITH_HIP
auto mode = axis == rank - 1 ? MIOPEN_SOFTMAX_MODE_INSTANCE
                             : MIOPEN_SOFTMAX_MODE_CHANNEL;
PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::miopenSoftmaxForward(
miopenSoftmaxForward -> miopenSoftmaxForward_V2:
PADDLE_ENFORCE_CUDA_SUCCESS(platform::dynload::miopenSoftmaxForward_V2(
    handle, platform::CudnnDataType<T>::kOne(), desc_, logits_data,
    platform::CudnnDataType<T>::kZero(), desc_, softmax_data,
    MIOPEN_SOFTMAX_ACCURATE, mode));
Fixed.
The file mode does not need to change from 644 to 755.
Fixed. @luotao1
Some of the code-level optimization suggestions can be handled in follow-up PRs. The current PR mainly confirms the computation logic; please also have the op owner review it.
@@ -15,44 +15,446 @@ limitations under the License. */
#include <hipcub/hipcub.hpp>
namespace cub = hipcub;
#endif
Do not change the file permissions to 755.
Fixed.
if (labels[ids] == ignore_idx) {
  loss[ids] = static_cast<T>(0.0);
} else {
  loss[ids] = -Log(softmax[idx]);
Paddle/paddle/fluid/operators/softmax_with_cross_entropy_op.cu, lines 418 to 420 in f16981b:
// It also would ignore labels not in range(class_num).
if (idx_axis != labels_[idx_lbl]) {
} else {
Does this check from the original implementation still have a counterpart?
This code has the same logic as the original and handles ignore_index; it only merges the original two kernels into one.
This kernel merges the original L400 HardLabelCrossEntropyFunctor and L434 HardLabelCrossEntropyFunctorWithIgnoreIdx, whose difference was whether label == ignore_index is checked. The conditions in those two kernels were L419 if (idx_axis != labels_[idx_lbl]) and L455 if (idx_axis == labels_[idx_lbl] && idx_axis != ignore_idx_); this kernel seems to be missing one of the conditions.
The original code launches n * dim * remain threads in total (L642), while the new code launches n * remain (L73), so one condition is no longer needed.
Changed; the modified code is now logically consistent with the original.
> The original code launches n * dim * remain threads in total (L642), while the new code launches n * remain (L73), so one condition is no longer needed.

I see, that is why the formula has no such check. Adding some explanation in the code, or more detail in the PR description, would make it easier to understand and review.
shuffle api to compute max (sum) in one warp.
*/
template <typename T, typename VecT, typename AccT, int Log2Elements,
          int CPMODE>
What is CP short for? Suggest defining an enum type instead:
enum class SoftmaxMode {
  kSoftmax,
  kLogSoftmax,
  kCrossEntropy,
};
Fixed.
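The suggested enum can drive compile-time dispatch in a single templated routine, roughly like this (a sketch only; the real WarpSoftmaxForward has a different signature):

```cpp
#include <cassert>
#include <cmath>

enum class SoftmaxMode { kSoftmax, kLogSoftmax, kCrossEntropy };

// One templated routine; the mode is a template parameter, so each
// instantiation is specialized at compile time (sketch of the
// template-parameter dispatch described in the thread).
template <SoftmaxMode mode>
double Finalize(double exp_x, double sum) {
  if (mode == SoftmaxMode::kSoftmax) return exp_x / sum;
  // kLogSoftmax and kCrossEntropy both need the log form.
  return std::log(exp_x) - std::log(sum);
}
```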
constexpr int kIterations = kDimCeil / kWarpSize;
constexpr int kIterationsV =
    (kIterations >= kVSize) ? (kIterations / kVSize) : 1;
constexpr int kBatchSize = (kDimCeil <= 128) ? 2 : 1;
Noting for the record: this batch size computation differs from the one in softmax.
This will be handled together when softmax is consolidated.
#pragma unroll
for (int i = 0; i < kBatchSize; ++i) {
  // it = 0
  if (CPMODE == 1 || CPMODE == 2) {
mode != kSoftmax
Fixed.
auto stream = ctx.cuda_device_context().stream();

constexpr int max_dim = 320;
constexpr int warps_per_block = 4;
L382 can be deleted.
Deleted.
}
}
}
/*
Hard label cross entropy with exp.
Please add a comment documenting the function's inputs and outputs. Here softmax seems to serve as both input and output, and the input data should be log_softmax; without a comment I only figured this out by reading the call site.
Comment added.
// thread ids compute loss[ids] using softmax[idx]
if (idx < n * dim * d) {
  if (idx_dim == labels[ids]) {
    if (labels[ids] == ignore_idx) {
The original kernels all check the label value range; shouldn't this kernel add that check too?
This code has the same logic as the original and handles ignore_index; it only merges the original two kernels into one.
Paddle/paddle/fluid/operators/softmax_with_cross_entropy_op.cu, lines 532 to 536 in 202b0ea:
PADDLE_ENFORCE(labels_[idx_lbl] >= 0 && labels_[idx_lbl] < d_ ||
                   labels_[idx_lbl] == ignore_idx_,
               "The value of label[%ld] expected >= 0 and < %ld, or == %d,"
               "but got %ld. Please check input value.",
               idx_lbl, d_, ignore_idx_, labels_[idx_lbl]);
As in the code above, the check really should be applied here as well.
I would suggest not adding this error check: it consumes compute resources, and if the input is wrong, returning some output is also reasonable. The original code uses labels with similar logic in roughly 10 places, only 1 of which has a PADDLE_ENFORCE.
> I would suggest not adding this error check: it consumes compute resources, and if the input is wrong, returning some output is also reasonable. The original code uses labels with similar logic in roughly 10 places, only 1 of which has a PADDLE_ENFORCE.

The main question should be whether this function needs the check, not whether the original code has it. For ops like this, if user data is bad and nothing checks it, the cost is that users train for a long time, find the model does not converge, and only after lengthy debugging discover the data is wrong. Adding a check does not cost much.
OK, I will open a follow-up PR for this; this PR keeps the behavior consistent with the original code.
CrossEntropyHardLabel<
    T><<<blocks, threads, 0, context.cuda_device_context().stream()>>>(
    loss_data, logits_data, labels_data, n, axis_dim, d / axis_dim,
    ignore_index);
This branch is not covered by the op benchmark; it would be best to add a configuration to test its performance.
OK, will add it later.
*/
template <typename T>
static void SoftmaxWithCrossEntropyHardLabel(
    const framework::ExecutionContext& ctx, int rank, int axis,
Since the inputs and outputs are not fetched from the context, I suggest passing platform::CUDADeviceContext directly.
Fixed.
The op owner has reviewed and replied OK offline.
Sorry to inform you that 0029c65's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Ops like this are precision-sensitive; suggest running a model to verify convergence.
if (labels[ids] == ignore_idx) {
  loss[ids] = static_cast<T>(0.0);
} else {
  loss[ids] = -Log(softmax[idx]);
> The original code launches n * dim * remain threads in total (L642), while the new code launches n * remain (L73), so one condition is no longer needed.

I see, that is why the formula has no such check. Adding some explanation in the code, or more detail in the PR description, would make it easier to understand and review.
… develop-hardlabel
Verified locally: deeplabv3 and AMP models pass.
LGTM
PR types
Performance optimization
PR changes
Ops
Describe
Optimize op softmax_with_cross_entropy with hard label. This PR includes: