improve amp training and fix nan error #8305
Merged
Background: On Paddle 2.4, amp-O2 training of ppyoloe, ppyolo, ppyolov2, picodet, mask_rcnn, and tinypose either produces NaN or fails to match fp32 accuracy. This PR fixes these models; the post-fix accuracy has been self-tested and matches fp32 training. The main problems fixed are:
(1) Some model-specific modules produce values that overflow in fp16, making them unsuitable for fp16 computation (see the sketch below).
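
As a minimal illustration (an assumption, not necessarily this PR's exact change), one way to keep such an overflow-prone block in fp32 is the older wrapper pattern that point (2) below refers to; the function and module names here are hypothetical:

```python
import paddle

def run_in_fp32(head, feat):
    # `head`: hypothetical module whose intermediates exceed the fp16
    # max (~65504) under amp-O2. Disabling auto_cast keeps the whole
    # block in fp32.
    with paddle.amp.auto_cast(enable=False):
        return head(feat.cast('float32'))  # inputs may arrive as fp16
```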
(2) Some models need certain op inputs to be manually promoted to fp32: as modified in this PR, some modules still overflow on Paddle 2.5, but with the upgraded O2 we only need to manually cast the op's inputs to fp32 to guarantee the op computes in fp32, whereas previously this took two steps, a manual cast plus auto_cast(enable=False). A sketch follows.
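
A hedged sketch of the single-step pattern, assuming the upgraded O2 behavior described above; the helper name and the choice of softmax as the overflow-prone op are illustrative only:

```python
import paddle
import paddle.nn.functional as F

def fp32_softmax(logits):
    # With the upgraded O2, casting the input to fp32 is (per the PR
    # description) sufficient for the op to compute in fp32; no
    # auto_cast(enable=False) wrapper is needed anymore.
    return F.softmax(logits.cast('float32'), axis=-1)
```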
(3) In some models the gradients are so small that training metrics cannot match fp32: tinypose gradients are extremely small, and the suite's initial loss scale of 1024 leaves most of them at 0 in fp16, so the parameters are never updated; the symptom is a loss that neither decreases nor turns NaN after training starts. Furthermore, even though the weight gradients can be amplified during backpropagation by raising the loss scale, they must be unscaled before the parameter update, at which point they again underflow to 0 in fp16 and corrupt the update. We therefore added a master_grad feature to solve this: when enabled, at the end of backpropagation in each iteration the weight gradients are cast to fp32 and this fp32 copy replaces the original fp16 gradients, guaranteeing that anything reading param.grad afterwards sees fp32 gradients and avoiding underflow to 0 after unscaling (see the sketch below).
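
A minimal end-to-end sketch of enabling the feature, assuming the `master_grad` flag of `paddle.amp.decorate` (available in newer Paddle releases); the model, data, and hyperparameters are stand-ins:

```python
import paddle

model = paddle.vision.models.resnet18()  # stand-in for tinypose
opt = paddle.optimizer.Momentum(learning_rate=0.01,
                                parameters=model.parameters())

# master_grad=True: after backward, weight grads are cast to fp32 and
# replace the fp16 grads, so unscaling no longer underflows to 0.
model, opt = paddle.amp.decorate(models=model, optimizers=opt,
                                 level='O2', master_grad=True)

scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

x = paddle.randn([2, 3, 224, 224])
y = paddle.randint(0, 1000, [2])

with paddle.amp.auto_cast(level='O2'):
    loss = paddle.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # gradients are scaled here
scaler.step(opt)               # unscaling acts on the fp32 grad copies
scaler.update()
opt.clear_grad()
```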
(4) In some models the gradients are large enough to make training NaN: ppyolo and ppyolov2 have large weight gradients, which overflow to inf during the grad_clip stage when the clipping is computed in fp16. This can also be solved by enabling master_grad (see below).
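
For completeness, a short sketch of wiring global-norm gradient clipping together with master_grad, under the assumption that clipping then operates on the fp32 gradient copies; the clip_norm value is arbitrary:

```python
import paddle

model = paddle.nn.Linear(16, 4)
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)  # illustrative value
opt = paddle.optimizer.Momentum(learning_rate=0.01,
                                parameters=model.parameters(),
                                grad_clip=clip)

# With master_grad=True, the sum of squared gradients inside grad_clip
# is computed over fp32 grads, so large grads no longer overflow to inf.
model, opt = paddle.amp.decorate(models=model, optimizers=opt,
                                 level='O2', master_grad=True)
```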
(5) Under amp-O2, EMA must store its weights in fp32 and update them with fp32 computation: the original EMA implementation used fp16 storage and computation in O2 mode, and as the error accumulates over iterations the training loss decreases normally while the eval metrics end up close to 0. A sketch of the fp32-shadow pattern follows.
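
A minimal sketch of the fp32-shadow EMA pattern described above; the class name, decay value, and update rule details are illustrative, not the PR's exact implementation:

```python
import paddle

class ModelEMA:
    """fp32-shadow EMA sketch: the shadow weights are stored and updated
    in fp32 even when the model itself holds fp16 weights under amp-O2."""

    def __init__(self, model, decay=0.9998):  # decay is illustrative
        self.decay = decay
        # shadow copy is cast to fp32 once at construction
        self.state = {k: v.detach().cast('float32')
                      for k, v in model.state_dict().items()}

    @paddle.no_grad()
    def update(self, model):
        d = self.decay
        for k, v in model.state_dict().items():
            # accumulate in fp32 regardless of the parameter's dtype
            self.state[k] = d * self.state[k] + (1.0 - d) * v.cast('float32')
```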