Support hybrid_parallel_topo_order for auto parallel Llama #8011
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #8011     +/-   ##
===========================================
- Coverage    56.55%   56.51%   -0.04%
===========================================
  Files          592      592
  Lines        91055    91126     +71
===========================================
+ Hits         51492    51499      +7
- Misses       39563    39627     +64
☔ View full report in Codecov by Sentry.
LGTM
LGTM for paddlenlp/ops/distributed/utils/topo.py and paddlenlp/trainer/training_args.py
…nto support-hybrid-parallel-topo-order-for-auto-parallel-llama
PR types
New features
PR changes
Models
Description
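Adapt the static-graph semi-auto parallel Llama model to the `hybrid_parallel_topo_order` argument. The default hybrid-parallel topology order is changed to `["pp", "dp", "mp"]`, which aligns with the dynamic-graph behavior when `hybrid_parallel_topo_order == "pp_first"`. The original order `["dp", "pp", "mp"]` is kept only when `hybrid_parallel_topo_order == "sharding_first"` is set.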
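To illustrate why the topology order matters, here is a minimal sketch of how an axis order can determine which parallel coordinates a global rank receives. It is not the PaddleNLP or Paddle implementation: the helper `build_rank_coords`, the 2x2x2 degrees, and the row-major layout (last axis varies fastest) are all assumptions made for illustration.

```python
from itertools import product


def build_rank_coords(order, degrees):
    """Map each global rank to its coordinate along every parallel axis.

    Hypothetical helper. `order` lists axis names from outermost (slowest
    varying) to innermost (fastest varying); `degrees` gives each axis size.
    """
    sizes = [degrees[name] for name in order]
    coords = {}
    # Enumerate coordinates in row-major order and assign ranks 0, 1, 2, ...
    for rank, index in enumerate(product(*(range(s) for s in sizes))):
        coords[rank] = dict(zip(order, index))
    return coords


degrees = {"pp": 2, "dp": 2, "mp": 2}  # assumed 8-card job
pp_first = build_rank_coords(["pp", "dp", "mp"], degrees)
sharding_first = build_rank_coords(["dp", "pp", "mp"], degrees)

print(pp_first[2])        # {'pp': 0, 'dp': 1, 'mp': 0}
print(sharding_first[2])  # {'dp': 0, 'pp': 1, 'mp': 0}
```

In this sketch, rank 2 sits in pipeline stage 0 under the `pp`-first order but in pipeline stage 1 under the `dp`-first order, so anything derived from a rank's coordinates can differ between the two orders.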
[Precision impact of changing the topology order]
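This change unexpectedly altered the seeds used for random parameter initialization in both the legacy static semi-auto network and the unified dynamic/static network, which in turn changed the training loss.
Legacy static semi-auto network: random seeds are generated by the `_get_distributed_seeds` method in PaddleNLP, and the topo order passed to it was hard-coded as `[dp,pp]`. This PR upgrades the `Topology` class used to initialize the random seeds: instead of supporting only `["dp", "pp", "sharding", "mp", "sep"]`, it now accepts an arbitrary topology order, so the loss stays unchanged across topology orders.
Unified dynamic/static network: random seeds are generated by the framework's `determinate_rng` method, which builds the seed from the mesh's global auto-increment id. The `get_mesh_with_dim` interface, used to take the mesh along the pp dimension, introduced an offset in that global auto-increment id, which changed the random seed. Together with framework PR PaddlePaddle/Paddle#62125, this PR rewrites `get_mesh_with_dim` so that a change in the global auto-increment id no longer changes the loss. See the framework PR description for more details.
[Convergence verification for the unified dynamic/static network and update of the CI-monitored loss]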
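After the `get_mesh_with_dim` interface is rewritten for the new static semi-auto network, swapping the topology order keeps the mesh auto-increment ids identical, so the loss does not change. However, the baseline loss values of the dynamic semi-auto models on CI were produced with the old ids and need to be revised; they are updated in this PR as well. For the cases currently monitored on CI:
The following three dynamic semi-auto fp32 tasks were selected for convergence verification: dp2, dp2-mp2, and dp2-mp2-pp2. (With the fix, precision under the legacy network is unchanged, so the legacy network does not need to be verified; the changes do not affect the fp16 logic, so only fp32 needs to be verified.)
The convergence curves are as follows:
dp2-mp2-pp2:
dp2-mp2:
dp2: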
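While this PR was being tested, two other PRs introduced expected precision changes that were not correctly caught by CI. The loss values of the affected cases are also updated in this PR.
SwiGLU: #8038
Affected cases:
master_grad change: PaddlePaddle/Paddle#62276
Affected cases: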