
Support hybrid_parallel_topo_order for auto parallel Llama #8011

Conversation


@From00 From00 commented Feb 23, 2024

PR types

New features

PR changes

Models

Description

Adapt the static-graph semi-auto-parallel (静半) Llama model to the hybrid_parallel_topo_order argument. The default hybrid-parallel topology order is changed to ["pp", "dp", "mp"], aligning it with the dynamic-graph behaviour when hybrid_parallel_topo_order == "pp_first"; the original order ["dp", "pp", "mp"] is kept only when hybrid_parallel_topo_order == "sharding_first".
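The selection logic, as a minimal Python sketch (function name and defaults here are illustrative, not the actual PaddleNLP implementation):

    def choose_topo_order(hybrid_parallel_topo_order="pp_first"):
        # Only "sharding_first" keeps the original order; everything else
        # falls through to the new default that matches dynamic-graph "pp_first".
        if hybrid_parallel_topo_order == "sharding_first":
            return ["dp", "pp", "mp"]
        return ["pp", "dp", "mp"]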

[Impact of the topology-order change on precision]
This change unexpectedly alters the seeds used to randomly initialize parameters in both the legacy static-graph semi-auto networking and the unified dygraph-to-static networking, which in turn changes the training loss.
Legacy static-graph semi-auto networking: random seeds are generated by the _get_distributed_seeds method in PaddleNLP, whose topo order was hard-coded to [dp, pp]. This PR upgrades the Topology class used for seed initialization from supporting only ["dp", "pp", "sharding", "mp", "sep"] to accepting an arbitrary topology order, so the loss precision stays unchanged across different topology orders.
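A minimal sketch of the idea, using hypothetical class and method names (this is not the actual PaddleNLP Topology code): the seed offset is derived from axis coordinates looked up by name over a caller-fixed axis list, so permuting the topology order does not change the result.

    from dataclasses import dataclass

    @dataclass
    class ArbitraryOrderTopology:
        order: list    # e.g. ["pp", "dp", "mp"] or ["dp", "pp", "mp"]
        degrees: dict  # parallel degree per axis, e.g. {"pp": 2, "dp": 2, "mp": 2}
        coords: dict   # this rank's coordinate on each axis, e.g. {"pp": 1, "dp": 0, "mp": 1}

        def seed_offset(self, axes=("dp", "pp")):
            # Mixed-radix index over the requested axes only; the axis order is
            # fixed by the caller, not by self.order, so reordering the topology
            # leaves the derived seed (and hence the loss) unchanged.
            offset = 0
            for name in axes:
                offset = offset * self.degrees[name] + self.coords[name]
            return offset

    # The same rank gets the same seed regardless of the topology order.
    base_seed = 1234
    topo_pp_first = ArbitraryOrderTopology(["pp", "dp", "mp"], {"pp": 2, "dp": 2, "mp": 2}, {"pp": 1, "dp": 0, "mp": 1})
    topo_dp_first = ArbitraryOrderTopology(["dp", "pp", "mp"], {"pp": 2, "dp": 2, "mp": 2}, {"pp": 1, "dp": 0, "mp": 1})
    assert base_seed + topo_pp_first.seed_offset() == base_seed + topo_dp_first.seed_offset()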
Unified dygraph-to-static networking: random seeds are generated by the framework's determinate_rng method, which constructs the seed from the mesh's globally auto-incremented id. The get_mesh_with_dim interface used to take the pp-dimension mesh introduced an offset in this global auto-increment id, which changed the random seeds. Together with framework PR PaddlePaddle/Paddle#62125, this PR rewrites the get_mesh_with_dim operation so that changes to the global auto-increment id no longer alter the loss. See the framework PR description for more details.
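A conceptual sketch of the problem and fix, using toy classes rather than the real Paddle API (see PaddlePaddle/Paddle#62125 for the actual change): if slicing out a sub-mesh allocated a fresh globally auto-incremented id on every call, any RNG seeded from that id would shift; caching the sub-mesh keeps the id, and therefore the seed, stable.

    import itertools

    _next_mesh_id = itertools.count()

    class ToyMesh:
        def __init__(self, dim_names):
            self.unique_id = next(_next_mesh_id)  # globally auto-incremented id
            self.dim_names = dim_names
            self._sub_mesh_cache = {}

        def get_mesh_with_dim(self, dim_name):
            # Reuse the previously built sub-mesh instead of allocating a new id
            # on every call, so id-derived seeds do not drift.
            if dim_name not in self._sub_mesh_cache:
                self._sub_mesh_cache[dim_name] = ToyMesh([dim_name])
            return self._sub_mesh_cache[dim_name]

    def toy_determinate_rng_seed(mesh, base_seed=100):
        # Stand-in for seeding an RNG from the mesh's unique id.
        return base_seed + mesh.unique_id

    global_mesh = ToyMesh(["pp", "dp", "mp"])
    seed_a = toy_determinate_rng_seed(global_mesh.get_mesh_with_dim("pp"))
    seed_b = toy_determinate_rng_seed(global_mesh.get_mesh_with_dim("pp"))
    assert seed_a == seed_b  # stable across repeated calls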

[Convergence verification for the unified networking and CI baseline-loss updates]
After the get_mesh_with_dim interface rewrite, the new static-graph semi-auto networking keeps the same mesh auto-increment ids when the topology order is swapped, so its loss does not change. However, the baseline losses of the dynamic-graph semi-auto (动半) models on CI were produced with the old ids and need to be updated; this is also done in this PR.
The cases currently monitored on CI are:

function llama_case_list_auto() {
    llama_dygraph_auto_bs8_fp32_DP2
    llama_dygraph_auto_bs8_fp32_DP2-MP2
    llama_dygraph_auto_bs8_fp32_DP2-MP2-PP2
    llama_dygraph_auto_bs8_fp16_DP2-MP2-PP2

    llama_static_auto_recompute_bs8_fp32_DP1-MP1-PP1
    llama_static_auto_recompute_bs16_fp32_DP2-MP1-PP1
    llama_static_auto_recompute_bs16_fp32_DP2-MP2-PP1
    llama_static_auto_recompute_bs16_fp32_DP2-MP2-PP2
    llama_static_auto_recompute_bs16_fp32_DP2-MP2-PP2-VPP2-Sharding2_stage2
    llama_static_auto_recompute_bs16_fp16_DP2-MP2-PP2-VPP2-Sharding2_stage2
}

From these, the following three dynamic-graph semi-auto fp32 tasks (dp2, dp2-mp2, dp2-mp2-pp2) are selected for convergence verification (the legacy networking keeps its precision after the fix and does not need re-verification, and the changes do not affect the fp16 logic, so only fp32 needs to be verified):

    llama_dygraph_auto_bs8_fp32_DP2
    llama_dygraph_auto_bs8_fp32_DP2-MP2
    llama_dygraph_auto_bs8_fp32_DP2-MP2-PP2

The convergence curves are as follows (loss curves attached as images in the original PR):

dp2-mp2-pp2: [convergence curve images]

dp2-mp2: [convergence curve images]

dp2: [convergence curve images]

While this PR was under test, two other PRs introduced expected precision changes that were not caught by CI; the baseline losses of the affected cases are also updated in this PR.
SwiGLU: #8038
Affected cases:

llama_static_auto_recompute_bs16_fp32_DP2-MP1-PP1
llama_static_auto_recompute_bs16_fp32_DP2-MP2-PP1
llama_static_auto_recompute_bs16_fp32_DP2-MP2-PP2-VPP2-Sharding2_stage2

master_grad change: PaddlePaddle/Paddle#62276
Affected case:

llama_dygraph_auto_bs8_fp16_DP2-MP2-PP2


paddle-bot bot commented Feb 23, 2024

Thanks for your contribution!


codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 0%, with 38 lines in your changes missing coverage. Please review.

Project coverage is 56.51%. Comparing base (4b1c54b) to head (60ba56b).
Report is 10 commits behind head on develop.

Files                                            Patch %   Lines
paddlenlp/ops/distributed/utils/topo.py          0.00%     26 Missing ⚠️
paddlenlp/trainer/training_args.py               0.00%     11 Missing ⚠️
paddlenlp/transformers/llama/modeling_auto.py    0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8011      +/-   ##
===========================================
- Coverage    56.55%   56.51%   -0.04%     
===========================================
  Files          592      592              
  Lines        91055    91126      +71     
===========================================
+ Hits         51492    51499       +7     
- Misses       39563    39627      +64     



@haohongxiang haohongxiang left a comment


LGTM

ZHUI previously approved these changes Mar 4, 2024

@ZHUI ZHUI left a comment


LGTM for paddlenlp/ops/distributed/utils/topo.py and paddlenlp/trainer/training_args.py

From00 added 2 commits March 4, 2024 19:59
…nto support-hybrid-parallel-topo-order-for-auto-parallel-llama
@wawltor wawltor merged commit b504a73 into PaddlePaddle:develop Mar 8, 2024
7 of 10 checks passed