[XPU] support unified ckpt function #9312
Conversation
Thanks for your contribution!
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9312      +/-   ##
===========================================
+ Coverage    52.80%    52.92%   +0.11%
===========================================
  Files          660       660
  Lines       106869    106875       +6
===========================================
+ Hits         56434     56564     +130
+ Misses       50435     50311     -124

☔ View full report in Codecov by Sentry.
if paddle.is_compiled_with_xpu():
    # XPU does not support all_reduce prod yet; on XPU, bool is treated as int8,
    # so temporarily use reduce_min instead.
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
Then let's just change it to dist.all_reduce(local_resume, op=dist.ReduceOp.MIN) everywhere; there is no need to special-case XPU versus GPU.
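A minimal sketch of the unified call the reviewer suggests (the helper name is hypothetical; it assumes `local_resume` holds a 0/1 integer flag per rank and that the parallel environment is already initialized):

```python
import paddle
import paddle.distributed as dist

# Hypothetical helper: each rank reports whether it can resume.
# ReduceOp.MIN over a 0/1 tensor acts as a logical AND across ranks,
# so the same call works on GPU and on XPU (which lacks all_reduce prod).
def all_ranks_can_resume(local_ok: bool) -> bool:
    local_resume = paddle.to_tensor([1 if local_ok else 0], dtype="int32")
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
    return bool(local_resume.item())
```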
Done
if paddle.is_compiled_with_xpu():
    # XPU does not support all_reduce prod yet; on XPU, bool is treated as int8,
    # so temporarily use reduce_min instead.
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
Same as above.
LGTM
LGTM
if len(checkpoint_rng_state["cuda"]) != core.get_xpu_device_count():
    raise ValueError("Length of xpu state list should be equal to the xpu device count")
for i in range(core.get_xpu_device_count()):
    core.default_xpu_generator(i).set_state(checkpoint_rng_state["cuda"][i])
Is the XPU handling here any different from the custom device handling further down? A change on the framework side seems more appropriate.
On the framework side, XPU is at the same level as GPU and CPU as a device type, while custom devices (e.g., Hygon, Ascend) are grouped together separately.
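A hedged sketch of what that per-device dispatch can look like (the function name is hypothetical; it assumes the generator states are stored under the "cuda" key, as in the snippet quoted above):

```python
import paddle
from paddle.framework import core

# Hypothetical restore helper: XPU is handled as a first-class device,
# parallel to the CUDA branch, rather than through the custom-device path.
def restore_device_rng_state(checkpoint_rng_state):
    # Key reused for the active device's per-card generator states.
    states = checkpoint_rng_state["cuda"]
    if paddle.is_compiled_with_cuda():
        if len(states) != core.get_cuda_device_count():
            raise ValueError("Length of cuda state list should be equal to the cuda device count")
        for i, state in enumerate(states):
            core.default_cuda_generator(i).set_state(state)
    elif paddle.is_compiled_with_xpu():
        if len(states) != core.get_xpu_device_count():
            raise ValueError("Length of xpu state list should be equal to the xpu device count")
        for i, state in enumerate(states):
            core.default_xpu_generator(i).set_state(state)
```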
PR types
Function optimization
PR changes
Others
Description
Support the unified checkpoint (ckpt) function on XPU.