[XPU] support unified ckpt function #9312
Conversation
Thanks for your contribution!
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9312      +/-   ##
===========================================
+ Coverage    52.80%    52.92%   +0.11%
===========================================
  Files          660       660
  Lines       106869    106875       +6
===========================================
+ Hits         56434     56564     +130
+ Misses       50435     50311     -124

☔ View full report in Codecov by Sentry.
if paddle.is_compiled_with_xpu():
    # XPU does not support all_reduce prod yet; on XPU, bool is treated as int8,
    # so temporarily use reduce_min instead.
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
Then let's just change it to dist.all_reduce(local_resume, op=dist.ReduceOp.MIN) everywhere; there is no need to special-case XPU versus GPU.
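A minimal sketch of the unified call the reviewer suggests (the helper name is hypothetical; it assumes `local_resume` holds a 0/1 integer flag per rank and that the parallel environment is already initialized):

```python
import paddle
import paddle.distributed as dist

# Hypothetical helper: each rank reports whether it can resume.
# ReduceOp.MIN over a 0/1 tensor acts as a logical AND across ranks,
# so the same call works on GPU and on XPU (which lacks all_reduce prod).
def all_ranks_can_resume(local_ok: bool) -> bool:
    local_resume = paddle.to_tensor([1 if local_ok else 0], dtype="int32")
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
    return bool(local_resume.item())
```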
Done
if paddle.is_compiled_with_xpu():
    # XPU does not support all_reduce prod yet; on XPU, bool is treated as int8,
    # so temporarily use reduce_min instead.
    dist.all_reduce(local_resume, op=dist.ReduceOp.MIN)
Same as above.
LGTM
LGTM
if len(checkpoint_rng_state["cuda"]) != core.get_xpu_device_count():
    raise ValueError("Length of xpu state list should be equal to the xpu device count")
for i in range(core.get_xpu_device_count()):
    core.default_xpu_generator(i).set_state(checkpoint_rng_state["cuda"][i])
Is the XPU handling here any different from the custom device handling further down? A change on the framework side seems more appropriate.
On the framework side, XPU is at the same level as GPU and CPU as a device type, while custom devices (e.g., Hygon, Ascend) are grouped together separately.
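A hedged sketch of what that per-device dispatch can look like (the function name is hypothetical; it assumes the generator states are stored under the "cuda" key, as in the snippet quoted above):

```python
import paddle
from paddle.framework import core

# Hypothetical restore helper: XPU is handled as a first-class device,
# parallel to the CUDA branch, rather than through the custom-device path.
def restore_device_rng_state(checkpoint_rng_state):
    # Key reused for the active device's per-card generator states.
    states = checkpoint_rng_state["cuda"]
    if paddle.is_compiled_with_cuda():
        if len(states) != core.get_cuda_device_count():
            raise ValueError("Length of cuda state list should be equal to the cuda device count")
        for i, state in enumerate(states):
            core.default_cuda_generator(i).set_state(state)
    elif paddle.is_compiled_with_xpu():
        if len(states) != core.get_xpu_device_count():
            raise ValueError("Length of xpu state list should be equal to the xpu device count")
        for i, state in enumerate(states):
            core.default_xpu_generator(i).set_state(state)
```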
PR types
Function optimization
PR changes
Others
Description
Support the unified checkpoint (ckpt) function on XPU.