[NPU] refine update_loss_scaling npu kernel #32580
Conversation
Thanks for your contribution!
auto g = out->mutable_data<T>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
                         out->numel() * sizeof(T), stream);
auto runner_zeros = NpuOpRunner("ZerosLike", {*out}, {*out});
mutable_data is needed.
done
LGTM
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
Use ZerosLike and Memcpy instead of NPUMemsetAsync.

As shown in the timeline, there is a blank corresponding to update_loss_scaling_op caused by NPUMemsetAsync; update_loss_scaling_op costs about 103 ms.

If ZerosLike alone replaces NPUMemsetAsync, update_loss_scaling_op launches many ZerosLike NPU ops and costs about 22.2 ms.

With this change, update_loss_scaling_op launches only one ZerosLike NPU op and then uses Memcpy to set the remaining tensors to 0; it costs about 5.5 ms.

Performance
Speed up: 19448 tokens/s -> 20679 tokens/s, +6.33 %