Support async copy for TensorFromVector with event #32563
VLOG(4) << " --------- liym27: TopDecoratedAllocator : "
           " decorated_allocators_.size() = "
        << decorated_allocators_.size();
debug code can be reduced.
done
LGTM
void* p;
// PINNED memory is visible to all NPU contexts.
auto result = aclrtMallocHost(&p, size);
Can be replaced with malloc in the next PR.
Thanks. I will fix it in the next PR.
LGTM
Sorry to inform you that 8e5a23b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
LGTM
Use malloc/free to replace aclrtMallocHost/aclrtFreeHost
Avoid vector->npu_pinned_tensor when numel==1
Polish code
Fix compile error
LGTM
PR types
Performance optimization
PR changes
Others
Describe
Support async copy for TensorFromVector and FillNpuTensorWithConstant with event.
1. Function:
Use the event mechanism to support asynchronous data copy in TensorFromVector and FillNpuTensorWithConstant.
2. Benefits:
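2.1 Correctness: data that is being copied asynchronously is guaranteed not to be destructed before the copy completes.
After the cpu->npu copy, an event is inserted into the stream. When the data is destructed, that event is queried; if it has not completed, destruction is deferred, which guarantees "copy first, then destruct".
The cpu data is first copied to a cpu tensor, aka npu pinned tensor, and the npu pinned tensor is then copied to the npu tensor. NPUPinnedAllocator manages allocation and release of the memory of the npu pinned tensor, and releases that memory only after the event associated with the npu pinned tensor has completed. This feature is implemented in [NPU] Support npu pinned allocator #32840.
2.2 Performance improvement: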
Async copy is faster than sync copy. Performance for single-machine training of ernie3.0 is shown in the table below:
The code in this PR is scheme 4.
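The "copy first, then destruct" rule from 2.1 can be illustrated with a small host-only sketch. It is not the code from this PR or from #32840: DeferredFreeAllocator, EventDoneQuery, and PendingCount are hypothetical names, and the device event is replaced by a plain callback so the example builds without the NPU runtime. A real allocator would presumably query the event through the runtime instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <list>
#include <utility>

// Hypothetical stand-in for "query whether the event recorded after the
// async copy has completed". Not part of Paddle or the NPU runtime.
using EventDoneQuery = std::function<bool()>;

// Minimal sketch: a host allocator that defers a free while the copy
// reading from the buffer may still be in flight.
class DeferredFreeAllocator {
 public:
  // Assumption for the sketch: the caller has synchronized the stream
  // before the allocator is destroyed, so pending buffers can be freed.
  ~DeferredFreeAllocator() {
    for (auto& entry : pending_) std::free(entry.first);
  }

  // Assumes size > 0. Also reclaims buffers whose events have completed.
  void* Alloc(std::size_t size) {
    ReleaseCompleted();
    void* p = std::malloc(size);
    assert(p != nullptr && "host allocation failed");
    return p;
  }

  // Called when the owner of `p` is destructed. Frees immediately if the
  // copy is done, otherwise parks the buffer on the pending list.
  void Free(void* p, EventDoneQuery done) {
    if (done()) {
      std::free(p);
    } else {
      pending_.emplace_back(p, std::move(done));
    }
  }

  // Frees every parked buffer whose event now reports completion.
  void ReleaseCompleted() {
    for (auto it = pending_.begin(); it != pending_.end();) {
      if (it->second()) {
        std::free(it->first);
        it = pending_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t PendingCount() const { return pending_.size(); }

 private:
  std::list<std::pair<void*, EventDoneQuery>> pending_;
};
```

In this sketch, calling Free while the callback still returns false keeps the buffer alive on the pending list, and a later ReleaseCompleted (or the next Alloc) reclaims it once the callback returns true. Polling on allocation is one simple design choice; it avoids a background thread at the cost of holding memory until the next allocator call.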