[cuda] Use cuMemsetD32 to fill scalar ndarray #3907
Conversation
Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
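For reference, a minimal sketch of stream capture with the CUDA driver API; the function name, stream handle, and the recorded fill below are illustrative, not code from this PR:

```cpp
#include <cuda.h>
#include <cstddef>

// Minimal sketch of deferring work via CUDA stream capture (driver API).
// The captured command here is just an example 32-bit fill.
void capture_deferred_fill(CUstream stream, CUdeviceptr dst,
                           std::size_t num_words) {
  CUgraph graph;

  // Commands submitted to `stream` after this call are recorded, not run.
  cuStreamBeginCapture(stream, CU_STREAM_CAPTURE_MODE_GLOBAL);

  // The fill is deferred into the capture instead of executing eagerly.
  cuMemsetD32Async(dst, /*ui=*/0, num_words, stream);

  // End capture; `graph` holds the recorded command list and can later be
  // instantiated (cuGraphInstantiate) and replayed (cuGraphLaunch).
  cuStreamEndCapture(stream, &graph);
}
```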
@bobcao3 thanks for the information. At the moment, we do not want to defer the fill execution. In that case, we don't need to over-utilize the CommandList utility. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?
@qiao-bo I don't think that's ideal. Not all APIs support immediate-mode resource content filling. In addition, in hardware these commands are executed by the DMA/transfer engine, which uses a queue. Since all of this is on the C++ side, it is fine to have it under CommandList; it's reasonably fast. Moreover, after we move everything into CommandList, we should expect a drop in overall host-side overhead, as CUDA then has a complete command graph to work with and uses less implicit synchronization.
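A toy illustration of that deferred model; the types and member names below are simplified and hypothetical, not Taichi's actual CommandList interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified command list: fills are recorded cheaply on the
// host and only handed to the driver as a batch when the list is submitted.
struct FillCommand {
  std::uintptr_t dst;     // device address to fill
  std::uint32_t value;    // 32-bit fill pattern
  std::size_t num_words;  // number of 32-bit elements
};

class CommandListSketch {
 public:
  // Recording defers the work; nothing touches the GPU yet.
  void buffer_fill(std::uintptr_t dst, std::uint32_t value, std::size_t n) {
    commands_.push_back({dst, value, n});
  }

  // Submission hands the whole batch to the transfer/DMA queue (e.g. as
  // cuMemsetD32Async calls or a captured graph), reducing host-side sync.
  void submit() {
    for (const auto &cmd : commands_) {
      (void)cmd;  // placeholder for the actual per-command driver call
    }
    commands_.clear();
  }

 private:
  std::vector<FillCommand> commands_;
};
```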
I agree with the logic. But it seems we have not implemented any device utilities to support immediate content filling for the LLVM backend? I see a potentially major change here; does it still fit into this PR?
LGTM
@@ -604,5 +604,18 @@ uint64_t *LlvmProgramImpl::get_ndarray_alloc_info_ptr(DeviceAllocation &alloc) {
    return (uint64_t *)cpu_device()->get_alloc_info(alloc).ptr;
  }
}

void LlvmProgramImpl::fill_ndarray(DeviceAllocation &alloc,
@qiao-bo nit: Do we need a mutable reference DeviceAllocation& here? If not, let's make it const. Otherwise, let's follow Google's style and use a pointer.
Replace the Ndarray fill method's previous kernel-based approach with a memset-based implementation.

Performance improvement: benchmarked with test_script.py on an RTX 3080 over multiple iterations.

Status: this PR addresses only scalar ndarray on CUDA. Support for Matrix/Vector ndarray and for the CPU backend will follow in later PRs.
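For context, a rough sketch of what a cuMemsetD32-based scalar fill looks like at the driver-API level; the actual body of LlvmProgramImpl::fill_ndarray is not shown in this excerpt, so the helper name and parameters below are illustrative assumptions:

```cpp
#include <cuda.h>
#include <cstddef>
#include <cstdint>

// Rough sketch: fill `num_words` 32-bit elements at a device address with
// `data` via a single driver memset, instead of launching a fill kernel.
// `device_ptr` stands in for the address obtained from the ndarray's
// DeviceAllocation (e.g. via get_ndarray_alloc_info_ptr in the diff above).
void fill_scalar_ndarray_sketch(CUdeviceptr device_ptr,
                                std::size_t num_words,
                                std::uint32_t data) {
  // cuMemsetD32 writes `num_words` copies of the 32-bit pattern `data`
  // starting at `device_ptr`.
  cuMemsetD32(device_ptr, data, num_words);
}
```

cuMemsetD32 replicates a fixed 32-bit pattern across the range, which matches filling a scalar ndarray of 32-bit elements with a single value.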