[gpu] Add ability to download contiguous chunk of memory to host using Device{Array,Memory}
#4741
Conversation
if (device_end_offset < device_begin_offset) {
return false;
}
const T* begin = ptr() + device_begin_offset;
Can we move this line to DeviceMemory as well? That'll keep DeviceMemory from downloading from a random location accidentally
Thanks for your comment!
I think this would be possible by altering the function signature of DeviceMemory::download. The DeviceMemory class acts as blob storage and does not know the type of elements it stores, so defining memory locations to sync within its member functions is currently not feasible. However, if the API were DeviceMemory::Download(host_ptr, device_begin_offset, device_end_offset, elem_size), we could include the lines you mentioned in the DeviceMemory class. I am not sure this is useful, though. Firstly, our cudaSafeCall macros wrapping CUDA interactions report errors, and users can guard themselves against this by using try-catch statements. Secondly, I did not see any PCL code using DeviceMemory directly; all interactions are through the DeviceArray class. What do you think about that?
users can guard themselves against this by using try-catch statements.
If there's a potentially safer API where such counter-measures aren't required on the user's end, then we should aim for that. PCL has a lot of baggage, let's not add another item for a future cleanup 😆
Secondly, I did not see any PCL code using DeviceMemory directly
It's public API. Even if we don't use it, someone might be using it. Another point is that if we can have a consistent API, then it's friendlier for downstream users.
Since DeviceMemory (DM) works on bytes, and not datatype T, we can have the following API to mirror the DeviceArray (DA) API:
DM::download(host_ptr, device_begin_byte_offset, device_end_byte_offset)
My comments are relevant mostly to the public API. The private/protected API can be more raw, e.g.:
DA::download/3: check_offset/2 && DM::download_protected/3
DM::download/3: check_offset/2 && DM::download_protected/3
DM::download_protected/3: cudaMemcpy/4 && cudaSync/0
(please forgive the weird syntax used)
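To make that layering concrete, here is a rough C++ sketch of what such a byte-based DeviceMemory-style API could look like. The class name, check_offsets, and download_protected are illustrative assumptions, and the raw CUDA calls stand in for PCL's cudaSafeCall wrappers; this is a sketch, not the PCL implementation.

#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical byte-based blob container mirroring the proposal above.
class DeviceBlob {
public:
  // Public API: validate the requested byte range, then delegate to the raw copy.
  bool download(void* host_ptr, std::size_t begin_byte, std::size_t end_byte) const {
    if (!check_offsets(begin_byte, end_byte))
      return false;
    download_protected(host_ptr, begin_byte, end_byte - begin_byte);
    return true;
  }

protected:
  // Shared range check: begin <= end and end inside the allocation.
  bool check_offsets(std::size_t begin_byte, std::size_t end_byte) const {
    return begin_byte <= end_byte && end_byte <= size_bytes_;
  }

  // Raw copy without validation: byte arithmetic via char*, then cudaMemcpy + sync.
  void download_protected(void* host_ptr, std::size_t begin_byte, std::size_t num_bytes) const {
    const char* begin = static_cast<const char*>(data_) + begin_byte;
    cudaMemcpy(host_ptr, begin, num_bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
  }

private:
  void* data_ = nullptr;       // device allocation (blob storage)
  std::size_t size_bytes_ = 0; // allocation size in bytes
};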
Thanks for your reply! Your suggestion for the DM::download signature is very useful and I will try to implement it - thank you!
Thanks again for your comments, @kunaltyagi! I tried to address your concerns, but I am not sure my implementation is proper. As the device memory is type agnostic, we cannot perform pointer arithmetic to calculate the position of the data to be downloaded. Instead, I cast the void* pointer to a char* to perform the required arithmetic. I am not sure if this is idiomatic code or if a better solution exists. What do you think about the implementation?
The cast to char* and back to void* is inevitable due to the existing code. Do you think we need to make the conversion to void* explicit (using static_cast)?
Rest LGTM (already approved)
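For context, a char* converts to void* implicitly when passed to cudaMemcpy's void* parameters, so an explicit static_cast is a purely stylistic choice. A hypothetical fragment showing both forms (the function and parameter names are made up for illustration):

#include <cstddef>
#include <cuda_runtime.h>

// Both calls are equivalent; dst is a device char*, src a host buffer.
void copy_variants(char* dst, const void* src, std::size_t num_bytes) {
  cudaMemcpy(dst, src, num_bytes, cudaMemcpyHostToDevice);                     // implicit char* -> void*
  cudaMemcpy(static_cast<void*>(dst), src, num_bytes, cudaMemcpyHostToDevice); // explicit conversion
}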
Fantastic - thank you so much for your review, @kunaltyagi!
I just checked the API. What are your thoughts on adding an equivalent?
PS: It's not in our current use-case (the Euclidean clustering PR)
I think this is an intriguing idea! I began experimenting with it to judge how useful this would be (I hope to avoid some memory allocations on the device). I will soon reply to your question with a more educated opinion.
I think having the upload functionality is a wonderful idea. The STL has the agnostic copy function, and I don't see any reason why our users should only be able to download parts of the device array but not upload to parts of it. So we introduce:
We could even think of unifying these two functions with a
Iterator and copy function can be another PR post discussion (without the destination required, since iterators have that info)
Thanks for your comments, @kunaltyagi! I have added the upload functionality and think this is a good addition - thanks for the suggestion! I have not factored the functionality in
Thanks a lot for reviewing and approving the changes, @kunaltyagi! It was again great fun to work on this PR. Just to understand how to proceed: Should I merge this branch into the branch used for #4677 to address Lars' comments? I am not used to working on such large projects and sometimes get confused about dealing with different branches.
No. This will get merged first. Then you can rebase your older PR on top of the new master.
Sorry for the delay.
I missed the create call. And the begin_idx and (unsigned) num_elements sounds like a nice interface :)
Thank you, @mvieth and @kunaltyagi, for your comments! These are very helpful indeed - I will think about them and write a more detailed response!
Thanks for your comments, Markus! These prompted me to reflect and I think I can address both now:
Ad 1:
i) The suggested public API is
ii) I think we could check early in
Ad 2:
Does that address your questions, Markus? I hope I understood them correctly!
1.i) Start+end iterators (and as the next best thing, start+end pointers) are not really used elsewhere in
Thank you, Markus, for your detailed comments. All this sounds convincing to me, and I'll happily implement that - thanks!
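As a sketch of where this discussion seems to land (a start index plus an element count), the DeviceArray-level declarations might look roughly like this; the class and parameter names are assumptions for illustration, not the merged signatures:

#include <cstddef>

// Hypothetical DeviceArray-style interface for partial transfers.
template <typename T>
class DeviceArraySketch {
public:
  // Copy num_elements elements, starting at begin_idx on the device, into host_ptr.
  bool download(T* host_ptr, std::size_t begin_idx, unsigned num_elements) const;

  // Copy num_elements elements from host_ptr into the device array at begin_idx;
  // unlike the allocating upload overload, this never resizes the array.
  bool upload(const T* host_ptr, std::size_t begin_idx, unsigned num_elements);
};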
Here are some comments on the new changes, mostly minor stuff
gpu/containers/src/device_memory.cpp
Outdated
const void* const begin = static_cast<char*>(data_) + device_begin_byte_offset;
const char* const download_end = static_cast<const char*>(begin) + num_bytes;
const char* const array_end = static_cast<char*>(data_) + sizeBytes_;
if (download_end > array_end) {
Correct me if I'm wrong, but I think (device_begin_byte_offset + num_bytes) > sizeBytes_ should have the same effect, and is IMO a bit more readable (same for upload).
This is great! I think you are correct, and it's much more readable. I will think about its correctness a bit more and apply the changes. Thanks for the suggestion!
This was an excellent comment - thank you, Markus! Your remark also made me realize that the "const-ness" of the function arguments was wrong. I hope/think it is correct now. Again, thank you!
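A minimal sketch of the arithmetic form of that check, assuming both arguments are byte counts within an allocation of sizeBytes_ bytes (variable names follow the snippet above; overflow of the sum is not handled here):

#include <cstddef>

// True when [device_begin_byte_offset, device_begin_byte_offset + num_bytes)
// lies inside the allocation of sizeBytes_ bytes.
bool range_fits(std::size_t device_begin_byte_offset,
                std::size_t num_bytes,
                std::size_t sizeBytes_) {
  return device_begin_byte_offset + num_bytes <= sizeBytes_;
}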
gpu/containers/src/device_memory.cpp
Outdated
if (upload_end > array_end) {
return false;
}
cudaSafeCall(cudaMemcpy(host_ptr_arg, begin, num_bytes, cudaMemcpyHostToDevice));
I think host_ptr_arg and begin have to be switched (compare the other upload function).
Hm..., I have to think about that but at first glance I would say you are right... Strange that this didn't result in an error when I thought I tested it. Thanks for pointing this out!
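For reference, cudaMemcpy takes the destination pointer first, so a host-to-device copy with the snippet's names would look like this; a sketch, not the merged code:

#include <cstddef>
#include <cuda_runtime.h>

// Upload: the device pointer (begin) is the destination, the host pointer is the source.
cudaError_t upload_bytes(void* begin, const void* host_ptr_arg, std::size_t num_bytes) {
  return cudaMemcpy(begin, host_ptr_arg, num_bytes, cudaMemcpyHostToDevice);
}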
/** \brief Uploads data from CPU memory to device array. Please note
 * that this overload never allocates memory in contrast to the
 * other upload function.
 * Returns true if upload successfull
Suggested change:
* Returns true if upload successfull
* Returns true if upload successful
Once more below; also consider doxygen's \return tag.
Thanks!
This should be fixed - thanks for the comment!
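One possible shape for such a comment using the \return tag (illustrative wording only, not the exact text in the PR):

/** \brief Uploads data from CPU memory into an existing range of the device array.
 * \note In contrast to the other upload overload, this one never allocates memory.
 * \return true if the upload succeeded, false if the requested range does not fit
 */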
Thanks for your comments, Markus and Kunal! They helped me a lot. Your remarks about Doxygen highlighted the disparity between the return types of the existing upload/download functions and the new overloads: the current functions return void, but the new functions return a bool. Should we harmonize this? If so, how?
Since the formatting pull request got merged, there are a few conflicts. Could you resolve them, please?
Make the old functions return bool as well :)
Force-pushed from 5e8ca77 to 518503b
Thanks for the info, Markus! I rebased the branch, and I think it's ready for a merge. I initially expected to add a bool return for the old functions, too. However, I don't see good conditions to check (other than maybe inspecting the CUDA error flags, which would be a duplication since they are noisy anyway). What do you think about that?
This commit tries to address #4689, in line with the comments received in #4677.
Users can now download a contiguous range of data from the device array to the host instead of downloading the entire device array. This provides users with greater flexibility when interacting with the device array, potentially speeds up host-device communication, and pushes CUDA details into the device array implementation.
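A hypothetical usage sketch of the partial download; the exact parameter order and types may differ from the merged signature, so see the PR diff for the authoritative API:

#include <vector>
#include <pcl/gpu/containers/device_array.h>

void copy_chunk(pcl::gpu::DeviceArray<float>& device_data) {
  // Transfer 128 elements starting at index 256 into a host buffer,
  // instead of downloading the whole array.
  std::vector<float> host_chunk(128);
  const bool ok = device_data.download(host_chunk.data(), 256, 128);
  (void)ok; // false if the requested range does not fit into the device array
}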
If there is interest, we can additionally implement the resize functionality for the device array, as outlined in #4689.
Should I change the PR in some way? I am grateful for any comments or suggestions!