
[gpu] Add ability to download contiguous chunk of memory to host using Device{Array,Memory} #4741

Merged

Conversation

@FabianSchuetze (Contributor) commented Apr 30, 2021

This commit tries to address #4689, in line with the comments received in #4677.

Users can now download an interval range of data from the device array to the host instead of downloading the entire device array. This provides users with greater flexibility when interacting with the device array, potentially speeds up host-device communication, and pushes CUDA details into the device array implementation.
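For illustration, a minimal sketch of how such a call might look (the wrapper function and buffer names are hypothetical, and the exact signature is refined during review below, ultimately to a begin offset plus element count):

#include <vector>
#include <pcl/gpu/containers/device_array.h>

void example(pcl::gpu::DeviceArray<float>& device_array)
{
  // Sketch: download only elements [100, 200) of the device array into a
  // host buffer, instead of copying the entire array back to the host.
  std::vector<float> host_buffer(100);
  device_array.download(host_buffer.data(), 100, 200); // begin/end element offsets
}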

If there is interest in it, we can additionally implement the resize functionality for the device array, as outlined in #4689.

Should I change the PR in some way? I am grateful for any comments or suggestions!

if (device_end_offset < device_begin_offset) {
return false;
}
const T* begin = ptr() + device_begin_offset;
@kunaltyagi (Member) commented Apr 30, 2021

Can we move this line to DeviceMemory as well? That'll keep DeviceMemory from downloading from a random location accidentally

@FabianSchuetze (Contributor, Author):

Thanks for your comment!

I think this would be possible by altering the function signature of DeviceMemory::download. The DeviceMemory class acts as blob storage and does not know the type of the elements it stores, so defining which memory locations to sync inside its member functions is currently not feasible. If the API were DeviceMemory::Download(host_ptr, device_begin_offset, device_end_offset, elem_size), we could include the lines you mentioned in the DeviceMemory class. However, I am not sure this is useful. Firstly, our cudaSafeCall macros wrapping CUDA interactions report errors, and users can guard themselves against this by using try-catch statements. Secondly, I did not see any PCL code using DeviceMemory directly. Instead, all interactions are through the DeviceArray class. What do you think about that?

@kunaltyagi (Member):

> users can guard themselves against this by using try-catch statements.

If there's a potentially safer API where such counter-measures aren't required on the user's end, then we should aim for that. PCL has a lot of baggage, let's not add another item for a future cleanup 😆

> Secondly, I did not see any PCL code using DeviceMemory directly

It's public API. Even if we don't use it, someone might be using it. Another point is that if we can have a consistent API, then it's friendlier for downstream users.

Since DeviceMemory (DM) works on bytes, and not datatype T, we can have the following API to mirror the DeviceArray (DA) API:

DM::download(host_ptr, device_begin_byte_offset, device_end_byte_offset)

My comments are relevant mostly to the public API. The private/protected API can be more raw. Eg:

DA::download/3: check_offset/2 && DM::download_protected/3
DM::download/3: check_offset/2 && DM::download_protected/3
DM::download_protected/3: cudaMemcpy/4 && cudaSync/0

(please forgive the weird syntax used)
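In concrete C++, that layering might look roughly like this (a sketch only; check_offset and download_protected are the placeholder names from the scheme above, not an existing PCL API, and data_ is DeviceMemory's untyped buffer pointer):

// Public download() validates the byte range, then delegates to a
// protected raw copy. DeviceArray<T> would convert element offsets to
// byte offsets before calling down into DeviceMemory.
bool
DeviceMemory::download(void* host_ptr, std::size_t begin_byte, std::size_t end_byte) const
{
  if (!check_offset(begin_byte, end_byte)) // reject inverted or out-of-range spans
    return false;
  download_protected(host_ptr, begin_byte, end_byte - begin_byte);
  return true;
}

void
DeviceMemory::download_protected(void* host_ptr, std::size_t begin_byte, std::size_t num_bytes) const
{
  const char* begin = static_cast<const char*>(data_) + begin_byte;
  cudaSafeCall(cudaMemcpy(host_ptr, begin, num_bytes, cudaMemcpyDeviceToHost)); // cudaMemcpy/4
  cudaSafeCall(cudaDeviceSynchronize());                                        // cudaSync/0
}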

@FabianSchuetze (Contributor, Author):

Thanks for your reply! Your suggestion for the DM::download signature is very useful and I will try to implement it - thank you!

@FabianSchuetze (Contributor, Author):

Thanks again for your comments, @kunaltyagi! I tried to address your concerns, but I am not sure my implementation is proper. As the device memory is type agnostic (its data pointer is a void*), we cannot perform pointer arithmetic on it directly to calculate the position of the data to be downloaded. Instead, I cast the void* pointer to a char* to perform the required arithmetic. I am not sure if this is idiomatic code or if a better solution exists. What do you think about the implementation?
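For reference, the kind of arithmetic being described (a sketch, not the literal PR diff; data_ and the offset/size names follow those used elsewhere in this thread):

// data_ is an untyped void*; pointer arithmetic on void* is ill-formed
// in C++, so cast to char* to step through the buffer in byte units.
const char* const begin =
    static_cast<const char*>(data_) + device_begin_byte_offset;
// cudaMemcpy takes void* parameters, so the conversion back from char*
// to void* happens implicitly at the call site.
cudaSafeCall(cudaMemcpy(host_ptr, begin, num_bytes, cudaMemcpyDeviceToHost));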

@kunaltyagi (Member) commented May 5, 2021

The cast to char* and back to void* is inevitable due to the existing code. Do you think we need to make the conversion to void* explicit (using static_cast)?

Rest LGTM (already approved)

@FabianSchuetze (Contributor, Author):

Fantastic - thank you so much for your review, @kunaltyagi !

kunaltyagi previously approved these changes May 5, 2021
@kunaltyagi added the changelog: enhancement and module: gpu labels May 5, 2021
@kunaltyagi changed the title Expand device array api → [gpu] Add ability to download contiguous chunk of memory using Device{Array,Memory} May 5, 2021
@kunaltyagi changed the title [gpu] Add ability to download contiguous chunk of memory using Device{Array,Memory} → [gpu] Add ability to download contiguous chunk of memory to host using Device{Array,Memory} May 5, 2021
@kunaltyagi dismissed their stale review May 5, 2021 11:55

Update requested

@kunaltyagi (Member):

I just checked the API. What are your thoughts on adding an equivalent upload alternative to keep things symmetrical?

PS: It's not in our current use-case (the Euclidean clustering PR)

@FabianSchuetze (Contributor, Author):

> I just checked the API. What are your thoughts on adding an equivalent upload alternative to keep things symmetrical?
>
> PS: It's not in our current use-case (the Euclidean clustering PR)

I think this is an intriguing idea! I began experimenting with it to judge how useful this would be (I hope to avoid some memory allocations on the device). I will soon reply to your question with a more educated opinion.

@FabianSchuetze (Contributor, Author):

I think having the upload functionality is a wonderful idea. The STL has the type-agnostic copy function, and I don't see any reason why our users should only be able to download parts of the device array but not upload to parts of it. So we would introduce DeviceArray::upload(host_ptr, device_begin_offset, device_end_offset)? That would be lovely!

We could even think of unifying these two functions with a gpu::copy(source_begin_iterator, source_end_iterator, dest_begin_iterator) API. However, I think this would require us to introduce iterators, because we need to deduce the direction of the copy from them. Alternatively, we could introduce a gpu::copy(source_begin*, source_end*, dest_begin*, direction) API, similar to cudaMemcpy. Do you think such a copy function would be worthwhile? An advantage would be conforming to the STL API, at the cost of some "visual noise", because we would retain the original upload/download functions while introducing the copy function.
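For concreteness, the direction-flag variant could look roughly like this (purely hypothetical; neither gpu::copy nor CopyDirection exists in this PR, and PCL's cudaSafeCall macro is assumed to be in scope):

#include <cstddef>
#include <cuda_runtime.h>

enum class CopyDirection { HostToDevice, DeviceToHost };

// Hypothetical gpu::copy mirroring cudaMemcpy's explicit direction flag.
template <typename T>
void
copy(const T* src_begin, const T* src_end, T* dst_begin, CopyDirection dir)
{
  const std::size_t num_bytes = (src_end - src_begin) * sizeof(T);
  const cudaMemcpyKind kind = (dir == CopyDirection::HostToDevice)
                                  ? cudaMemcpyHostToDevice
                                  : cudaMemcpyDeviceToHost;
  cudaSafeCall(cudaMemcpy(dst_begin, src_begin, num_bytes, kind));
}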

@kunaltyagi (Member) commented May 10, 2021

Iterators and a copy function can be another PR, after discussion (without the direction argument required, since iterators carry that information)

@FabianSchuetze (Contributor, Author):

Thanks for your comments, @kunaltyagi! I have added the upload functionality and think it is a good addition - thanks for the suggestion! I have not factored the functionality in device_memory.cpp into a private function, because this hasn't been done for the other upload/download functions. However, I am happy to do so if you think that is preferred.

kunaltyagi previously approved these changes May 11, 2021
@FabianSchuetze (Contributor, Author):

Thanks a lot for reviewing and approving the changes, @kunaltyagi! It was again great fun to work on this PR. Just to understand how to proceed: should I merge this branch into the branch used for #4677 to address Lars' comments? I am not used to working on such large projects and sometimes get confused when dealing with different branches.

@kunaltyagi (Member):

No. This will get merged first. Then you can rebase your older PR on top of the new master

@mvieth (Member) commented Jun 6, 2021

Sorry for the delay.

  1. Your proposal uses begin and end offsets to specify the upload/download region. An alternative would be the begin offset and the number of elements. That would make the device_end_byte_offset < device_begin_byte_offset check unnecessary, and would work a bit more nicely with cudaMemcpy. Which option do you think would be more practical to use, and which one would fit better for the GPU clustering that you are also working on? (Sorry if there already was a discussion about this and I didn't see it)
  2. The other upload function calls create internally - do you think that should also be the case for the new upload function? If not, but there are still assumptions the upload function makes or preconditions that must be fulfilled, those should appear in the documentation.

@kunaltyagi (Member):

I missed the create call. And the begin_idx and (unsigned) num_elements sounds like a nice interface :)

@FabianSchuetze (Contributor, Author) commented Jun 9, 2021

Thank you, @mvieth and @kunaltyagi, for your comments! These are very helpful indeed - I will think about them and write a more detailed response!

@FabianSchuetze (Contributor, Author):

Thanks for your comments, Markus! These prompted me to reflect and I think I can address both now:

Ad 1:
I think the first question can be split into two parts. One concerns the public API of DeviceArray<T>::download/upload (I use download for concreteness below), while the other relates to checking the validity of the passed arguments.

i) The suggested public API is download(T* host_begin, std::size_t device_begin_offset, std::size_t device_end_offset). I would have preferred to use iterators instead and write download(host_iterator host_begin, device_iterator device_begin, device_iterator device_end), because such an API bears a closer resemblance to the std and thrust containers. As we don't have iterators yet (I'd love to work on them once the other PRs are merged or rejected), I considered the proposed API the second-best option. Nevertheless, I'm totally fine with converting it to download(T* host_begin, std::size_t device_begin_offset, std::size_t num_elements).

ii) I think we could check early in DeviceArray<T>::download if the passed begin and end (or size) arguments are indeed feasible - that is probably a good idea!

Ad 2:
I agree that I should document the difference between the two upload functions better! The new upload function intentionally avoids memory allocations to improve performance.

Does that address your questions, Markus? I hope I understood them correctly!

@mvieth (Member) commented Jun 11, 2021

1.i) Start+end iterators (and, as the next best thing, start+end pointers) are not really used elsewhere in DeviceMemory and DeviceArray, so they would be something of an exception for the new download/upload. Similarity to the stdlib and thrust is IMO not strictly necessary, and the start+count concept would even be similar to copy_n and memcpy. Taking the GPU clustering as an example: when we want to copy sections of data, we know where they start and how many elements (indices) we want to copy (as read from sizes). We do not know the end pointer directly; we have to compute it first, and in the download function we have to do the inverse calculation for cudaMemcpy. For these reasons and the ones above, I think start+count would be slightly more practical than start+end.
1.ii) The DeviceMemory functions should check that too, though, because users can use DeviceMemory directly. And checking it twice is unnecessary, so I think leaving the checks in the DeviceMemory functions is fine.
For more complete checks, we should make sure that the end is not outside of the memory (in addition to the check that start comes before end) for the start+end concept. Alternatively, for the start+count concept, we would only have to check that start+count is less than or equal to the total number of bytes/elements (see the sketch below).
2) If it is properly documented, then it is fine by me
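In code, the start+count check could be as simple as (a sketch, reusing the sizeBytes_ member from the diff below):

// A single comparison suffices: an inverted range cannot even be
// expressed, and overruns past the allocation are rejected.
if (device_begin_byte_offset + num_bytes > sizeBytes_)
  return false; // requested region extends past the end of the buffer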

@FabianSchuetze (Contributor, Author):

Thank you, Markus, for your detailed comments. All this sounds convincing to me, and I'll happily implement that - thanks!

@mvieth (Member) left a comment:

Here are some comments on the new changes, mostly minor stuff

const void* const begin = static_cast<char*>(data_) + device_begin_byte_offset;
const char* const download_end = static_cast<const char*>(begin) + num_bytes;
const char* const array_end = static_cast<char*>(data_) + sizeBytes_;
if (download_end > array_end) {
@mvieth (Member):

Correct me if I'm wrong, but I think (device_begin_byte_offset + num_bytes) > sizeBytes_ should have the same effect, and is IMO a bit more readable (same for upload)

@FabianSchuetze (Contributor, Author):

This is great! I think you are correct, and it's much more readable. I will think about its correctness a bit more and apply the changes. Thanks for the suggestion!

@FabianSchuetze (Contributor, Author):

This was an excellent comment - thank you, Markus! Your remark also made me realize that the "const-ness" of the function arguments was wrong. I hope/think it is correct now. Again, thank you!

if (upload_end > array_end) {
return false;
}
cudaSafeCall(cudaMemcpy(host_ptr_arg, begin, num_bytes, cudaMemcpyHostToDevice));
@mvieth (Member):

I think host_ptr_arg and begin have to be switched (compare other upload function)
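For reference, cudaMemcpy takes the destination pointer first, so the corrected host-to-device call would read:

// cudaMemcpy(dst, src, count, kind): for cudaMemcpyHostToDevice the
// device-side pointer (begin) is the destination and the host pointer
// is the source.
cudaSafeCall(cudaMemcpy(begin, host_ptr_arg, num_bytes, cudaMemcpyHostToDevice));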

@FabianSchuetze (Contributor, Author):

Hm, I have to think about that, but at first glance I would say you are right. Strange that this didn't result in an error when I thought I had tested it. Thanks for pointing this out!

/** \brief Uploads data from CPU memory to device array. Please note
* that this overload never allocates memory in contrast to the
* other upload function.
* Returns true if upload successfull
@mvieth (Member):

Suggested change:
- * Returns true if upload successfull
+ * Returns true if upload successful

Once more below, also consider doxygen's \return tag
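For example, with doxygen's \return tag the comment could read (a sketch of the suggested style):

/** \brief Uploads data from CPU memory to the device array. Note that
 * this overload never allocates memory, in contrast to the other
 * upload function.
 * \return true if the upload was successful
 */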

@FabianSchuetze (Contributor, Author):

Thanks!

@FabianSchuetze (Contributor, Author):

This should be fixed - thanks for the comment!

@FabianSchuetze (Contributor, Author):

Thanks for your comments, Markus and Kunal! They helped me a lot. Your remarks about Doxygen highlighted a disparity between the return types of the existing upload/download functions and the new overloads: the current functions return void, but the new ones return bool. Should we harmonize this? If so, how?

@mvieth (Member) commented Jun 15, 2021

Since the formatting pull request got merged, there are a few conflicts. Could you resolve them please?

@kunaltyagi (Member):

> Should we harmonize this? If so, how?

Make old functions return bool as well :)

@FabianSchuetze force-pushed the expand_device_array_api branch from 5e8ca77 to 518503b, June 16, 2021 17:04
@FabianSchuetze (Contributor, Author):

Thanks for the info, Markus! I rebased the branch, and I think it's ready for a merge.

I initially expected to add a bool return for the old functions, too. However, I don't see good conditions to check (other than maybe inspecting the CUDA error flags, which would be a duplication, as they are noisy anyway). What do you think about that?

@kunaltyagi (Member):

return true? 😆

@kunaltyagi merged commit 22cc0c6 into PointCloudLibrary:master Jun 18, 2021
@kunaltyagi removed the needs: code review label Jun 18, 2021
tin1254 pushed a commit to tin1254/pcl that referenced this pull request Aug 10, 2021
mvieth pushed a commit to mvieth/pcl that referenced this pull request Dec 27, 2021