
[gpu] Add ability to download contiguous chunk of memory to host using Device{Array,Memory} #4741

Merged

Conversation

@FabianSchuetze (Contributor) commented Apr 30, 2021

This commit tries to address #4689, in line with the comments received in #4677.

Users can now download an interval range of data from the device array to the host instead of downloading the entire device array. This provides users with greater flexibility when interacting with the device array, potentially speeds up host-device communication, and pushes CUDA details into the device array implementation.
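For illustration, a minimal sketch of how such a call might look (the wrapper function and buffer names are hypothetical, and the exact signature is refined during review below, ultimately to a begin offset plus element count):

#include <vector>
#include <pcl/gpu/containers/device_array.h>

void example(pcl::gpu::DeviceArray<float>& device_array)
{
  // Sketch: download only elements [100, 200) of the device array into a
  // host buffer, instead of copying the entire array back to the host.
  std::vector<float> host_buffer(100);
  device_array.download(host_buffer.data(), 100, 200); // begin/end element offsets
}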

If there is interest in it, we can additionally implement the resize functionality for the device array, as outlined in #4689.

Should I change the PR in some way? I am grateful for any comments or suggestions!

if (device_end_offset < device_begin_offset) {
return false;
}
const T* begin = ptr() + device_begin_offset;
@kunaltyagi (Member) commented Apr 30, 2021

Can we move this line to DeviceMemory as well? That'll keep DeviceMemory from downloading from a random location accidentally

@FabianSchuetze (Contributor, Author):

Thanks for your comment!

I think this would be possible by altering the function signature of DeviceMemory::download. The DeviceMemory class acts as blob storage and does not know the type of the elements it stores, so defining which memory locations to sync inside its member functions is currently not feasible. If the API were DeviceMemory::Download(host_ptr, device_begin_offset, device_end_offset, elem_size), we could include the lines you mentioned in the DeviceMemory class. However, I am not sure this is useful. Firstly, our cudaSafeCall macros wrapping CUDA interactions report errors, and users can guard themselves against this by using try-catch statements. Secondly, I did not see any PCL code using DeviceMemory directly. Instead, all interactions are through the DeviceArray class. What do you think about that?

@kunaltyagi (Member):

> users can guard themselves against this by using try-catch statements.

If there's a potentially safer API where such counter-measures aren't required on the user's end, then we should aim for that. PCL has a lot of baggage, let's not add another item for a future cleanup 😆

> Secondly, I did not see any PCL code using DeviceMemory directly

It's public API. Even if we don't use it, someone might be using it. Another point is that if we can have a consistent API, then it's friendlier for downstream users.

Since DeviceMemory (DM) works on bytes, and not datatype T, we can have the following API to mirror the DeviceArray (DA) API:

DM::download(host_ptr, device_begin_byte_offset, device_end_byte_offset)

My comments are relevant mostly to the public API. The private/protected API can be more raw. Eg:

DA::download/3: check_offset/2 && DM::download_protected/3
DM::download/3: check_offset/2 && DM::download_protected/3
DM::download_protected/3: cudaMemcpy/4 && cudaSync/0

(please forgive the weird syntax used)
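In concrete C++, that layering might look roughly like this (a sketch only; check_offset and download_protected are the placeholder names from the scheme above, not an existing PCL API, and data_ is DeviceMemory's untyped buffer pointer):

// Public download() validates the byte range, then delegates to a
// protected raw copy. DeviceArray<T> would convert element offsets to
// byte offsets before calling down into DeviceMemory.
bool
DeviceMemory::download(void* host_ptr, std::size_t begin_byte, std::size_t end_byte) const
{
  if (!check_offset(begin_byte, end_byte)) // reject inverted or out-of-range spans
    return false;
  download_protected(host_ptr, begin_byte, end_byte - begin_byte);
  return true;
}

void
DeviceMemory::download_protected(void* host_ptr, std::size_t begin_byte, std::size_t num_bytes) const
{
  const char* begin = static_cast<const char*>(data_) + begin_byte;
  cudaSafeCall(cudaMemcpy(host_ptr, begin, num_bytes, cudaMemcpyDeviceToHost)); // cudaMemcpy/4
  cudaSafeCall(cudaDeviceSynchronize());                                        // cudaSync/0
}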

@FabianSchuetze (Contributor, Author):

Thanks for your reply! Your suggestion for the DM::download signature is very useful and I will try to implement it - thank you!

@FabianSchuetze (Contributor, Author):

Thanks again for your comments, @kunaltyagi! I tried to address your concerns, but I am not sure my implementation is proper. As the device memory is type agnostic (its data pointer is a void*), we cannot perform pointer arithmetic on it directly to calculate the position of the data to be downloaded. Instead, I cast the void* pointer to a char* to perform the required arithmetic. I am not sure if this is idiomatic code or if a better solution exists. What do you think about the implementation?
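For reference, the kind of arithmetic being described (a sketch, not the literal PR diff; data_ and the offset/size names follow those used elsewhere in this thread):

// data_ is an untyped void*; pointer arithmetic on void* is ill-formed
// in C++, so cast to char* to step through the buffer in byte units.
const char* const begin =
    static_cast<const char*>(data_) + device_begin_byte_offset;
// cudaMemcpy takes void* parameters, so the conversion back from char*
// to void* happens implicitly at the call site.
cudaSafeCall(cudaMemcpy(host_ptr, begin, num_bytes, cudaMemcpyDeviceToHost));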

@kunaltyagi (Member) commented May 5, 2021

The cast to char* and back to void* is inevitable due to the existing code. Do you think we need to make the conversion to void* explicit (using static_cast)?

Rest LGTM (already approved)

@FabianSchuetze (Contributor, Author):

Fantastic - thank you so much for your review, @kunaltyagi !

kunaltyagi previously approved these changes May 5, 2021
@kunaltyagi added the changelog: enhancement and module: gpu labels May 5, 2021
@kunaltyagi changed the title Expand device array api → [gpu] Add ability to download contiguous chunk of memory using Device{Array,Memory} May 5, 2021
@kunaltyagi changed the title [gpu] Add ability to download contiguous chunk of memory using Device{Array,Memory} → [gpu] Add ability to download contiguous chunk of memory to host using Device{Array,Memory} May 5, 2021
@kunaltyagi dismissed their stale review May 5, 2021 11:55

Update requested

@kunaltyagi (Member):

I just checked the API. What are your thoughts on adding an equivalent upload alternative to keep things symmetrical?

PS: It's not in our current use-case (the Euclidean clustering PR)

@FabianSchuetze (Contributor, Author):

> I just checked the API. What are your thoughts on adding an equivalent upload alternative to keep things symmetrical?
>
> PS: It's not in our current use-case (the Euclidean clustering PR)

I think this is an intriguing idea! I began experimenting with it to judge how useful this would be (I hope to avoid some memory allocations on the device). I will soon reply to your question with a more educated opinion.

@FabianSchuetze (Contributor, Author):

I think having the upload functionality is a wonderful idea. The STL has the type-agnostic copy function, and I don't see any reason why our users should only be able to download parts of the device array but not upload to parts of it. So we would introduce DeviceArray::upload(host_ptr, device_begin_offset, device_end_offset)? That would be lovely!

We could even think of unifying these two functions with a gpu::copy(source_begin_iterator, source_end_iterator, dest_begin_iterator) API. However, I think this would require us to introduce iterators, because we need to deduce the direction of the copy from them. Alternatively, we could introduce a gpu::copy(source_begin*, source_end*, dest_begin*, direction) API, similar to cudaMemcpy. Do you think such a copy function would be worthwhile? An advantage would be conforming to the STL API, at the cost of some "visual noise", because we would retain the original upload/download functions while introducing the copy function.
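For concreteness, the direction-flag variant could look roughly like this (purely hypothetical; neither gpu::copy nor CopyDirection exists in this PR, and PCL's cudaSafeCall macro is assumed to be in scope):

#include <cstddef>
#include <cuda_runtime.h>

enum class CopyDirection { HostToDevice, DeviceToHost };

// Hypothetical gpu::copy mirroring cudaMemcpy's explicit direction flag.
template <typename T>
void
copy(const T* src_begin, const T* src_end, T* dst_begin, CopyDirection dir)
{
  const std::size_t num_bytes = (src_end - src_begin) * sizeof(T);
  const cudaMemcpyKind kind = (dir == CopyDirection::HostToDevice)
                                  ? cudaMemcpyHostToDevice
                                  : cudaMemcpyDeviceToHost;
  cudaSafeCall(cudaMemcpy(dst_begin, src_begin, num_bytes, kind));
}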

@kunaltyagi (Member) commented May 10, 2021

Iterators and a copy function can be another PR, after discussion (without the direction argument required, since iterators carry that information)

@FabianSchuetze (Contributor, Author):

Thanks for your comments, @kunaltyagi! I have added the upload functionality and think it is a good addition - thanks for the suggestion! I have not factored the functionality in device_memory.cpp into a private function, because this hasn't been done for the other upload/download functions. However, I am happy to do so if you think that is preferred.

kunaltyagi previously approved these changes May 11, 2021
@FabianSchuetze (Contributor, Author):

Thanks a lot for reviewing and approving the changes, @kunaltyagi! It was again great fun to work on this PR. Just to understand how to proceed: should I merge this branch into the branch used for #4677 to address Lars' comments? I am not used to working on such large projects and sometimes get confused when dealing with different branches.

@kunaltyagi (Member):

No. This will get merged first. Then you can rebase your older PR on top of the new master

@mvieth (Member) commented Jun 6, 2021

Sorry for the delay.

  1. Your proposal uses begin and end offsets to specify the upload/download region. An alternative would be the begin offset and the number of elements. That would make the device_end_byte_offset < device_begin_byte_offset check unnecessary, and would work a bit more nicely with cudaMemcpy. Which option do you think would be more practical to use, and which one would fit better for the GPU clustering that you are also working on? (Sorry if there already was a discussion about this and I didn't see it)
  2. The other upload function calls create internally - do you think that should also be the case for the new upload function? If not, but there are still assumptions the upload function makes or preconditions that must be fulfilled, those should appear in the documentation.

@kunaltyagi (Member):

I missed the create call. And the begin_idx and (unsigned) num_elements sounds like a nice interface :)

@FabianSchuetze (Contributor, Author) commented Jun 9, 2021

Thank you, @mvieth and @kunaltyagi, for your comments! These are very helpful indeed - I will think about them and write a more detailed response!

@FabianSchuetze (Contributor, Author):

Thanks for your comments, Markus! These prompted me to reflect and I think I can address both now:

Ad 1:
I think the first question can be split into two parts. One concerns the public API of DeviceArray<T>::download/upload (I use download for concreteness below), while the other relates to checking the validity of the passed arguments.

i) The suggested public API is download(T* host_begin, std::size_t device_begin_offset, std::size_t device_end_offset). I would have preferred to use iterators instead and write download(host_iterator host_begin, device_iterator device_begin, device_iterator device_end), because such an API bears a closer resemblance to the std and thrust containers. As we don't have iterators yet (I'd love to work on them once the other PRs are merged or rejected), I considered the proposed API the second-best option. Nevertheless, I'm totally fine with converting it to download(T* host_begin, std::size_t device_begin_offset, std::size_t num_elements).

ii) I think we could check early in DeviceArray<T>::download if the passed begin and end (or size) arguments are indeed feasible - that is probably a good idea!

Ad 2:
I agree that I should document the difference between the two upload functions better! The new upload function intentionally avoids memory allocations to improve performance.

Does that address your questions, Markus? I hope I understood them correctly!

@mvieth (Member) commented Jun 11, 2021

1.i) Start+end iterators (and, as the next best thing, start+end pointers) are not really used elsewhere in DeviceMemory and DeviceArray, so they would be something of an exception for the new download/upload. Similarity to the stdlib and thrust is IMO not strictly necessary, and the start+count concept would even be similar to copy_n and memcpy. Taking the GPU clustering as an example: when we want to copy sections of data, we know where they start and how many elements (indices) we want to copy (as read from sizes). We do not know the end pointer directly; we have to compute it first, and in the download function we have to do the inverse calculation for cudaMemcpy. For these reasons and the ones above, I think start+count would be slightly more practical than start+end.
1.ii) The DeviceMemory functions should check that too, though, because users can use DeviceMemory directly. And checking it twice is unnecessary, so I think leaving the checks in the DeviceMemory functions is fine.
For more complete checks, we should make sure that the end is not outside of the memory (in addition to the check that start comes before end) for the start+end concept. Alternatively, for the start+count concept, we would only have to check that start+count is less than or equal to the total number of bytes/elements (see the sketch below).
2) If it is properly documented, then it is fine by me
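In code, the start+count check could be as simple as (a sketch, reusing the sizeBytes_ member from the diff below):

// A single comparison suffices: an inverted range cannot even be
// expressed, and overruns past the allocation are rejected.
if (device_begin_byte_offset + num_bytes > sizeBytes_)
  return false; // requested region extends past the end of the buffer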

@FabianSchuetze (Contributor, Author):

Thank you, Markus, for your detailed comments. All this sounds convincing to me, and I'll happily implement that - thanks!

@mvieth (Member) left a comment:

Here are some comments on the new changes, mostly minor stuff

const void* const begin = static_cast<char*>(data_) + device_begin_byte_offset;
const char* const download_end = static_cast<const char*>(begin) + num_bytes;
const char* const array_end = static_cast<char*>(data_) + sizeBytes_;
if (download_end > array_end) {
@mvieth (Member):

Correct me if I'm wrong, but I think (device_begin_byte_offset + num_bytes) > sizeBytes_ should have the same effect, and is IMO a bit more readable (same for upload)

@FabianSchuetze (Contributor, Author):

This is great! I think you are correct, and it's much more readable. I will think about its correctness a bit more and apply the changes. Thanks for the suggestion!

@FabianSchuetze (Contributor, Author):

This was an excellent comment - thank you, Markus! Your remark also made me realize that the "const-ness" of the function arguments was wrong. I hope/think it is correct now. Again, thank you!

if (upload_end > array_end) {
return false;
}
cudaSafeCall(cudaMemcpy(host_ptr_arg, begin, num_bytes, cudaMemcpyHostToDevice));
@mvieth (Member):

I think host_ptr_arg and begin have to be switched (compare other upload function)
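For reference, cudaMemcpy takes the destination pointer first, so the corrected host-to-device call would read:

// cudaMemcpy(dst, src, count, kind): for cudaMemcpyHostToDevice the
// device-side pointer (begin) is the destination and the host pointer
// is the source.
cudaSafeCall(cudaMemcpy(begin, host_ptr_arg, num_bytes, cudaMemcpyHostToDevice));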

@FabianSchuetze (Contributor, Author):

Hm, I have to think about that, but at first glance I would say you are right. Strange that this didn't result in an error when I thought I had tested it. Thanks for pointing this out!

/** \brief Uploads data from CPU memory to device array. Please note
* that this overload never allocates memory in contrast to the
* other upload function.
* Returns true if upload successfull
@mvieth (Member):

Suggested change:
- * Returns true if upload successfull
+ * Returns true if upload successful

Once more below, also consider doxygen's \return tag
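For example, with doxygen's \return tag the comment could read (a sketch of the suggested style):

/** \brief Uploads data from CPU memory to the device array. Note that
 * this overload never allocates memory, in contrast to the other
 * upload function.
 * \return true if the upload was successful
 */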

@FabianSchuetze (Contributor, Author):

Thanks!

@FabianSchuetze (Contributor, Author):

This should be fixed - thanks for the comment!

@FabianSchuetze (Contributor, Author):

Thanks for your comments, Markus and Kunal! They helped me a lot. Your remarks about Doxygen highlighted a disparity between the return types of the existing upload/download functions and the new overloads: the current functions return void, but the new ones return bool. Should we harmonize this? If so, how?

@mvieth (Member) commented Jun 15, 2021

Since the formatting pull request got merged, there are a few conflicts. Could you resolve them please?

@kunaltyagi (Member):

> Should we harmonize this? If so, how?

Make old functions return bool as well :)

@FabianSchuetze force-pushed the expand_device_array_api branch from 5e8ca77 to 518503b, June 16, 2021 17:04
@FabianSchuetze (Contributor, Author):

Thanks for the info, Markus! I rebased the branch, and I think it's ready for a merge.

I initially expected to add a bool return for the old functions, too. However, I don't see good conditions to check (other than maybe inspecting the CUDA error flags, which would be a duplication, as they are noisy anyway). What do you think about that?

@kunaltyagi (Member):

return true? 😆

@kunaltyagi merged commit 22cc0c6 into PointCloudLibrary:master Jun 18, 2021
@kunaltyagi removed the needs: code review label Jun 18, 2021
tin1254 pushed a commit to tin1254/pcl that referenced this pull request Aug 10, 2021
mvieth pushed a commit to mvieth/pcl that referenced this pull request Dec 27, 2021