This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Support future value for initial value for device scan #305

Merged 1 commit into NVIDIA:main on Oct 15, 2021

Conversation

@zasdfgbnm (Contributor) commented May 12, 2021

Prototyping support for passing a device pointer as the initial value for cub device scan.

Example Usage:

cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, cub::FutureValue<float>(ptr), num_items);
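
A fuller sketch of the intended usage, following the usual two-phase CUB calling convention (the helper name and variables here are hypothetical; it assumes d_init is written by earlier work on the same stream):

#include <cub/cub.cuh>

void exclusive_min_scan_with_future_init(const float *d_in,
                                         float       *d_out,
                                         float       *d_init,   // produced on-device by earlier stream-ordered work
                                         int          num_items,
                                         cudaStream_t stream)
{
  cub::Min min_op;

  // Phase 1: query temporary storage requirements.
  void  *d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, min_op,
                                 cub::FutureValue<float>(d_init),
                                 num_items, stream);

  // Phase 2: allocate and run. The initial value is read from d_init inside
  // the kernel, so no host-side synchronization is required here.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, min_op,
                                 cub::FutureValue<float>(d_init),
                                 num_items, stream);
  cudaFree(d_temp_storage);
}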

Tests pass for this PR, but there is no documentation yet. I will add documentation if @allisonvacanti thinks this is a good idea.

Please review NVIDIA/thrust#1519 for the corresponding Thrust change.

@zasdfgbnm zasdfgbnm changed the title [Discussion] Support device pointer for initial value [Discussion] Support device pointer for initial value for device scan May 12, 2021
@alliepiper (Collaborator)

I think this idea is very useful, and it could actually be more general than just InitialValue use cases. I haven't reviewed the implementation too closely yet, but here are some initial impressions/thoughts:

  1. InitValueFromDevicePointer is too narrow of a name. This utility behaves similarly to a simplified std::future, so maybe something like cub::SimpleFutureValue would be more suitable?
  2. The current implementation lacks a lot of the safety features of a traditional future -- in particular, there's no way to check if the value has actually been set. This is why I'm thinking of calling it a SimpleFutureValue instead of just FutureValue.
  3. We /could/ flesh out the API into a more complete future implementation, but I'm not sure that is really valuable here. The use case of "I'm computing a value at this address in a previous stream-ordered kernel" wouldn't benefit from the additional safety checks, since synchronization is handled externally. We'd just need to document this carefully.
  4. We should look through the other device algorithms and see where else this may be useful. We'll need to explicitly document which algorithms support using this utility.

@jrhemstad @brycelelbach @griwes @dkolsen-pgi @dumerrill @canonizer -- Any thoughts on this? Alternative names, possible issues, etc? There's some more context in #294.

@zasdfgbnm (Contributor, Author) commented May 12, 2021

I think it can be even more general than just a stream-ordered future wrapping a device pointer. We could make it a lazy value, something like:

InitialValue *ptr = ....;
cub::LazyValue<InitialValue>([=]() __device__ { return *ptr; });

or even

int *semaphore = ...;
InitialValue *ptr = ....;
cub::LazyValue<InitialValue>(
    [=]() __device__ { 
        while (*semaphore < 1) __nanosleep(8);
        return *ptr;
    }
);

@elstehle (Collaborator) commented May 12, 2021

Would it make sense to just take a single-item input iterator type? I would consider that the most flexible: it would also allow you to wrap it in a TransformInputIterator if needed, e.g., if your yet-to-be-computed value does not map 1:1 to your actual init_value.

As for the changes, the DeviceScan functions could offer two interfaces:

  1. one that takes an init_value_it (for the requested use case),
  2. one that takes an init_value (basically the already existing scan interface).

(2) could then wrap the init_value in a ConstantInputIterator and invoke (1) -- this is similar to what @jrhemstad had mentioned before. From there, only the implementation has to change to actually dereference init_value_it instead of using init_value directly; a rough sketch follows.
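
A rough sketch of that layering (the ExclusiveScanImpl name is hypothetical, not an existing CUB entry point; cub::ConstantInputIterator is the existing fancy iterator):

#include <cub/iterator/constant_input_iterator.cuh>
#include <cuda_runtime.h>

// (1) Hypothetical interface taking a single-item iterator for the initial value.
//     The dispatch/kernel layer would dereference init_value_it on the device.
template <typename InputIt, typename OutputIt, typename ScanOp, typename InitIt>
cudaError_t ExclusiveScanImpl(void *d_temp_storage, size_t &temp_storage_bytes,
                              InputIt d_in, OutputIt d_out,
                              ScanOp scan_op, InitIt init_value_it, int num_items)
{
  // ... existing dispatch logic, reading *init_value_it in the kernel ...
  return cudaSuccess;
}

// (2) Existing-style interface taking an immediate init_value; it simply wraps
//     the value in a ConstantInputIterator and forwards to (1).
template <typename InputIt, typename OutputIt, typename ScanOp, typename InitT>
cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes,
                          InputIt d_in, OutputIt d_out,
                          ScanOp scan_op, InitT init_value, int num_items)
{
  cub::ConstantInputIterator<InitT> init_it(init_value);
  return ExclusiveScanImpl(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, scan_op, init_it, num_items);
}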

@zasdfgbnm (Contributor, Author)

@elstehle Is there a reliable way to test if an object is an iterator? Your suggestion sounds like a good design, but it might be tricky to implement.

Also, I don't have a strong preference on what the method should be called or how it should behave. The design suggestions from @allisonvacanti and @elstehle both look good to me, and I am OK with whichever decision we make. But I do hope we don't spend too much time deciding, so that I can start using this feature early.

@alliepiper (Collaborator)

"I do hope we don't spend too much time making decisions so that I can start using this feature early."

Unfortunately my time is very limited right now, so it may be a little while before I can spend too many cycles on this. The good news is that I'm getting some more help working on CUB, so I should have more time for this sort of stuff in the near future.

I'll keep this on my radar, but feel free to remind me if we don't address it in the next couple of weeks. I do see this as an important use case, but I want to make sure we get it right.

@alliepiper (Collaborator) commented Jun 14, 2021

I was thinking about this some more over the weekend. I think we can get this into 1.14 -- here's what I'd like to see:

  • Name the wrapper utility cub::FutureValue<T>.
    • It just holds a device pointer.
    • No synchronization or safety checks; user must ensure that the result will be ready before use via external synchronization or stream-ordering dependencies.
    • Document these sharp edges well.
  • Consistency: All public Device level entry points that currently take immediate values instead of a pointer should have cub::FutureValue<T> overloads.
  • Low-overhead: We must not instantiate multiple kernels / implementations depending on whether a FutureValue or immediate value is provided.
    • The distinction should only exist in the public API layer.
    • This could be accomplished by having a wrapper that toggles between loading from a FutureValue or a stored value. This would be constructed by the public Device level functions, and passed into the Dispatch layer.

The wrapper in the last point could be as simple as:

template <typename T>
struct InputValue
{
  bool is_future;                 // selects which union member is active
  union
  {
    FutureValue<T> future_value;  // wraps a device pointer, read at kernel time
    T immediate_value;            // plain value captured at the call site
  };
  // Plus API to construct (host/device) and access value (device-only)
};

This would be passed from the Device entry point down to the kernel, where the actual access happens.
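
To flesh out the "Plus API" placeholder, a minimal sketch could look like the following (all member names are hypothetical, and it assumes FutureValue is just a thin wrapper around a device pointer as described above):

// Minimal stand-in for the wrapper under discussion: it only holds a device
// pointer that must be ready (stream-ordered) before the kernel reads it.
template <typename T>
struct FutureValue
{
  T *ptr;  // hypothetical member name
};

// Hypothetical construct/access surface for InputValue.
template <typename T>
struct InputValue
{
  bool is_future;
  union
  {
    FutureValue<T> future_value;
    T immediate_value;
  };

  __host__ __device__ explicit InputValue(T value)
      : is_future(false), immediate_value(value) {}

  __host__ __device__ explicit InputValue(FutureValue<T> value)
      : is_future(true), future_value(value) {}

  // Device-only access: load through the future's pointer, or return the
  // stored immediate value.
  __device__ T operator()() const
  {
    return is_future ? *future_value.ptr : immediate_value;
  }
};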

I think this would address my concerns about this feature. We can start with reduce or scan to work out any issues before porting the other algorithms in case there are surprises.

Does this sound reasonable?

@alliepiper (Collaborator) commented Jun 14, 2021

Just realized I'd neglected @elstehle's suggestion for a more general iterator in the above. That complicates the implementation, but it should be doable.

template <typename T, typename IterT = T*>
struct FutureValue
{
  using value_type    = T;
  using iterator_type = IterT;
  IterT iter;  // single-item iterator (commonly a device pointer) to the not-yet-computed value
};

template <typename T, typename IterT = T*>
struct InputValue
{
  bool is_future;                        // selects which union member is active
  union
  {
    FutureValue<T, IterT> future_value;  // dereferenced at kernel time
    T immediate_value;                   // plain value captured at the call site
  };
};

We would incur extra instantiation penalties for IterT != T*, but that's unavoidable, and the common case of a pointer can still be rolled into the default instantiation.

I'm leaning toward using the InputValue approach over the ConstantInputIterator suggestion since we'd save on instantiations.
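
For illustration, a sketch of how the public overloads might collapse onto a single dispatch instantiation (DispatchScan here is just a placeholder declaration, not the real dispatch signature, and it assumes InputValue constructors like those sketched earlier):

// Placeholder declaration standing in for the real dispatch layer.
template <typename InputIt, typename OutputIt, typename ScanOp, typename InitValueT>
cudaError_t DispatchScan(void *d_temp_storage, size_t &temp_storage_bytes,
                         InputIt d_in, OutputIt d_out,
                         ScanOp scan_op, InitValueT init_value, int num_items);

// Immediate-value overload: wraps the value and forwards.
template <typename InputIt, typename OutputIt, typename ScanOp, typename InitT>
cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes,
                          InputIt d_in, OutputIt d_out,
                          ScanOp scan_op, InitT init_value, int num_items)
{
  return DispatchScan(d_temp_storage, temp_storage_bytes, d_in, d_out,
                      scan_op, InputValue<InitT>(init_value), num_items);
}

// FutureValue overload: wraps the future and forwards. Both overloads pass the
// same InputValue<InitT> type for the default IterT = InitT*, so only one
// kernel is instantiated.
template <typename InputIt, typename OutputIt, typename ScanOp, typename InitT>
cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes,
                          InputIt d_in, OutputIt d_out,
                          ScanOp scan_op, FutureValue<InitT> init_value, int num_items)
{
  return DispatchScan(d_temp_storage, temp_storage_bytes, d_in, d_out,
                      scan_op, InputValue<InitT>(init_value), num_items);
}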

@zasdfgbnm (Contributor, Author)

Hi @allisonvacanti, sorry for the late reply. This design looks good to me; I will ask if I have further questions. What is the timeline for 1.14 and 1.15? I have been busy recently, and I will schedule myself to try to meet the PR deadline for 1.14 or 1.15.

@alliepiper (Collaborator)

@zasdfgbnm The milestones are tracked here: https://github.com/NVIDIA/cub/milestones. Currently 1.14 features need to be merged by 8/18, and 1.15 should be complete by 10/18.

Are you planning to handle the implementation for this?

@zasdfgbnm (Contributor, Author)

@allisonvacanti Yes, I will. But I am not sure whether it will land in 1.14 or 1.15.

@alliepiper (Collaborator)

Sounds good, I'll leave the milestone blank for now. Thanks!

@zasdfgbnm zasdfgbnm changed the title [Discussion] Support device pointer for initial value for device scan Support device pointer for initial value for device scan Sep 7, 2021
@zasdfgbnm zasdfgbnm changed the title Support device pointer for initial value for device scan Support future value for initial value for device scan Sep 7, 2021
@zasdfgbnm (Contributor, Author)

Please see NVIDIA/thrust#1519 for the corresponding Thrust change.

@zasdfgbnm zasdfgbnm marked this pull request as ready for review September 7, 2021 21:09
@zasdfgbnm (Contributor, Author)

@allisonvacanti This should be ready for review.

@alliepiper (Collaborator) left a review comment

This looks great! I pointed out some minor issues in comments, but I'm very happy with how this is turning out :)

@jrhemstad @elstehle @senior-zero What do you think?

Review threads (resolved): cub/device/dispatch/dispatch_scan.cuh, cub/util_type.cuh
@alliepiper alliepiper removed their assignment Sep 30, 2021
@alliepiper alliepiper added this to the 1.15.0 milestone Sep 30, 2021
@zasdfgbnm (Contributor, Author) left a review comment

@allisonvacanti I have resolved all your review comments.

@alliepiper alliepiper self-assigned this Oct 6, 2021
@alliepiper alliepiper added the 'testing: gpuCI in progress' and 'testing: internal ci in progress' labels Oct 6, 2021
@alliepiper (Collaborator)

DVS CL: 30512893
gpuCI: NVIDIA/thrust#1519

@alliepiper alliepiper added the 'P1: should have' and 'helps: pytorch' labels and removed the 'testing: gpuCI in progress' and 'testing: internal ci in progress' labels Oct 14, 2021
@alliepiper (Collaborator)

DVS CL: 30535270
gpuCI: NVIDIA/thrust#1519

@alliepiper alliepiper added the 'testing: gpuCI passed' and 'testing: internal ci passed' labels and removed the 'testing: gpuCI in progress' and 'testing: internal ci in progress' labels Oct 14, 2021
@alliepiper alliepiper merged commit 5712619 into NVIDIA:main Oct 15, 2021
alliepiper pushed a commit to zasdfgbnm/thrust that referenced this pull request Oct 15, 2021
alliepiper added a commit to NVIDIA/thrust that referenced this pull request Oct 15, 2021
@zasdfgbnm zasdfgbnm deleted the device-pointer branch October 15, 2021 16:52
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Nov 30, 2021
…66711)

Summary:
NVIDIA/cub#305 has landed in cub 1.15. This is ready to review and land. This PR contains #66219; please land that PR first before reviewing this one.

Pull Request resolved: #66711

Reviewed By: soulitzer

Differential Revision: D32698306

Pulled By: ngimel

fbshipit-source-id: 4cc6b9b24cefd8932f4d421c6d64ea20ea911f52
Labels
  • helps: pytorch - Helps or needed by PyTorch.
  • P1: should have - Necessary, but not critical.
  • testing: gpuCI passed - Passed gpuCI testing.
  • testing: internal ci passed - Passed internal NVIDIA CI (DVS).
3 participants