[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P #6104
Conversation
[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P mechanisms to SYCL
Signed-off-by: James Brodman <james.brodman@intel.com>
Thanks for posting this. I have a few questions:
With regard to this, I read through the proposed clarification to the peer access rules in SYCL 2020 Next here: gmlueck/SYCL-Docs@76e1b44; it seems that references to "migratable USM" have been removed. I didn't read it thoroughly, but I didn't find an explicit mention of peer-to-peer USM copies (as opposed to peer-to-peer access). cc @gmlueck: is this intentional?
The simplest way to leverage peer memory in this case is to allow a direct memory copy from device_a to device_b (allowing it only when the devices share a context). The implementation could be very similar to the scrapped implementation of a peer-to-peer copy for devices in different contexts (now deemed not allowed): #4401
Thanks
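For illustration, a minimal sketch of the kind of direct copy being asked about here (hypothetical usage, not part of the proposal: it assumes two P2P-capable GPUs placed in one context and an implementation that accepts a direct device-to-device USM memcpy):

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;

  // Both devices share a single context, as required in the comment above.
  sycl::context ctx{{gpus[0], gpus[1]}};
  sycl::queue q_a{ctx, gpus[0]};
  sycl::queue q_b{ctx, gpus[1]};

  constexpr size_t n = 1024;
  int *src = sycl::malloc_device<int>(n, q_a); // owned by device_a
  int *dst = sycl::malloc_device<int>(n, q_b); // owned by device_b

  q_a.fill(src, 42, n).wait();

  // The direct peer-to-peer copy under discussion: device_a's allocation is
  // copied straight into device_b's allocation without staging through host.
  q_a.memcpy(dst, src, n * sizeof(int)).wait();

  sycl::free(src, ctx);
  sycl::free(dst, ctx);
  return 0;
}
```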
Signed-off-by: James Brodman <james.brodman@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
it indicates that this device may perform atomic operations on USM device memory
allocations located on the `peer` device when peer access is enabled to that
device. If the query returns false, attempting to perform atomic operations on
`peer` memory will have undefined behavior.
The core SYCL spec makes a distinction between "atomic operations" and "concurrent access". The Level Zero driver has separate queries for these two concepts. We need to clarify what `atomics_supported` means. I think it should mean that both atomic operations and concurrent access are supported, which is consistent with the current wording in the SYCL spec for the `usm_atomic_shared_allocations` aspect.
This is an area we are debating in general, though, so we may end up making two different queries for these concepts.
I don't think concurrent access comes into play here - I think it's only (pseudocode) `atomicAdd(ptr, val)`.
Atomic operations only make sense if two things can access the memory concurrently. I guess there are two possible interpretations for what `atomics_supported` means:

1. This device and `peer` device can concurrently access the device USM and do atomic operations on that memory. These operations are atomic w.r.t. code running on the two devices.
2. This device can access device USM from `peer`, but it cannot access it concurrently with `peer`. Atomic operations are supported, but only between work-items running on this device.

I was originally thinking the query meant (1), but your comment makes me think that maybe you intend (2)?
Another thing that we should pay attention to here is the concept of memory scope.

If the device and `peer` can use atomics to concurrently update the same memory, then both devices will need to list `memory_scope::system` in `info::device::atomic_memory_scope_capabilities`. Both devices will need to use atomics with `memory_scope::system` when concurrently accessing the memory to avoid a data race.

If the device is only accessing `peer`'s memory atomically but not concurrently with `peer`, it can use atomics with `memory_scope::device`. If `peer` accesses the same memory concurrently, that's a data race.

I don't know whether it's better to use the atomics & concurrent distinction or to work in some concept of scope, but I agree with Greg that this needs to clarify exactly what is guaranteed.
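To make the distinction concrete, here is a minimal sketch of interpretation (1). This is hypothetical usage (it assumes two P2P-capable GPUs in one context, and that peer access to the allocation's owner has already been enabled in whatever way this extension ends up specifying): both devices update the same device-USM counter, so both must use `memory_scope::system` atomics to avoid a data race.

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;

  sycl::context ctx{{gpus[0], gpus[1]}};
  sycl::queue q0{ctx, gpus[0]};
  sycl::queue q1{ctx, gpus[1]};

  // The counter lives in device USM owned by gpus[0]; gpus[1] is the peer.
  int *counter = sycl::malloc_device<int>(1, q0);
  q0.memset(counter, 0, sizeof(int)).wait();

  auto bump = [=](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>{128}, [=](sycl::id<1>) {
      // System scope: atomicity must hold w.r.t. code running on both devices.
      sycl::atomic_ref<int, sycl::memory_order::relaxed,
                       sycl::memory_scope::system,
                       sycl::access::address_space::global_space>
          ref(*counter);
      ref.fetch_add(1);
    });
  };
  q0.submit(bump);
  q1.submit(bump); // the peer device updates the same allocation concurrently
  q0.wait();
  q1.wait();

  int result = 0;
  q0.memcpy(&result, counter, sizeof(int)).wait(); // expect 256
  sycl::free(counter, ctx);
  return 0;
}
```

Under interpretation (2) the concurrent `q1.submit(bump)` would not be legal, and device-scope atomics on a single device would suffice.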
Would it help to add a new extended memory scope like `memory_scope::ext_oneapi_peer_devices`?
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
I think this looks quite good from the point of view of the cuda backend (apart from the one issue I describe below). I can try a simple implementation to make sure there are no other issues with CUDA implementing this.

There is one point that I'd like to clarify: currently some backends (e.g. level_zero is implemented already) can do direct P2P copies for buffers. I think we should consider whether or not users are required to call this extension's enable peer access function before such a buffer copy takes place.

In the CUDA backend, in order to enable P2P copy (as well as P2P access) of memory from one device to another, it would be necessary to call `cuCtxEnablePeerAccess`. This means that if we want to disentangle the buffer P2P optimization from the USM P2P access feature, then whenever the runtime does a P2P buffer copy, the implementation of the buffer P2P copy optimization in the CUDA backend would need to manage that peer access state itself.

I think this would mean that, in order to ensure that the (buffer) enabled peer-to-peer access doesn't interfere with a user's usage of this USM extension (via the expected result of the extension's queries), the backend would need some awkward extra bookkeeping. Perhaps this is only an issue for the CUDA backend, but this sounds pretty messy already, and I think it could be a good idea to avoid these issues by connecting this USM extension with the expected behavior of any buffer P2P copy optimization, such that the user is required to call the extension's enable peer access function first.

What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?
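For reference, a rough sketch of the driver API calls the CUDA backend would sit on top of here (assumptions: primary contexts are used and error checking is omitted; this is not the plugin's actual code):

```c++
#include <cuda.h>
#include <cstdio>

int main() {
  cuInit(0);
  CUdevice dev_a, dev_b;
  cuDeviceGet(&dev_a, 0);
  cuDeviceGet(&dev_b, 1);

  int can_access = 0;
  cuDeviceCanAccessPeer(&can_access, dev_a, dev_b); // capability query
  if (!can_access) return 0;

  CUcontext ctx_a, ctx_b;
  cuDevicePrimaryCtxRetain(&ctx_a, dev_a);
  cuDevicePrimaryCtxRetain(&ctx_b, dev_b);

  // Peer access is granted per CUcontext: make dev_a's context current and
  // allow it to access allocations owned by dev_b's context.
  cuCtxSetCurrent(ctx_a);
  CUresult res = cuCtxEnablePeerAccess(ctx_b, 0 /* flags must be 0 */);
  std::printf("cuCtxEnablePeerAccess: %d\n", static_cast<int>(res));

  // ... direct loads/stores from dev_a, or cuMemcpyPeer between the devices ...

  // This is the state a buffer P2P copy optimization would have to manage
  // without disturbing whatever the user has enabled via the extension.
  cuCtxDisablePeerAccess(ctx_b);
  cuDevicePrimaryCtxRelease(dev_a);
  cuDevicePrimaryCtxRelease(dev_b);
  return 0;
}
```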
This extension adds support for mechanisms to query and enable support for
direct memory access between peer devices in a system.
In particular, this allows one device to directly access USM Device
allocations for a peer device in the same context.
If two devices with P2P capabilities are placed in the same context, shouldn't this be implicitly enabled?
There has been a lot of discussion about what a context means. I think our current consensus is that it does not provide any guarantee about P2P access between devices. Therefore, placing two devices in the same context does not provide any guarantee that USM memory allocated for one of those devices is accessible from another device in that same context.
See the discussion in internal Khronos issue 563.
Two comments:
Sure, we could deal with that in the cuda backend if we can infer the location of the pointers provided to the copy. We would need to be able to know this so that we could move away from the current cuda implementation. Now, if we did this, note that it would not completely remove interactions between buffer P2P and USM P2P, although the interactions that remain would, I think, only matter for rather unusual use cases, where we need to consider the limit on the number of active peer connections (active meaning that `cuCtxEnablePeerAccess` has been called and returned without error). For CUDA devices there is a maximum number of active peer connections per device (the limit differs between NVSwitch and non-NVSwitch systems).
So if we had a hypothetical system of 9 non-NVSwitch peers, with 8 active connections already held by a single device for buffer P2P, then a user call to enable one more peer connection for that device could not succeed. I guess that there is a similar max peer constraint for level_zero?
I have also done some testing of execution times. For completeness, the other thing to mention for the cuda case is that, unlike the corresponding level_zero case, peer access is granted between cuContexts rather than cuDevices: see the declaration of `cuCtxEnablePeerAccess`. As an aside, this is another small issue motivating why we are interested in the context questions that still don't appear to have clear answers. For the purposes of easily testing the performance, I used queues.
This works because sycl::queue knows about sycl::context. Again, I'm not suggesting the device member function API change at all; rather I'm providing this information to try to represent the cuda peer functionality, and to point out how it interacts with the sycl::device, queue, and context concepts.
Signed-off-by: James Brodman <james.brodman@intel.com>
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a single device, and they are required to be active on a thread for most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of CUDA context to PI context causes issues for device based entry points that still need to call the CUDA APIs. We have workarounds to solve that, but they're a bit hacky, inefficient, and have a lot of edge case issues. The peer to peer interface proposal in intel#6104 is also device based, but enabling peer to peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context. This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches
* Easier and likely more efficient interactions with CUDA runtime applications
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared with the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle" to pass in (`get_native` is still supported fine)
* No opportunity for users to separate their work into different CUDA contexts. It's unclear if there's any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that would need more fine grained CUDA context management.
@jbrodman @gmlueck Here are some small issues I came across:
I will add some corresponding tests to exhibit all this functionality soon and link them to you. I thought I should just let you know what I found from my investigations.
It could be good to include or link to example usage in this doc: something based on/ similar to:
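For instance, a hypothetical usage example along those lines (this is not the example originally linked; it assumes the device member functions proposed here, i.e. `ext_oneapi_can_access_peer`, `ext_oneapi_enable_peer_access`, and `ext_oneapi_disable_peer_access`, plus two P2P-capable GPUs sharing a context):

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;
  sycl::device dev_a = gpus[0], dev_b = gpus[1];

  // Query whether dev_a can access dev_b's device USM allocations.
  if (!dev_a.ext_oneapi_can_access_peer(
          dev_b, sycl::ext::oneapi::peer_access::access_supported))
    return 0;

  // Enable the peer connection before touching dev_b's memory from dev_a.
  dev_a.ext_oneapi_enable_peer_access(dev_b);

  sycl::context ctx{{dev_a, dev_b}};
  sycl::queue q_a{ctx, dev_a}, q_b{ctx, dev_b};

  constexpr size_t n = 1024;
  int *src = sycl::malloc_device<int>(n, q_b); // owned by dev_b
  int *dst = sycl::malloc_device<int>(n, q_a); // owned by dev_a

  q_b.fill(src, 7, n).wait();

  // Peer copy: dev_b's allocation is read directly while copying into dst.
  q_a.memcpy(dst, src, n * sizeof(int)).wait();

  sycl::free(src, ctx);
  sycl::free(dst, ctx);
  dev_a.ext_oneapi_disable_peer_access(dev_b);
  return 0;
}
```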
And then another example for peer access (the above is an example of peer copy). Also, I have apparently discovered that P2P via nvlink with Nvidia hardware is bi-directional.
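A possible shape for that second example (again hypothetical, under the same assumptions as the sketch above): after enabling peer access, a kernel submitted to dev_a dereferences device USM owned by dev_b in place, rather than copying it.

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;
  sycl::device dev_a = gpus[0], dev_b = gpus[1];
  if (!dev_a.ext_oneapi_can_access_peer(
          dev_b, sycl::ext::oneapi::peer_access::access_supported))
    return 0;
  dev_a.ext_oneapi_enable_peer_access(dev_b);

  sycl::context ctx{{dev_a, dev_b}};
  sycl::queue q_a{ctx, dev_a}, q_b{ctx, dev_b};

  constexpr size_t n = 1024;
  int *peer_data = sycl::malloc_device<int>(n, q_b); // owned by dev_b
  q_b.fill(peer_data, 1, n).wait();

  // Peer access rather than peer copy: work-items running on dev_a read and
  // write dev_b's allocation directly.
  q_a.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    peer_data[i] *= 2;
  }).wait();

  sycl::free(peer_data, ctx);
  return 0;
}
```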
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
Looks good. Just a couple spelling mistakes.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
LGTM
Just to confirm so I can update the implementation:
Signed-off-by: James Brodman <james.brodman@intel.com>
namespace oneapi {
enum class peer_access {
  access_supported,
  access_enabled,
`access_enabled` was removed below, but not here.
Oops!
This implements the current extension doc from #6104 in the CUDA backend only. Fixes #7543. Fixes #6749. --------- Signed-off-by: JackAKirk <jack.kirk@codeplay.com> Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com> Co-authored-by: JackAKirk <chezjakirk@gmail.com> Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>