[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P #6104
Conversation
[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P mechanisms to SYCL
Signed-off-by: James Brodman <james.brodman@intel.com>
Thanks for posting this. I have a few questions:
With regard to this, I read through the proposed clarification to the peer access rules in SYCL 2020 Next here: gmlueck/SYCL-Docs@76e1b44; it seems that references to "migratable USM" have been removed. I didn't read it thoroughly, but I didn't find an explicit mention of peer-to-peer USM copies (as opposed to peer-to-peer access). cc @gmlueck: is this intentional?
The simplest way to leverage peer memory in this case is to allow a direct memory copy from device_a to device_b (allowing it only when the devices share a context). The implementation could be very similar to the scrapped implementation of a peer-to-peer copy for devices in different contexts (now deemed not allowed): #4401
Thanks
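For illustration, a minimal sketch of the kind of direct copy being asked about here (hypothetical usage, not part of the proposal: it assumes two P2P-capable GPUs placed in one context and an implementation that accepts a direct device-to-device USM memcpy):

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;

  // Both devices share a single context, as required in the comment above.
  sycl::context ctx{{gpus[0], gpus[1]}};
  sycl::queue q_a{ctx, gpus[0]};
  sycl::queue q_b{ctx, gpus[1]};

  constexpr size_t n = 1024;
  int *src = sycl::malloc_device<int>(n, q_a); // owned by device_a
  int *dst = sycl::malloc_device<int>(n, q_b); // owned by device_b

  q_a.fill(src, 42, n).wait();

  // The direct peer-to-peer copy under discussion: device_a's allocation is
  // copied straight into device_b's allocation without staging through host.
  q_a.memcpy(dst, src, n * sizeof(int)).wait();

  sycl::free(src, ctx);
  sycl::free(dst, ctx);
  return 0;
}
```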
Signed-off-by: James Brodman <james.brodman@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
it indicates that this device may perform atomic operations on USM device memory
allocations located on the `peer` device when peer access is enabled to that
device. If the query returns false, attempting to perform atomic operations on
`peer` memory will have undefined behavior.
The core SYCL spec makes a distinction between "atomic operations" and "concurrent access". The Level Zero driver has separate queries for these two concepts. We need to clarify what `atomics_supported` means. I think it should mean that both atomic operations and concurrent access are supported, which is consistent with the current wording in the SYCL spec for the `usm_atomic_shared_allocations` aspect.
This is an area we are debating in general, though, so we may end up making two different queries for these concepts.
I don't think concurrent access comes into play here - I think it's only (pseudocode) `atomicAdd(ptr, val)`.
Atomic operations only make sense if two things can access the memory concurrently. I guess there are two possible interpretations for what `atomics_supported` means:

1. This device and `peer` device can concurrently access the device USM and do atomic operations on that memory. These operations are atomic w.r.t. code running on the two devices.
2. This device can access device USM from `peer`, but it cannot access it concurrently with `peer`. Atomic operations are supported, but only between work-items running on this device.

I was originally thinking the query meant (1), but your comment makes me think that maybe you intend (2)?
Another thing that we should pay attention to here is the concept of memory scope.

If the device and `peer` can use atomics to concurrently update the same memory, then both devices will need to list `memory_scope::system` in `info::device::atomic_memory_scope_capabilities`. Both devices will need to use atomics with `memory_scope::system` when concurrently accessing the memory to avoid a data race.

If the device is only accessing `peer`'s memory atomically but not concurrently with `peer`, it can use atomics with `memory_scope::device`. If `peer` accesses the same memory concurrently, that's a data race.

I don't know whether it's better to use the atomics & concurrent distinction or to work in some concept of scope, but I agree with Greg that this needs to clarify exactly what is guaranteed.
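To make the distinction concrete, here is a minimal sketch of interpretation (1). This is hypothetical usage (it assumes two P2P-capable GPUs in one context, and that peer access to the allocation's owner has already been enabled in whatever way this extension ends up specifying): both devices update the same device-USM counter, so both must use `memory_scope::system` atomics to avoid a data race.

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;

  sycl::context ctx{{gpus[0], gpus[1]}};
  sycl::queue q0{ctx, gpus[0]};
  sycl::queue q1{ctx, gpus[1]};

  // The counter lives in device USM owned by gpus[0]; gpus[1] is the peer.
  int *counter = sycl::malloc_device<int>(1, q0);
  q0.memset(counter, 0, sizeof(int)).wait();

  auto bump = [=](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>{128}, [=](sycl::id<1>) {
      // System scope: atomicity must hold w.r.t. code running on both devices.
      sycl::atomic_ref<int, sycl::memory_order::relaxed,
                       sycl::memory_scope::system,
                       sycl::access::address_space::global_space>
          ref(*counter);
      ref.fetch_add(1);
    });
  };
  q0.submit(bump);
  q1.submit(bump); // the peer device updates the same allocation concurrently
  q0.wait();
  q1.wait();

  int result = 0;
  q0.memcpy(&result, counter, sizeof(int)).wait(); // expect 256
  sycl::free(counter, ctx);
  return 0;
}
```

Under interpretation (2) the concurrent `q1.submit(bump)` would not be legal, and device-scope atomics on a single device would suffice.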
Would it help to add a new extended memory scope like `memory_scope::ext_oneapi_peer_devices`?
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
I think this looks quite good from the point of view of the cuda backend (apart from the one issue I describe below). I can try a simple implementation to make sure there are no other issues with CUDA implementing this.

There is one point that I'd like to clarify: currently some backends (e.g. level_zero is implemented already) can do direct P2P copies for buffers. I think we should consider whether or not users are required to call this extension's enable peer access function before such a buffer copy takes place.

In the CUDA backend, in order to enable P2P copy (as well as P2P access) of memory from one device to another, it would be necessary to call `cuCtxEnablePeerAccess`. This means that if we want to disentangle the buffer P2P optimization from the USM P2P access feature, then whenever the runtime does a P2P buffer copy, the implementation of the buffer P2P copy optimization in the CUDA backend would need to manage that peer access state itself.

I think this would mean that, in order to ensure that the (buffer) enabled peer-to-peer access doesn't interfere with a user's usage of this USM extension (via the expected result of the extension's queries), the backend would need some awkward extra bookkeeping. Perhaps this is only an issue for the CUDA backend, but this sounds pretty messy already, and I think it could be a good idea to avoid these issues by connecting this USM extension with the expected behavior of any buffer P2P copy optimization, such that the user is required to call the extension's enable peer access function first.

What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?
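For reference, a rough sketch of the driver API calls the CUDA backend would sit on top of here (assumptions: primary contexts are used and error checking is omitted; this is not the plugin's actual code):

```c++
#include <cuda.h>
#include <cstdio>

int main() {
  cuInit(0);
  CUdevice dev_a, dev_b;
  cuDeviceGet(&dev_a, 0);
  cuDeviceGet(&dev_b, 1);

  int can_access = 0;
  cuDeviceCanAccessPeer(&can_access, dev_a, dev_b); // capability query
  if (!can_access) return 0;

  CUcontext ctx_a, ctx_b;
  cuDevicePrimaryCtxRetain(&ctx_a, dev_a);
  cuDevicePrimaryCtxRetain(&ctx_b, dev_b);

  // Peer access is granted per CUcontext: make dev_a's context current and
  // allow it to access allocations owned by dev_b's context.
  cuCtxSetCurrent(ctx_a);
  CUresult res = cuCtxEnablePeerAccess(ctx_b, 0 /* flags must be 0 */);
  std::printf("cuCtxEnablePeerAccess: %d\n", static_cast<int>(res));

  // ... direct loads/stores from dev_a, or cuMemcpyPeer between the devices ...

  // This is the state a buffer P2P copy optimization would have to manage
  // without disturbing whatever the user has enabled via the extension.
  cuCtxDisablePeerAccess(ctx_b);
  cuDevicePrimaryCtxRelease(dev_a);
  cuDevicePrimaryCtxRelease(dev_b);
  return 0;
}
```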
This extension adds support for mechanisms to query and enable support for
direct memory access between peer devices in a system.
In particular, this allows one device to directly access USM Device
allocations for a peer device in the same context.
If two devices with P2P capabilities are placed in the same context, shouldn't this be implicitly enabled?
There has been a lot of discussion about what a context means. I think our current consensus is that it does not provide any guarantee about P2P access between devices. Therefore, placing two devices in the same context does not provide any guarantee that USM memory allocated for one of those devices is accessible from another device in that same context.
See the discussion in internal Khronos issue 563.
Two comments:
Sure, we could deal with that in the cuda backend if we can infer the location of the pointers provided to the copy. We would need to be able to know this so that we could move away from the current cuda implementation. Now, if we did this, note that it would not completely remove interactions between buffer P2P and USM P2P, although the interactions that remain would, I think, only matter for rather unusual use cases, where we need to consider the limit on the number of active peer connections (active meaning that `cuCtxEnablePeerAccess` has been called and returned without error). For CUDA devices there is a maximum number of active peer connections per device (the limit differs between NVSwitch and non-NVSwitch systems).
So if we had a hypothetical system of 9 non-NVSwitch peers, with 8 active connections already held by a single device for buffer P2P, then a user call to enable one more peer connection for that device could not succeed. I guess that there is a similar max peer constraint for level_zero?
I have also done some testing of execution times. For completeness, the other thing to mention for the cuda case is that, unlike the corresponding level_zero case, peer access is granted between cuContexts rather than cuDevices: see the declaration of `cuCtxEnablePeerAccess`. As an aside, this is another small issue motivating why we are interested in the context questions that still don't appear to have clear answers. For the purposes of easily testing the performance, I used queues.
This works because sycl::queue knows about sycl::context. Again, I'm not suggesting the device member function API change at all; rather I'm providing this information to try to represent the cuda peer functionality, and to point out how it interacts with the sycl::device, queue, and context concepts.
Signed-off-by: James Brodman <james.brodman@intel.com>
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a single device, and they are required to be active on a thread for most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of CUDA context to PI context causes issues for device based entry points that still need to call the CUDA APIs. We have workarounds to solve that, but they're a bit hacky, inefficient, and have a lot of edge case issues. The peer to peer interface proposal in intel#6104 is also device based, but enabling peer to peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context. This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches
* Easier and likely more efficient interactions with CUDA runtime applications
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared with the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle" to pass in (`get_native` is still supported fine)
* No opportunity for users to separate their work into different CUDA contexts. It's unclear if there's any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that would need more fine grained CUDA context management.
@jbrodman @gmlueck Here are some small issues I came across:
I will add some corresponding tests to exhibit all this functionality soon and link them to you. I thought I should just let you know what I found from my investigations.
It could be good to include or link to example usage in this doc: something based on/ similar to:
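For instance, a hypothetical usage example along those lines (this is not the example originally linked; it assumes the device member functions proposed here, i.e. `ext_oneapi_can_access_peer`, `ext_oneapi_enable_peer_access`, and `ext_oneapi_disable_peer_access`, plus two P2P-capable GPUs sharing a context):

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;
  sycl::device dev_a = gpus[0], dev_b = gpus[1];

  // Query whether dev_a can access dev_b's device USM allocations.
  if (!dev_a.ext_oneapi_can_access_peer(
          dev_b, sycl::ext::oneapi::peer_access::access_supported))
    return 0;

  // Enable the peer connection before touching dev_b's memory from dev_a.
  dev_a.ext_oneapi_enable_peer_access(dev_b);

  sycl::context ctx{{dev_a, dev_b}};
  sycl::queue q_a{ctx, dev_a}, q_b{ctx, dev_b};

  constexpr size_t n = 1024;
  int *src = sycl::malloc_device<int>(n, q_b); // owned by dev_b
  int *dst = sycl::malloc_device<int>(n, q_a); // owned by dev_a

  q_b.fill(src, 7, n).wait();

  // Peer copy: dev_b's allocation is read directly while copying into dst.
  q_a.memcpy(dst, src, n * sizeof(int)).wait();

  sycl::free(src, ctx);
  sycl::free(dst, ctx);
  dev_a.ext_oneapi_disable_peer_access(dev_b);
  return 0;
}
```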
And then another example for peer access (the above is an example of peer copy). Also, I have apparently discovered that P2P via nvlink with Nvidia hardware is bi-directional.
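A possible shape for that second example (again hypothetical, under the same assumptions as the sketch above): after enabling peer access, a kernel submitted to dev_a dereferences device USM owned by dev_b in place, rather than copying it.

```c++
#include <sycl/sycl.hpp>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;
  sycl::device dev_a = gpus[0], dev_b = gpus[1];
  if (!dev_a.ext_oneapi_can_access_peer(
          dev_b, sycl::ext::oneapi::peer_access::access_supported))
    return 0;
  dev_a.ext_oneapi_enable_peer_access(dev_b);

  sycl::context ctx{{dev_a, dev_b}};
  sycl::queue q_a{ctx, dev_a}, q_b{ctx, dev_b};

  constexpr size_t n = 1024;
  int *peer_data = sycl::malloc_device<int>(n, q_b); // owned by dev_b
  q_b.fill(peer_data, 1, n).wait();

  // Peer access rather than peer copy: work-items running on dev_a read and
  // write dev_b's allocation directly.
  q_a.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    peer_data[i] *= 2;
  }).wait();

  sycl::free(peer_data, ctx);
  return 0;
}
```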
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
Looks good. Just a couple spelling mistakes.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
LGTM
Just to confirm so I can update the implementation:
Signed-off-by: James Brodman <james.brodman@intel.com>
namespace oneapi {
enum class peer_access {
  access_supported,
  access_enabled,
`access_enabled` was removed below, but not here.
Oops!
This implements the current extension doc from #6104 in the CUDA backend only. Fixes #7543. Fixes #6749. --------- Signed-off-by: JackAKirk <jack.kirk@codeplay.com> Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com> Co-authored-by: JackAKirk <chezjakirk@gmail.com> Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>