
[CUDA] P2P buffer/image memory copy #4401

Closed
wants to merge 26 commits

Conversation

JackAKirk
Contributor

This PR introduces CUDA peer-to-peer memory copies, which are currently only made between peers in different contexts due to this issue: #4381.

Following a discussion in #4332, this PR assumes that at least one backend will require a binary device query (implemented here as piextP2P) that returns a boolean indicating whether that backend's P2P API can be called for the src and dst devices in question. For the CUDA backend this query always returns true, since when a P2P memory copy is topologically impossible the call is lowered to a copy via the host using a single SYCL event, which is more efficient than the two-event copy the SYCL runtime would otherwise perform.

This PR is marked WIP because a 'redundancy check' that verifies whether memory is in the correct context in Commands::processDepEvent has been removed without the cascading (simplifying) changes to Commands having yet been made. The responsibility for checking the memory context appears to lie with Graph_builder::addCG, such that changes made in Graph_builder::addCG could previously have unexpected effects in Commands::processDepEvent. If the 'redundancy check' does have a use case in a different backend that I am not aware of, I can reintroduce it and deal with the undesired effects (an unnecessary host-task connection command) in Commands::processDepEvent that occurred due to the change in Graph_builder::addCG, via an additional call to the binary device query in Commands::processDepEvent. Hopefully the CI tests can give an indication of this.

Corresponding llvm-test-suite PR is ready here: https://github.com/intel/llvm-test-suite/compare/intel...JackAKirk:P2P_cuda_tests?expand=1.

JackAKirk and others added 7 commits August 12, 2021 15:22
…ontexts.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Switched off redundancy check creating conncmd in Command::processDepEvent.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@@ -33,6 +33,7 @@ _PI_API(piextDeviceSelectBinary)
_PI_API(piextGetDeviceFunctionPointer)
_PI_API(piextDeviceGetNativeHandle)
_PI_API(piextDeviceCreateWithNativeHandle)
_PI_API(piextP2P)
Contributor

@smaslov-intel I think that the note on Line 17 is not enough. We probably want some test that would fail in a hard way.

Contributor Author

Yes, I missed the note. I can apply its guidance in a new commit.

Contributor Author

The note is also a little ambiguous to me: should I place new items at the end of the list or at the end of the category to which they fit? I.e. is piextP2P in the right place currently, or should it be placed after _PI_API(piTearDown)? Maybe I only have to move the peer memcpy APIs?

Contributor

@JackAKirk all new values should go after piTearDown. The idea is that those macros are expanded into a large enum, which is used in multiple places, including instrumentation APIs that use the values of that enum as keys identifying which PI call is being executed.
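For illustration, here is a minimal, self-contained sketch of the X-macro pattern being described (PI_API_LIST and the example entries are placeholders, not the actual pi.def contents). It shows why appending after the last entry keeps existing enum values, and therefore instrumentation keys, stable:

// Simplified sketch of how a _PI_API list expands into an enum.
// PI_API_LIST and these example entries are illustrative only.
#define PI_API_LIST(X)                                                         \
  X(piContextCreate)                                                           \
  X(piQueueCreate)                                                             \
  X(piTearDown)                                                                \
  X(piextDevicesSupportP2P) /* new entries are appended after piTearDown */

enum class PiApiKind {
#define _PI_API(api) api,
  PI_API_LIST(_PI_API)
#undef _PI_API
};

// Existing keys keep their numeric values only because the new entry went last.
static_assert(static_cast<int>(PiApiKind::piTearDown) == 2,
              "inserting in the middle would renumber later entries");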

Contributor Author

OK makes sense thanks.

Contributor

piextP2P is not a very informative name. Maybe something like piextDevicesSupportP2P would be more telling?

Contributor Author

Sounds good. I made the suggested change.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
->getDeviceImplPtr()
->getHandleRef(),
&p2p);
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
Contributor

Suggested change
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
if (!(p2p && detail::getImplBackend(Queue) ==

Contributor Author

Thanks, I made the change. Note that I had to include backend_impl.hpp in graph_builder.cpp for this.

Comment on lines 959 to 961
Record->MCurContext->getDevices()[0]
.get_platform()
.get_backend()))
Contributor

Suggested change
Record->MCurContext->getDevices()[0]
.get_platform()
.get_backend()))
detail::getImplBackend(Record->MCurContext)))


auto dst_context = dst_queue->get_context()->get();
auto src_context = src_queue->get_context()->get();

cuCtxEnablePeerAccess(src_context, 0);
Contributor

Is this required for doing P2P copies? My understanding was that this is to enable direct P2P access through CUDA unified-addressing pointers.

Have you tried benchmarking with and without this call? To my knowledge cuMemcpyPeerAsync should work either way, but maybe you're right that it needs cuCtxEnablePeerAccess to enable actual P2P copying. It would be good to have this confirmed, though, as otherwise you might also need to disable it somewhere: the existence of CUDA_ERROR_TOO_MANY_PEERS suggests that peer access is a limited resource.

Again, assuming you need this call, it might be a good idea to guard it with a call to cuDeviceCanAccessPeer.

Contributor Author

JackAKirk Aug 25, 2021

Have you tried benchmarking with and without this call? To my knowledge cuMemcpyPeerAsync should work either way, but maybe you're right that it needs cuCtxEnablePeerAccess to enable actual P2P copying.

Yes, without the call to cuCtxEnablePeerAccess, cuMemcpyPeerAsync performs the copy via the host.

you might also need to disable it somewhere: the existence of CUDA_ERROR_TOO_MANY_PEERS suggests that peer access is a limited resource.

Yes, good idea; I am currently trying to test it with many peers on Cori. I'm not yet sure whether it is possible with the maximum number of connections of four/eight peers in total (the Perlmutter/Cori cases), but it probably does no harm to call cuCtxEnablePeerAccess after cuMemcpyPeerAsync etc. to cover all cases.

Again, assuming you need this call, it might be a good idea to guard it with a call to cuDeviceCanAccessPeer.

I believe that the call to cuDeviceCanAccessPeer is unnecessary for this use case: cuCtxEnablePeerAccess already checks whether cuDeviceCanAccessPeer is false and currently silently returns the corresponding error before cuMemcpyPeerAsync is called anyway (which makes the copy via the host if cuDeviceCanAccessPeer is false).
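For reference, a minimal sketch of the copy path being discussed (illustrative only, not the PR's pi_cuda.cpp code; the function name is made up). It mirrors the pattern quoted above: make the destination context current, attempt to enable peer access to the source context, tolerate a failure, and then issue cuMemcpyPeerAsync, which falls back to staging through the host when no direct link is available:

#include <cuda.h>
#include <cstddef>

// Sketch: enable peer access (best effort), then copy between contexts.
CUresult copyBufferPeerAsync(CUdeviceptr dst, CUcontext dstCtx,
                             CUdeviceptr src, CUcontext srcCtx,
                             size_t bytes, CUstream stream) {
  // The destination context must be current; it is the one being granted
  // access to the source (peer) context's allocations.
  CUresult err = cuCtxSetCurrent(dstCtx);
  if (err != CUDA_SUCCESS)
    return err;

  // Flags must be 0. Errors (e.g. CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED, or
  // no direct link between the devices) are tolerated: the copy below still
  // succeeds, just routed via the host in the no-direct-link case.
  (void)cuCtxEnablePeerAccess(srcCtx, 0);

  return cuMemcpyPeerAsync(dst, dstCtx, src, srcCtx, bytes, stream);
}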

Contributor

Yes without the call to cuCtxEnablePeerAccess cuMemcpyPeerAsync performs the copy via the host.

Good to know! I've blindly used it in the past for this purpose, but got confused by it again because of the documentation wording. Nice to know that I wasn't being silly back then.

Yes good idea, I am currently trying to test it with many peers on Cori. I'm not yet sure whether it is possible with the maximum number of connections of four/eight peers in total (perlmutter/Cori cases), but probably it does no harm to call cuCtxEnablePeerAccess after cuMemcpyPeerAsync etc to cover all cases.

I am not sure of the overhead incurred. Maybe it is fine to keep it enabled for future cases. What is the cost of calling cuCtxEnablePeerAccess on a context that already has it enabled? If it is costly, then maybe the context should be marked as having peer access enabled so the call can be skipped in subsequent P2P operations.

So I believe that the call to cuDeviceCanAccessPeer is unnecessary for this use case since cuCtxEnablePeerAccess checks whether cuDeviceCanAccessPeer is false and is currently silently returning the corresponding error, before cuMemcpyPeerAsync is called anyway (making the copy via the host if cuDeviceCanAccessPeer is false).

As long as it returns fast and in a stable way when direct P2P support isn't possible, I agree that the cuDeviceCanAccessPeer call would be redundant.
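To illustrate the "mark the context as having peer access enabled" idea raised above (a sketch only; PeerAccessCache and enableOnce are made-up names, not part of this PR), a small cache keyed on the (current, peer) context pair would let later P2P operations skip the driver call entirely:

#include <cuda.h>
#include <mutex>
#include <set>
#include <utility>

// Sketch: remember which (current, peer) context pairs already have peer
// access enabled so repeated copies avoid calling cuCtxEnablePeerAccess.
class PeerAccessCache {
  std::mutex Mtx;
  std::set<std::pair<CUcontext, CUcontext>> Enabled;

public:
  // Caller is expected to have made `current` the active CUDA context.
  CUresult enableOnce(CUcontext current, CUcontext peer) {
    std::lock_guard<std::mutex> Lock(Mtx);
    if (Enabled.count({current, peer}))
      return CUDA_SUCCESS; // previously enabled through this cache
    CUresult Err = cuCtxEnablePeerAccess(peer, 0);
    if (Err == CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED)
      Err = CUDA_SUCCESS; // enabled elsewhere; treat as success
    if (Err == CUDA_SUCCESS)
      Enabled.insert({current, peer});
    return Err;
  }
};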

Contributor Author

JackAKirk Aug 25, 2021

'piextP2P is not a very informative name. Maybe something like piextDevicesSupportP2P would be more telling?'

Thanks, I've made the suggested change.

Contributor Author

JackAKirk Aug 26, 2021

I am not sure of the overhead incurred. Maybe it is fine to keep it enabled for future cases. What is the cost of calling cuCtxEnablePeerAccess on a context that already has it enabled? If it is costly, then maybe the context should be marked as having peer access enabled so the call can be skipped in subsequent P2P operations.

I think that it is negligible, but my tests so far haven't been comprehensive. I'm aiming to test on the Cori nodes using multiple GPUs and memory copy routes, so I should have a clearer answer soon. I'll also investigate cuCtxDisablePeerAccess.

Contributor Author

JackAKirk Sep 1, 2021

From this TensorFlow issue it appears that when the NVIDIA docs say 8 peers are supported, this accounts for 8x8 peer-access directions (although there is a mention in that TensorFlow issue that up to 9 devices are actually supported). There is still a problem with connecting multiple devices on Cori, so I have not been able to verify the behaviour for up to 8 devices yet. I think that if n > 8/9 devices are used, the n*n connections will result in CUDA_ERROR_TOO_MANY_PEERS unless each call to cuCtxEnablePeerAccess is matched with a call to cuCtxDisablePeerAccess. I have experimented a little with cuCtxDisablePeerAccess, and the obvious complication is that when the copy is asynchronous, cuCtxDisablePeerAccess needs to wait on the memcpy call.
LLVM OpenMP takes the same approach as used here, whereby cuCtxEnablePeerAccess is called prior to every peer memory copy without any later calls to cuCtxDisablePeerAccess. The difference between the LLVM OpenMP approach and this PR is that if peer access is unavailable they explicitly call memcpyDtoD (which itself does the copy via D2H/H2D). I found that this has no advantage over calling cuMemcpyPeerAsync every time; the performance is identical.
The LLVM OpenMP implementation also outputs a debug message if cuDeviceCanAccessPeer returns false before continuing with the fallback. I'm not sure whether such a debug-mode message is desired here or not.

For two devices I have confirmed that repeated calls to cuCtxEnablePeerAccess are extremely fast: across multiple cuCtxEnablePeerAccess/cuMemcpyPeerAsync calls, the timing of cuCtxEnablePeerAccess was the only one reported in units of nanoseconds! Another option would be to do what TensorFlow appears to do and call cuCtxEnablePeerAccess n*n times for n devices once the contexts are created. A problem with this is that peer access can then be enabled between devices even when the user does not request or use the access.
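As an illustration of the enable/disable pairing mentioned above (a sketch only, not code from this PR; the function name is made up), the disable has to be ordered after the asynchronous copy, for example via an event recorded on the stream:

#include <cuda.h>
#include <cstddef>

// Sketch: pair every cuCtxEnablePeerAccess with a cuCtxDisablePeerAccess,
// waiting for the asynchronous copy to finish before tearing access down.
CUresult copyThenDisablePeer(CUdeviceptr dst, CUcontext dstCtx,
                             CUdeviceptr src, CUcontext srcCtx,
                             size_t bytes, CUstream stream) {
  CUresult err = cuCtxSetCurrent(dstCtx);
  if (err != CUDA_SUCCESS)
    return err;

  (void)cuCtxEnablePeerAccess(srcCtx, 0); // tolerate already-enabled/no-link

  err = cuMemcpyPeerAsync(dst, dstCtx, src, srcCtx, bytes, stream);
  if (err != CUDA_SUCCESS)
    return err;

  // cuCtxDisablePeerAccess must not race with the in-flight copy.
  CUevent done;
  err = cuEventCreate(&done, CU_EVENT_DISABLE_TIMING);
  if (err != CUDA_SUCCESS)
    return err;
  cuEventRecord(done, stream);
  cuEventSynchronize(done);
  cuEventDestroy(done);

  return cuCtxDisablePeerAccess(srcCtx);
}

This keeps the number of simultaneously enabled peers bounded, at the cost of serializing the host thread on each copy.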

Comment on lines 949 to 958
} else if (!Queue->is_host() && !Record->MCurContext->is_host()) {
bool p2p = false;
Queue->getPlugin().call<PiApiKind::piextP2P>(
Queue->getDeviceImplPtr()->getHandleRef(),
findAllocaCmd(Record)
->getQueue()
->getDeviceImplPtr()
->getHandleRef(),
&p2p);
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
Contributor

What happens if we have a context/device from an OpenCL AMD GPU and a context/device from an OpenCL Intel GPU here?
How can the piextP2P function tell whether P2P is supported between those devices?

Contributor Author

As currently implemented, piextP2P does not take account of whether the two devices share the same backend. P2P memory copy across devices that do not share the same backend is currently not supported (enforced by the additional check on line 958).

Contributor

@romanovvlad has a point. If someone were to implement piextP2P for one of the other backends that actually cares about the devices being passed, it is not unreasonable for them to assume that both devices are valid devices from that backend, and I don't think they would have a good way of telling whether or not they are. I think it would make sense to do the backend check before calling piextP2P.

Contributor Author

Yes, that makes sense. I made the change.

Comment on lines 454 to 459
} else {
Plugin.call<PiApiKind::piextEnqueueMemBufferCopyPeer>(
Queue, SrcMem, QueueDst, DstMem, SrcXOffBytes, DstXOffBytes,
SrcAccessRangeWidthBytes, DepEvents.size(), DepEvents.data(),
&OutEvent);
}
Contributor

I do not think this is a good approach. I still think that having the same context as an indicator that the memory can be copied without going through the host (from the SYCL RT point of view) is a good abstraction, and it's not immediately clear why additional ways to bypass it are needed at this level.

Contributor Author

The main reason that the peer-to-peer copy is made across contexts is that only a single device per context is currently supported for the CUDA backend: #4381.

Contributor

I suggest modifying the CUDA plugin/backend to support multiple devices per context.

Contributor

This feature could still be useful for doing explicit memory copies between accessors, as I don't know of any restriction requiring the parent contexts to be the same for both accessors.

Contributor

I guess my point is that, even though it might be a relatively infrequent operation, it is a performance gain for those who insist on making explicit memory copies, even after the CUDA backend supports multiple devices per context.

Contributor

romanovvlad Aug 25, 2021

This feature could still be useful for doing explicit memory copies between accessors,

Hm, that's interesting. Currently, if accessors are from different contexts, we make sure (by moving) that all of them are in the context which the queue is bound to and then do the copy.

What would happen if we enabled P2P copies in the SYCL RT in this case? Which context would the resulting sycl::event be bound to? The queue context, the context where the data for the src accessor is, or the one where the data for the dst accessor is?

If both devices were in the same CUDA plugin context, this copy could be handled by piEnqueueBufferCopy in the CUDA plugin by doing either a regular copy or P2P. So it seems we could do at least the same if we go with the "cuda plugin exposes several devices in the same context" approach.

Contributor

Currently if accessors are from different context we make sure(move) all of them are in the context which the queue is bound to and then do the copy.

I half-assumed it took this path as well. That is interesting!

What would happen if we enable P2P copies in SYCL RT in this case? Which context resulting sycl::event would be bound to? Queue context or a context where data for src accessors is or where data for dst accessor is?

I would think it would be bound to the queue, given that the memory copy operation is "enqueued" on that queue. However, thinking about it again, I am not sure how you would even do an explicit copy between accessors of different contexts, as device accessors are requested through the command group handler.

Instead, I think the place it would be useful is when the runtime migrates accessors between devices (is that what is happening here?), i.e.:

sycl::queue queueA, queueB; // Assume queueA and queueB have disjoint sets of devices.
buffer<int, 1> buff{range<1>{1024}};

queueA.submit([&](handler& cgh) {
  accessor accessorA{buff, cgh, read_write};
  // Do some work here
});

queueB.submit([&](handler& cgh) {
  accessor accessorB{buff, cgh, read_write}; // Should migrate data from queueA parent context to queueB parent context.
  // Do more work here
});

If both devices were in the same cuda plugin context this copy could be handled in the piEnqueueBufferCopy in the cuda plugin by either doing regular copy or P2P. So, it seems we can do at least the same if we go with "cuda plugin exposes several devices in the same context" approach.

I completely agree. The functionality exposed is likely also useful once support for multiple devices per context is implemented in the CUDA backend. However, as in the example mentioned above, it is the user that has (implicitly) ordered a data movement between two different contexts, which the backend cannot change. We should optimize for this case as well, if we can.

JackAKirk added 3 commits August 25, 2021 15:16
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Included header for backend_impl.hpp in graph_builder.cpp so that detail::getImplBackend may be called.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Comment on lines 140 to 142
_PI_API(piextEnqueueMemBufferCopyPeer)
_PI_API(piextEnqueueMemBufferCopyRectPeer)
_PI_API(piextEnqueueMemImageCopyPeer)
Contributor

Why do we need these new APIs? Why wouldn't the regular copy API perform P2P copies transparently under the hood?

Contributor Author

The regular API takes a single pi_queue, whereas the Peer API requires a second queue as an argument (principally so that the second context is known). The regular API is an OpenCL interface, so it cannot be changed. I think that a single API could be used if the new piext*** API were used in the runtime to replace the regular copy API.
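For context, a hypothetical, self-contained sketch of the Peer entry point's shape, inferred only from the call site quoted earlier in this thread (the real PI handle types are opaque, so placeholder typedefs are used here; the actual PR header may differ). The extra pi_queue is what carries the destination context that the regular single-queue signature cannot express:

#include <cstddef>

// Placeholder typedefs standing in for the real, opaque PI handle types.
typedef int pi_result;
typedef unsigned int pi_uint32;
typedef struct _pi_queue *pi_queue;
typedef struct _pi_mem *pi_mem;
typedef struct _pi_event *pi_event;

// Hypothetical declaration, matching the argument order of the quoted call.
pi_result piextEnqueueMemBufferCopyPeer(
    pi_queue src_queue,   // command queue; its context owns src_buffer
    pi_mem src_buffer,
    pi_queue dst_queue,   // the extra queue: identifies the destination context
    pi_mem dst_buffer,
    size_t src_offset, size_t dst_offset, size_t size,
    pi_uint32 num_events_in_wait_list,
    const pi_event *event_wait_list,
    pi_event *event);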

Contributor

the Peer API requires a second queue as an argument (principally so that the second context is known).

The pi_mem src & dst are created with interfaces that take a context, e.g. piextUSMDeviceAlloc or piMemBufferCreate, so backends already know the contexts of both src and dst and can act on that.

Contributor Author

OK, I missed that when I checked pi_mem. There is another reason for providing both queues, illustrated by e.g. this snippet from cuda_piextEnqueueMemBufferCopyRectPeer in pi_cuda.cpp line 4054:

  try {
    ScopedContext active(dst_queue->get_context());
    if (event_wait_list) {
      retErr = cuda_piEnqueueEventsWait(src_queue, num_events_in_wait_list,
                                        event_wait_list, nullptr);
    }

    if (event) {
      retImplEv = std::unique_ptr<_pi_event>(_pi_event::make_native(
          PI_COMMAND_TYPE_MEM_BUFFER_COPY_RECT, dst_queue));
      retImplEv->start();
    }

We wait on events associated with the source queue and return the event associated with the destination queue.
There were problems associated with returning the event associated with the src queue. Since the contexts can be found from pi_mem, I will look at this again and see if there is a way of doing things without the second queue argument.

Contributor Author

JackAKirk Aug 27, 2021

It turns out that it is fine to pass only a single queue in the PI to act as the command queue, and either the src_queue or the dst_queue can take that role.

My current understanding is that all implementation details of implicit peer-to-peer memory copies of buffer memory between devices sharing a SYCL context should be dealt with by the PI, so that the only implicit peer-to-peer copy case the runtime (via memory_manager) needs to handle is the cross-context case.

I will implement the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager, as suggested.

Contributor Author

JackAKirk Sep 1, 2021

I have now made the changes described above, implementing the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager as suggested.

_PI_API(piextEnqueueMemBufferCopyPeer)
_PI_API(piextEnqueueMemBufferCopyRectPeer)
_PI_API(piextEnqueueMemImageCopyPeer)
_PI_API(piextDevicesSupportP2P)
Contributor

Can we re-use the existing piDeviceGetInfo with a new PI_DEVICE_INFO_P2P_DEVICES, which would return all devices that have a P2P connection to this device?

Contributor Author

I don't think there is any CUDA/HIP driver API that takes a device as an argument and returns information on P2P connection status with all other devices. See https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html for the CUDA APIs. I think that any plugin API attempting to return all devices that have a P2P connection to a given device would require a list of all other devices (or equivalent) as an argument.

Contributor

I don't think that there is any cuda/hip driver API that takes a device as an argument and returns information on P2P connection status with all other devices.

I don't think so either, but you should be able to get back to the platform from your PI device and get a list of all devices; then you can check for each device whether access is allowed, or, in the case of the CUDA backend, just return the same list without the corresponding device in it.

As an additional note, maybe in that case it would make sense to make it a bit more granular, e.g. having two new queries, PI_DEVICE_INFO_P2P_READ_DEVICES and PI_DEVICE_INFO_P2P_WRITE_DEVICES (or something similar), that allow for monodirectional interconnects, just in case.
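As a sketch of how such a query could be implemented in the CUDA plugin (illustrative only; p2pReadDevices is a made-up helper, not code from this PR), one can filter the platform's device list with cuDeviceCanAccessPeer:

#include <cuda.h>
#include <vector>

// Sketch: devices whose memory `self` can read directly over P2P.
// A PI_DEVICE_INFO_P2P_WRITE_DEVICES variant would swap the query direction.
std::vector<CUdevice> p2pReadDevices(CUdevice self,
                                     const std::vector<CUdevice> &platformDevs) {
  std::vector<CUdevice> peers;
  for (CUdevice other : platformDevs) {
    if (other == self)
      continue;
    int canAccess = 0;
    // "Can contexts on `self` directly access memory on `other`?"
    if (cuDeviceCanAccessPeer(&canAccess, self, other) == CUDA_SUCCESS &&
        canAccess)
      peers.push_back(other);
  }
  return peers;
}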

Contributor Author

That sounds fine in the case where P2P copies are forbidden between devices that do not share the same platform. If cross-context P2P copies are allowed, wouldn't we need to access a list of all platforms?

Contributor

Good question. I'm not convinced that cross-platform P2P should be allowed, but I suppose a backend can make that decision. Currently the CUDA backend only has one platform so it should be sufficient to just use that one. If at some point the CUDA backend introduces more platforms and P2P should be allowed between them, calling piPlatformsGet or replicating the internal logic should get you all platforms.

Contributor Author

OK I see. Sure I can do it like that.

Contributor Author

I've addressed this now. If the change is acceptable, and provided we keep the cross-context memory copy runtime implementation, then I will implement the piDeviceGetInfo query for the other PIs that don't have it, since all PIs will now call it in graph_builder.

JackAKirk and others added 8 commits August 26, 2021 13:36
… of readable peer devices.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
The srcQueue can be used as the command_queue for peer to peer copy.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@bader bader added the cuda CUDA back-end label Sep 3, 2021
@JackAKirk JackAKirk changed the title [WIP][CUDA] P2P buffer/image memory copy [CUDA] P2P buffer/image memory copy Oct 1, 2021
Comment on lines +516 to +517
// P2P is currently unsupported in level zero
case PI_DEVICE_INFO_P2P_READ_DEVICES:
Contributor

Suggested change
// P2P is currently unsupported in level zero
case PI_DEVICE_INFO_P2P_READ_DEVICES:
case PI_DEVICE_INFO_P2P_READ_DEVICES:
// P2P is currently unsupported in level zero

This makes the association clearer and more consistent with the other cases. Something similar applies to sycl/plugins/hip/pi_hip.cpp and sycl/plugins/level_zero/pi_level_zero.cpp.


for (const auto &dev : device->get_platform()->devices_) {
devs.emplace_back(dev.get());
}
Contributor

After #4571 this will only report the device itself as supporting P2P, as each device has its own platform, right?

If the intention is still to report all devices, since explicit P2P memory copies can be made between them even though some go through the host, I wonder whether PI_DEVICE_INFO_P2P_READ_DEVICES is an ambiguous name, as it could also imply that the device can access the other devices' memory directly, which would require cuDeviceCanAccessPeer checks.

Contributor Author

JackAKirk Oct 20, 2021

After #4571 this will only report itself as supporting P2P as each device has its own platform, right?

Yes. After #4571 this PR no longer acts as a short-term fix for enabling direct P2P copies via the cross-context route before the CUDA PI is adapted to allow multiple CUDA contexts per SYCL context (allowing the preferred intra-context P2P route). #4571 can only be reverted once it becomes possible to do the P2P copy between devices sharing a SYCL context. It still has value in the cross-context optimization you described below: if for whatever reason a user decides to split backend devices/contexts into separate SYCL contexts, then we can also optimize that case. Until the change introduced in #4751 is reverted, this PR will simply do nothing.
I think that the most interesting remaining aspect of this PR is that it has revealed a lot of apparently redundant code which complicates graph_builder and commands via the unused ConnectionCmd return pointer. This appears to be a relic of an earlier, now obsolete implementation: the job of making sure the memory is in the correct context now always appears to be carried out in graph_builder::addCG.

If the intention is still to report all devices as they can do explicit P2P memory copies between them, even though some go through host, I wonder if PI_DEVICE_INFO_P2P_READ_DEVICES is an ambiguous name as it could also imply that the device can access the other devices memory directly, which would require cuDeviceCanAccessPeer checks.

Yes, I can try to pick a better name. 'PI_DEVICE_INFO_P2P_READ_DEVICES' returns a set of devices which is a superset of the set of devices that are capable of direct P2P, since it can include cases that will revert to the device-to-host-to-device route. It is also a query that is not exposed to the user, so I think a slightly longer, uglier, but more descriptive name would be OK, particularly since we don't want to confuse it with future queries: 'PI_DEVICE_INFO_P2P_READ_DEVICES' is probably more appropriate to reserve for a hypothetical future user-facing query that tells the user whether a device can access a peer's memory directly.

@steffenlarsen
Contributor

I still think the right approach would be to modify the cuda plugin

  1. to support multiple devices in the same context.
  2. to report devices that support efficient copy in the same platform.
  3. modify SYCL RT to issue piMigrateMem when contexts match but devices are different.

I am not convinced that these have to be mutually exclusive. This handles cases where the runtime needs to do inter-context copies, no matter the backend. You could have two contexts on the same platform with different devices that can do P2P copies, so this would be an optimization of that case.

But, whatever option we take we need to make sure it works for Level Zero.

Based on the Level Zero specification there should be a sufficient set of P2P operations to support the changes proposed here. From what I can tell, the behavior of the runtime should stay the same for backends that don't implement PI_DEVICE_INFO_P2P_READ_DEVICES.

@JackAKirk
Contributor Author

@romanovvlad
@alexbatashev
@smaslov-intel

We'd like to resolve this PR to get the PI extension for peer-to-peer copies merged, so we propose either merging this PR as-is (some small improvements can be added to address the remaining naming issue) as a temporary solution, or removing the SYCL runtime part and leaving the PI extension unused until we have multi-device SYCL context support.

As discussed, we believe that ConnectionCmd can be safely removed, since no tests fail when Command::processDepEvent always returns nullptr. As we understand it, all possible graphs that require memory moves across contexts are already handled in GraphBuilder::addCG. Although it apparently does nothing in the latest DPC++, the existence of ConnectionCmd can lead to unexpected behavior when changes to GraphBuilder::addCG are made, such as those in this PR. Its removal would considerably declutter the GraphBuilder and Commands classes, making the scheduling model easier to understand and modify.
If you would prefer that we address the ConnectionCmd problem in a separate issue, we can do that.

Contributor

alexbatashev left a comment

lgtm

@romanovvlad
Contributor

We'd like to resolve this PR to get the PI extension for peer-to-peer copies merged, so we propose either merging this PR as-is (some small improvements can be added to address the remaining naming issue) as a temporary solution, or removing the SYCL runtime part and have the PI extension unused until we have multi-device SYCL context support.

Could you please tell me whether there is any value in merging the PI extension (the CUDA plugin implementation) without making the changes in the SYCL RT?
I do not mind if, in the scope of this PR, we merge the plugin changes only and have a separate PR for the changes around ConnectionCmd, and maybe another (as a short-term measure, for discussion) for the change which uses the PI extension in the SYCL RT.

@@ -296,7 +296,8 @@ typedef enum {
PI_DEVICE_INFO_MAX_MEM_BANDWIDTH = 0x10026,
PI_DEVICE_INFO_IMAGE_SRGB = 0x10027,
PI_DEVICE_INFO_ATOMIC_64 = 0x10110,
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111,
PI_DEVICE_INFO_P2P_READ_DEVICES = 0x10112
Contributor

Could you please add a description for this here?

@JackAKirk
Contributor Author

JackAKirk commented Oct 28, 2021

Could you please tell if there is any value in merging PI extension(cuda plugin impl) without doing changes in the SYCL RT? I do not mind if in scope of this PR we merge plugin changes only and have a separate PR for changes around ConnectionCmd and maybe for change which uses PI extension in SYCL RT(as a short term) for discussion.

Merging the PI CUDA part of this PR (minus PI_DEVICE_INFO_P2P_READ_DEVICES) will have no effect on DPC++ without the corresponding changes to the RT that make use of it, but would allow the non-contentious part of this PR to be merged, simplifying any future PR that enables P2P copy usage in the RT.

Contributor

againull left a comment

PI changes look good to me.

@AerialMantis
Contributor

This PR is on hold until multi-device context support is added (#4381), so it has been inactive for a while, but we will resume it once that is done.

@AerialMantis AerialMantis reopened this Jun 1, 2022
@github-actions github-actions bot removed the Stale label Jun 2, 2022
@JackAKirk JackAKirk marked this pull request as draft November 28, 2022 11:45
@JackAKirk JackAKirk closed this May 23, 2023
Labels
cuda CUDA back-end
8 participants