
[CUDA] P2P buffer/image memory copy #4401

Closed
wants to merge 26 commits

Conversation

JackAKirk
Contributor

This PR introduces CUDA peer-to-peer memory copies, which are currently only made between peers in different contexts due to this issue: #4381.

Following a discussion in #4332, this PR assumes that at least one backend will require a binary device query (implemented here as piextP2P) that returns a boolean indicating whether that backend's P2P API can be called for the src and dst devices in question. For the CUDA backend this query always returns true, since when a P2P memory copy is topologically impossible the call is lowered to a copy via the host using a single SYCL event, which is more efficient than the two-event copy the SYCL runtime would otherwise perform.

This PR is marked WIP because a 'redundancy check' that verifies whether memory is in the correct context in Commands::processDepEvent has been removed without the cascading (simplifying) changes to Commands having yet been made. The responsibility for checking the memory context appears to lie with Graph_builder::addCG, such that changes made in Graph_builder::addCG could previously have unexpected effects in Commands::processDepEvent. If the 'redundancy check' does have a use case in a different backend that I am not aware of, I can reintroduce it and deal with the undesired effects (an unnecessary host-task connection command) in Commands::processDepEvent that occurred due to the change in Graph_builder::addCG, via an additional call to the binary device query in Commands::processDepEvent. Hopefully the CI tests can give an indication of this.

Corresponding llvm-test-suite PR is ready here: https://github.com/intel/llvm-test-suite/compare/intel...JackAKirk:P2P_cuda_tests?expand=1.

JackAKirk and others added 7 commits August 12, 2021 15:22
…ontexts.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Switched off redundancy check creating conncmd in Command::processDepEvent.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@@ -33,6 +33,7 @@ _PI_API(piextDeviceSelectBinary)
_PI_API(piextGetDeviceFunctionPointer)
_PI_API(piextDeviceGetNativeHandle)
_PI_API(piextDeviceCreateWithNativeHandle)
_PI_API(piextP2P)
Contributor

@smaslov-intel I think that the note on Line 17 is not enough. We probably want some test that would fail in a hard way.

Contributor Author

Yes, I missed the note. I can apply its guidance in a new commit.

Contributor Author

The note is also a little ambiguous to me: should I place new items at the end of the list or at the end of the category to which they fit? I.e. is piextP2P in the right place currently, or should it be placed after _PI_API(piTearDown)? Maybe I only have to move the peer memcpy APIs?

Contributor

@JackAKirk all new values should go after piTearDown. The idea is that those macros are expanded into a large enum, which is used in multiple places, including instrumentation APIs that use the values of that enum as keys identifying which PI call is being executed.
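For illustration, here is a minimal, self-contained sketch of the X-macro pattern being described (PI_API_LIST and the example entries are placeholders, not the actual pi.def contents). It shows why appending after the last entry keeps existing enum values, and therefore instrumentation keys, stable:

// Simplified sketch of how a _PI_API list expands into an enum.
// PI_API_LIST and these example entries are illustrative only.
#define PI_API_LIST(X)                                                         \
  X(piContextCreate)                                                           \
  X(piQueueCreate)                                                             \
  X(piTearDown)                                                                \
  X(piextDevicesSupportP2P) /* new entries are appended after piTearDown */

enum class PiApiKind {
#define _PI_API(api) api,
  PI_API_LIST(_PI_API)
#undef _PI_API
};

// Existing keys keep their numeric values only because the new entry went last.
static_assert(static_cast<int>(PiApiKind::piTearDown) == 2,
              "inserting in the middle would renumber later entries");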

Contributor Author

OK makes sense thanks.

Contributor

piextP2P is not a very informative name. Maybe something like piextDevicesSupportP2P would be more telling?

Contributor Author

Sounds good. I made the suggested change.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
->getDeviceImplPtr()
->getHandleRef(),
&p2p);
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
Contributor

Suggested change
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
if (!(p2p && detail::getImplBackend(Queue) ==

Contributor Author

Thanks, I made the change. Note that I had to include backend_impl.hpp in graph_builder.cpp for this.

Comment on lines 959 to 961
Record->MCurContext->getDevices()[0]
.get_platform()
.get_backend()))
Contributor

Suggested change
Record->MCurContext->getDevices()[0]
.get_platform()
.get_backend()))
detail::getImplBackend(Record->MCurContext)))


auto dst_context = dst_queue->get_context()->get();
auto src_context = src_queue->get_context()->get();

cuCtxEnablePeerAccess(src_context, 0);
Contributor

Is this required for doing P2P copies? My understanding was that this is to enable direct P2P access through CUDA unified-addressing pointers.

Have you tried benchmarking with and without this call? To my knowledge cuMemcpyPeerAsync should work either way, but maybe you're right that it needs cuCtxEnablePeerAccess to enable actual P2P copying. It would be good to have this confirmed, though, as otherwise you might also need to disable it somewhere: the existence of CUDA_ERROR_TOO_MANY_PEERS suggests that peer access is a limited resource.

Again, assuming you need this call, it might be a good idea to guard it with a call to cuDeviceCanAccessPeer.

Contributor Author

JackAKirk Aug 25, 2021

Have you tried benchmarking with and without this call? To my knowledge cuMemcpyPeerAsync should work either way, but maybe you're right that it needs cuCtxEnablePeerAccess to enable actual P2P copying.

Yes, without the call to cuCtxEnablePeerAccess, cuMemcpyPeerAsync performs the copy via the host.

you might also need to disable it somewhere: the existence of CUDA_ERROR_TOO_MANY_PEERS suggests that peer access is a limited resource.

Yes, good idea; I am currently trying to test it with many peers on Cori. I'm not yet sure whether it is possible with the maximum number of connections of four/eight peers in total (the Perlmutter/Cori cases), but it probably does no harm to call cuCtxEnablePeerAccess after cuMemcpyPeerAsync etc. to cover all cases.

Again, assuming you need this call, it might be a good idea to guard it with a call to cuDeviceCanAccessPeer.

I believe that the call to cuDeviceCanAccessPeer is unnecessary for this use case: cuCtxEnablePeerAccess already checks whether cuDeviceCanAccessPeer is false and currently silently returns the corresponding error before cuMemcpyPeerAsync is called anyway (which makes the copy via the host if cuDeviceCanAccessPeer is false).
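For reference, a minimal sketch of the copy path being discussed (illustrative only, not the PR's pi_cuda.cpp code; the function name is made up). It mirrors the pattern quoted above: make the destination context current, attempt to enable peer access to the source context, tolerate a failure, and then issue cuMemcpyPeerAsync, which falls back to staging through the host when no direct link is available:

#include <cuda.h>
#include <cstddef>

// Sketch: enable peer access (best effort), then copy between contexts.
CUresult copyBufferPeerAsync(CUdeviceptr dst, CUcontext dstCtx,
                             CUdeviceptr src, CUcontext srcCtx,
                             size_t bytes, CUstream stream) {
  // The destination context must be current; it is the one being granted
  // access to the source (peer) context's allocations.
  CUresult err = cuCtxSetCurrent(dstCtx);
  if (err != CUDA_SUCCESS)
    return err;

  // Flags must be 0. Errors (e.g. CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED, or
  // no direct link between the devices) are tolerated: the copy below still
  // succeeds, just routed via the host in the no-direct-link case.
  (void)cuCtxEnablePeerAccess(srcCtx, 0);

  return cuMemcpyPeerAsync(dst, dstCtx, src, srcCtx, bytes, stream);
}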

Contributor

Yes without the call to cuCtxEnablePeerAccess cuMemcpyPeerAsync performs the copy via the host.

Good to know! I've blindly used it in the past for this purpose, but got confused by it again because of the documentation wording. Nice to know that I wasn't being silly back then.

Yes good idea, I am currently trying to test it with many peers on Cori. I'm not yet sure whether it is possible with the maximum number of connections of four/eight peers in total (perlmutter/Cori cases), but probably it does no harm to call cuCtxEnablePeerAccess after cuMemcpyPeerAsync etc to cover all cases.

I am not sure of the overhead incurred. Maybe it is fine to keep it enabled for future cases. What is the cost of calling cuCtxEnablePeerAccess on a context that already has it enabled? If it is costly, then maybe the context should be marked as having peer access enabled so the call can be skipped in subsequent P2P operations.

So I believe that the call to cuDeviceCanAccessPeer is unnecessary for this use case since cuCtxEnablePeerAccess checks whether cuDeviceCanAccessPeer is false and is currently silently returning the corresponding error, before cuMemcpyPeerAsync is called anyway (making the copy via the host if cuDeviceCanAccessPeer is false).

As long as it returns fast and in a stable way when direct P2P support isn't possible, I agree that the cuDeviceCanAccessPeer call would be redundant.
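To illustrate the "mark the context as having peer access enabled" idea raised above (a sketch only; PeerAccessCache and enableOnce are made-up names, not part of this PR), a small cache keyed on the (current, peer) context pair would let later P2P operations skip the driver call entirely:

#include <cuda.h>
#include <mutex>
#include <set>
#include <utility>

// Sketch: remember which (current, peer) context pairs already have peer
// access enabled so repeated copies avoid calling cuCtxEnablePeerAccess.
class PeerAccessCache {
  std::mutex Mtx;
  std::set<std::pair<CUcontext, CUcontext>> Enabled;

public:
  // Caller is expected to have made `current` the active CUDA context.
  CUresult enableOnce(CUcontext current, CUcontext peer) {
    std::lock_guard<std::mutex> Lock(Mtx);
    if (Enabled.count({current, peer}))
      return CUDA_SUCCESS; // previously enabled through this cache
    CUresult Err = cuCtxEnablePeerAccess(peer, 0);
    if (Err == CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED)
      Err = CUDA_SUCCESS; // enabled elsewhere; treat as success
    if (Err == CUDA_SUCCESS)
      Enabled.insert({current, peer});
    return Err;
  }
};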

Contributor Author

JackAKirk Aug 25, 2021

'piextP2P is not a very informative name. Maybe something like piextDevicesSupportP2P would be more telling?'

Thanks, I've made the suggested change.

Contributor Author

JackAKirk Aug 26, 2021

I am not sure of the overhead incurred. Maybe it is fine to keep it enabled for future cases. What is the cost of calling cuCtxEnablePeerAccess on a context that already has it enabled? If it is costly, then maybe the context should be marked as having peer access enabled so the call can be skipped in subsequent P2P operations.

I think that it is negligible, but my tests so far haven't been comprehensive. I'm aiming to test on the Cori nodes using multiple GPUs and memory copy routes, so I should have a clearer answer soon. I'll also investigate cuCtxDisablePeerAccess.

Contributor Author

JackAKirk Sep 1, 2021

From this TensorFlow issue it appears that when the NVIDIA docs say 8 peers are supported, this accounts for 8x8 peer-access directions (although there is a mention in that TensorFlow issue that up to 9 devices are actually supported). There is still a problem with connecting multiple devices on Cori, so I have not been able to verify the behaviour for up to 8 devices yet. I think that if n > 8/9 devices are used, the n*n connections will result in CUDA_ERROR_TOO_MANY_PEERS unless each call to cuCtxEnablePeerAccess is matched with a call to cuCtxDisablePeerAccess. I have experimented a little with cuCtxDisablePeerAccess, and the obvious complication is that when the copy is asynchronous, cuCtxDisablePeerAccess needs to wait on the memcpy call.
LLVM OpenMP takes the same approach as used here, whereby cuCtxEnablePeerAccess is called prior to every peer memory copy without any later calls to cuCtxDisablePeerAccess. The difference between the LLVM OpenMP approach and this PR is that if peer access is unavailable they explicitly call memcpyDtoD (which itself does the copy via D2H/H2D). I found that this has no advantage over calling cuMemcpyPeerAsync every time; the performance is identical.
The LLVM OpenMP implementation also outputs a debug message if cuDeviceCanAccessPeer returns false before continuing with the fallback. I'm not sure whether such a debug-mode message is desired here or not.

For two devices I have confirmed that repeated calls to cuCtxEnablePeerAccess are extremely fast: across multiple cuCtxEnablePeerAccess/cuMemcpyPeerAsync calls, the timing of cuCtxEnablePeerAccess was the only one reported in units of nanoseconds! Another option would be to do what TensorFlow appears to do and call cuCtxEnablePeerAccess n*n times for n devices once the contexts are created. A problem with this is that peer access can then be enabled between devices even when the user does not request or use the access.
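As an illustration of the enable/disable pairing mentioned above (a sketch only, not code from this PR; the function name is made up), the disable has to be ordered after the asynchronous copy, for example via an event recorded on the stream:

#include <cuda.h>
#include <cstddef>

// Sketch: pair every cuCtxEnablePeerAccess with a cuCtxDisablePeerAccess,
// waiting for the asynchronous copy to finish before tearing access down.
CUresult copyThenDisablePeer(CUdeviceptr dst, CUcontext dstCtx,
                             CUdeviceptr src, CUcontext srcCtx,
                             size_t bytes, CUstream stream) {
  CUresult err = cuCtxSetCurrent(dstCtx);
  if (err != CUDA_SUCCESS)
    return err;

  (void)cuCtxEnablePeerAccess(srcCtx, 0); // tolerate already-enabled/no-link

  err = cuMemcpyPeerAsync(dst, dstCtx, src, srcCtx, bytes, stream);
  if (err != CUDA_SUCCESS)
    return err;

  // cuCtxDisablePeerAccess must not race with the in-flight copy.
  CUevent done;
  err = cuEventCreate(&done, CU_EVENT_DISABLE_TIMING);
  if (err != CUDA_SUCCESS)
    return err;
  cuEventRecord(done, stream);
  cuEventSynchronize(done);
  cuEventDestroy(done);

  return cuCtxDisablePeerAccess(srcCtx);
}

This keeps the number of simultaneously enabled peers bounded, at the cost of serializing the host thread on each copy.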

Comment on lines 949 to 958
} else if (!Queue->is_host() && !Record->MCurContext->is_host()) {
bool p2p = false;
Queue->getPlugin().call<PiApiKind::piextP2P>(
Queue->getDeviceImplPtr()->getHandleRef(),
findAllocaCmd(Record)
->getQueue()
->getDeviceImplPtr()
->getHandleRef(),
&p2p);
if (!(p2p && Queue->get_device().get_platform().get_backend() ==
Contributor

What happens if we have a context/device from an OpenCL AMD GPU and a context/device from an OpenCL Intel GPU here?
How can the piextP2P function tell whether P2P is supported between those devices?

Contributor Author

As currently implemented, piextP2P does not take account of whether the two devices share the same backend. P2P memory copy across devices that do not share the same backend is currently not supported (enforced by the additional check on line 958).

Contributor

@romanovvlad has a point. If someone were to implement piextP2P for one of the other backends that actually cares about the devices being passed, it is not unreasonable for them to assume that both devices are valid devices from that backend, and I don't think they would have a good way of telling whether or not they are. I think it would make sense to do the backend check before calling piextP2P.

Contributor Author

Yes, that makes sense. I made the change.

Comment on lines 454 to 459
} else {
Plugin.call<PiApiKind::piextEnqueueMemBufferCopyPeer>(
Queue, SrcMem, QueueDst, DstMem, SrcXOffBytes, DstXOffBytes,
SrcAccessRangeWidthBytes, DepEvents.size(), DepEvents.data(),
&OutEvent);
}
Contributor

I do not think this is a good approach. I still think that having the same context as an indicator that the memory can be copied without going through the host (from the SYCL RT point of view) is a good abstraction, and it's not immediately clear why additional ways to bypass it are needed at this level.

Contributor Author

The main reason that the peer-to-peer copy is made across contexts is that only a single device per context is currently supported for the CUDA backend: #4381.

Contributor

I suggest modifying the CUDA plugin/backend to support multiple devices per context.

Contributor

This feature could still be useful for doing explicit memory copies between accessors, as I don't know of any restriction requiring the parent contexts to be the same for both accessors.

Contributor

I guess my point is that, even though it might be a relatively infrequent operation, it is a performance gain for those who insist on making explicit memory copies, even after the CUDA backend supports multiple devices per context.

Contributor

romanovvlad Aug 25, 2021

This feature could still be useful for doing explicit memory copies between accessors,

Hm, that's interesting. Currently, if accessors are from different contexts, we make sure (by moving) that all of them are in the context which the queue is bound to and then do the copy.

What would happen if we enabled P2P copies in the SYCL RT in this case? Which context would the resulting sycl::event be bound to? The queue context, the context where the data for the src accessor is, or the one where the data for the dst accessor is?

If both devices were in the same CUDA plugin context, this copy could be handled by piEnqueueBufferCopy in the CUDA plugin by doing either a regular copy or P2P. So it seems we could do at least the same if we go with the "cuda plugin exposes several devices in the same context" approach.

Contributor

Currently if accessors are from different context we make sure(move) all of them are in the context which the queue is bound to and then do the copy.

I half-assumed it took this path as well. That is interesting!

What would happen if we enable P2P copies in SYCL RT in this case? Which context resulting sycl::event would be bound to? Queue context or a context where data for src accessors is or where data for dst accessor is?

I would think it would be bound to the queue, given that the memory copy operation is "enqueued" on that queue. However, thinking about it again, I am not sure how you would even do an explicit copy between accessors of different contexts, as device accessors are requested through the command group handler.

Instead, I think the place it would be useful is when the runtime migrates accessors between devices (is that what is happening here?), i.e.:

sycl::queue queueA, queueB; // Assume queueA and queueB have disjoint sets of devices.
buffer<int, 1> buff{range<1>{1024}};

queueA.submit([&](handler& cgh) {
  accessor accessorA{buff, cgh, read_write};
  // Do some work here
});

queueB.submit([&](handler& cgh) {
  accessor accessorB{buff, cgh, read_write}; // Should migrate data from queueA parent context to queueB parent context.
  // Do more work here
});

If both devices were in the same cuda plugin context this copy could be handled in the piEnqueueBufferCopy in the cuda plugin by either doing regular copy or P2P. So, it seems we can do at least the same if we go with "cuda plugin exposes several devices in the same context" approach.

I completely agree. The functionality exposed is likely also useful once support for multiple devices per context is implemented in the CUDA backend. However, as in the example mentioned above, it is the user that has (implicitly) ordered a data movement between two different contexts, which the backend cannot change. We should optimize for this case as well, if we can.

JackAKirk added 3 commits August 25, 2021 15:16
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Included header for backend_impl.hpp in graph_builder.cpp so that detail::getImplBackend may be called.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Comment on lines 140 to 142
_PI_API(piextEnqueueMemBufferCopyPeer)
_PI_API(piextEnqueueMemBufferCopyRectPeer)
_PI_API(piextEnqueueMemImageCopyPeer)
Contributor

Why do we need these new APIs? Why wouldn't the regular copy API perform P2P copies transparently under the hood?

Contributor Author

The regular API takes a single pi_queue, whereas the Peer API requires a second queue as an argument (principally so that the second context is known). The regular API is an OpenCL interface, so it cannot be changed. I think that a single API could be used if the new piext*** API were used in the runtime to replace the regular copy API.
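For context, a hypothetical, self-contained sketch of the Peer entry point's shape, inferred only from the call site quoted earlier in this thread (the real PI handle types are opaque, so placeholder typedefs are used here; the actual PR header may differ). The extra pi_queue is what carries the destination context that the regular single-queue signature cannot express:

#include <cstddef>

// Placeholder typedefs standing in for the real, opaque PI handle types.
typedef int pi_result;
typedef unsigned int pi_uint32;
typedef struct _pi_queue *pi_queue;
typedef struct _pi_mem *pi_mem;
typedef struct _pi_event *pi_event;

// Hypothetical declaration, matching the argument order of the quoted call.
pi_result piextEnqueueMemBufferCopyPeer(
    pi_queue src_queue,   // command queue; its context owns src_buffer
    pi_mem src_buffer,
    pi_queue dst_queue,   // the extra queue: identifies the destination context
    pi_mem dst_buffer,
    size_t src_offset, size_t dst_offset, size_t size,
    pi_uint32 num_events_in_wait_list,
    const pi_event *event_wait_list,
    pi_event *event);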

Contributor

the Peer API requires a second queue as an argument (principally so that the second context is known).

The pi_mem src & dst are created with interfaces that take a context, e.g. piextUSMDeviceAlloc or piMemBufferCreate, so backends already know the contexts of both src and dst and can act on that.

Contributor Author

OK, I missed that when I checked pi_mem. There is another reason for providing both queues, illustrated by e.g. this snippet from cuda_piextEnqueueMemBufferCopyRectPeer in pi_cuda.cpp line 4054:

  try {
    ScopedContext active(dst_queue->get_context());
    if (event_wait_list) {
      retErr = cuda_piEnqueueEventsWait(src_queue, num_events_in_wait_list,
                                        event_wait_list, nullptr);
    }

    if (event) {
      retImplEv = std::unique_ptr<_pi_event>(_pi_event::make_native(
          PI_COMMAND_TYPE_MEM_BUFFER_COPY_RECT, dst_queue));
      retImplEv->start();
    }

We wait on events associated with the source queue and return the event associated with the destination queue.
There were problems associated with returning the event associated with the src queue. Since the contexts can be found from pi_mem, I will look at this again and see if there is a way of doing things without the second queue argument.

Contributor Author

JackAKirk Aug 27, 2021

It turns out that it is fine to pass only a single queue in the PI to act as the command queue, and either the src_queue or the dst_queue can take that role.

My current understanding is that all implementation details of implicit peer-to-peer memory copies of buffer memory between devices sharing a SYCL context should be dealt with by the PI, so that the only implicit peer-to-peer copy case the runtime (via memory_manager) needs to handle is the cross-context case.

I will implement the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager, as suggested.

Contributor Author

JackAKirk Sep 1, 2021

I have now made the changes described above, implementing the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager as suggested.

_PI_API(piextEnqueueMemBufferCopyPeer)
_PI_API(piextEnqueueMemBufferCopyRectPeer)
_PI_API(piextEnqueueMemImageCopyPeer)
_PI_API(piextDevicesSupportP2P)
Contributor

Can we re-use the existing piDeviceGetInfo with a new PI_DEVICE_INFO_P2P_DEVICES, which would return all devices that have a P2P connection to this device?

Contributor Author

I don't think there is any CUDA/HIP driver API that takes a device as an argument and returns information on P2P connection status with all other devices. See https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html for the CUDA APIs. I think that any plugin API attempting to return all devices that have a P2P connection to a given device would require a list of all other devices (or equivalent) as an argument.

Contributor

I don't think that there is any cuda/hip driver API that takes a device as an argument and returns information on P2P connection status with all other devices.

I don't think so either, but you should be able to get back to the platform from your PI device and get a list of all devices; then you can check for each device whether access is allowed, or, in the case of the CUDA backend, just return the same list without the corresponding device in it.

As an additional note, maybe in that case it would make sense to make it a bit more granular, e.g. having two new queries, PI_DEVICE_INFO_P2P_READ_DEVICES and PI_DEVICE_INFO_P2P_WRITE_DEVICES (or something similar), that allow for monodirectional interconnects, just in case.
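As a sketch of how such a query could be implemented in the CUDA plugin (illustrative only; p2pReadDevices is a made-up helper, not code from this PR), one can filter the platform's device list with cuDeviceCanAccessPeer:

#include <cuda.h>
#include <vector>

// Sketch: devices whose memory `self` can read directly over P2P.
// A PI_DEVICE_INFO_P2P_WRITE_DEVICES variant would swap the query direction.
std::vector<CUdevice> p2pReadDevices(CUdevice self,
                                     const std::vector<CUdevice> &platformDevs) {
  std::vector<CUdevice> peers;
  for (CUdevice other : platformDevs) {
    if (other == self)
      continue;
    int canAccess = 0;
    // "Can contexts on `self` directly access memory on `other`?"
    if (cuDeviceCanAccessPeer(&canAccess, self, other) == CUDA_SUCCESS &&
        canAccess)
      peers.push_back(other);
  }
  return peers;
}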

Contributor Author

That sounds fine in the case where P2P copies are forbidden between devices that do not share the same platform. If cross-context P2P copies are allowed, wouldn't we need to access a list of all platforms?

Contributor

Good question. I'm not convinced that cross-platform P2P should be allowed, but I suppose a backend can make that decision. Currently the CUDA backend only has one platform so it should be sufficient to just use that one. If at some point the CUDA backend introduces more platforms and P2P should be allowed between them, calling piPlatformsGet or replicating the internal logic should get you all platforms.

Contributor Author

OK I see. Sure I can do it like that.

Contributor Author

I've addressed this now. If the change is acceptable, and provided we keep the cross-context memory copy runtime implementation, then I will implement the piDeviceGetInfo query for the other PIs that don't have it, since all PIs will now call it in graph_builder.

JackAKirk and others added 8 commits August 26, 2021 13:36
… of readable peer devices.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
The srcQueue can be used as the command_queue for peer to peer copy.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@bader bader added the cuda CUDA back-end label Sep 3, 2021
@JackAKirk JackAKirk changed the title [WIP][CUDA] P2P buffer/image memory copy [CUDA] P2P buffer/image memory copy Oct 1, 2021
Comment on lines +516 to +517
// P2P is currently unsupported in level zero
case PI_DEVICE_INFO_P2P_READ_DEVICES:
Contributor

Suggested change
// P2P is currently unsupported in level zero
case PI_DEVICE_INFO_P2P_READ_DEVICES:
case PI_DEVICE_INFO_P2P_READ_DEVICES:
// P2P is currently unsupported in level zero

This makes the association clearer and more consistent with the other cases. Something similar applies to sycl/plugins/hip/pi_hip.cpp and sycl/plugins/level_zero/pi_level_zero.cpp.


for (const auto &dev : device->get_platform()->devices_) {
devs.emplace_back(dev.get());
}
Contributor

After #4571 this will only report the device itself as supporting P2P, as each device has its own platform, right?

If the intention is still to report all devices, since explicit P2P memory copies can be made between them even though some go through the host, I wonder whether PI_DEVICE_INFO_P2P_READ_DEVICES is an ambiguous name, as it could also imply that the device can access the other devices' memory directly, which would require cuDeviceCanAccessPeer checks.

Contributor Author

JackAKirk Oct 20, 2021

After #4571 this will only report itself as supporting P2P as each device has its own platform, right?

Yes. After #4571 this PR no longer acts as a short-term fix for enabling direct P2P copies via the cross-context route before the CUDA PI is adapted to allow multiple CUDA contexts per SYCL context (allowing the preferred intra-context P2P route). #4571 can only be reverted once it becomes possible to do the P2P copy between devices sharing a SYCL context. It still has value in the cross-context optimization you described below: if for whatever reason a user decides to split backend devices/contexts into separate SYCL contexts, then we can also optimize that case. Until the change introduced in #4751 is reverted, this PR will simply do nothing.
I think that the most interesting remaining aspect of this PR is that it has revealed a lot of apparently redundant code which complicates graph_builder and commands via the unused ConnectionCmd return pointer. This appears to be a relic of an earlier, now obsolete implementation: the job of making sure the memory is in the correct context now always appears to be carried out in graph_builder::addCG.

If the intention is still to report all devices as they can do explicit P2P memory copies between them, even though some go through host, I wonder if PI_DEVICE_INFO_P2P_READ_DEVICES is an ambiguous name as it could also imply that the device can access the other devices memory directly, which would require cuDeviceCanAccessPeer checks.

Yes, I can try to pick a better name. 'PI_DEVICE_INFO_P2P_READ_DEVICES' returns a set of devices which is a superset of the set of devices that are capable of direct P2P, since it can include cases that will revert to the device-to-host-to-device route. It is also a query that is not exposed to the user, so I think a slightly longer, uglier, but more descriptive name would be OK, particularly since we don't want to confuse it with future queries: 'PI_DEVICE_INFO_P2P_READ_DEVICES' is probably more appropriate to reserve for a hypothetical future user-facing query that tells the user whether a device can access a peer's memory directly.

@steffenlarsen
Contributor

I still think the right approach would be to modify the cuda plugin

  1. to support multiple devices in the same context.
  2. to report devices that support efficient copy in the same platform.
  3. modify SYCL RT to issue piMigrateMem when contexts match but devices are different.

I am not convinced that these have to be mutually exclusive. This handles cases where the runtime needs to do inter-context copies, no matter the backend. You could have two contexts on the same platform with different devices that can do P2P copies, so this would be an optimization of that case.

But, whatever option we take we need to make sure it works for Level Zero.

Based on the Level Zero specification there should be a sufficient set of P2P operations to support the changes proposed here. From what I can tell, the behavior of the runtime should stay the same for backends that don't implement PI_DEVICE_INFO_P2P_READ_DEVICES.

@JackAKirk
Contributor Author

@romanovvlad
@alexbatashev
@smaslov-intel

We'd like to resolve this PR to get the PI extension for peer-to-peer copies merged, so we propose either merging this PR as-is (some small improvements can be added to address the remaining naming issue) as a temporary solution, or removing the SYCL runtime part and leaving the PI extension unused until we have multi-device SYCL context support.

As discussed, we believe that ConnectionCmd can be safely removed, since no tests fail when Command::processDepEvent always returns nullptr. As we understand it, all possible graphs that require memory moves across contexts are already handled in GraphBuilder::addCG. Although it apparently does nothing in the latest DPC++, the existence of ConnectionCmd can lead to unexpected behavior when changes to GraphBuilder::addCG are made, such as those in this PR. Its removal would considerably declutter the GraphBuilder and Commands classes, making the scheduling model easier to understand and modify.
If you would prefer that we address the ConnectionCmd problem in a separate issue, we can do that.

Contributor

alexbatashev left a comment

lgtm

@romanovvlad
Contributor

We'd like to resolve this PR to get the PI extension for peer-to-peer copies merged, so we propose either merging this PR as-is (some small improvements can be added to address the remaining naming issue) as a temporary solution, or removing the SYCL runtime part and have the PI extension unused until we have multi-device SYCL context support.

Could you please tell me whether there is any value in merging the PI extension (the CUDA plugin implementation) without making the changes in the SYCL RT?
I do not mind if, in the scope of this PR, we merge the plugin changes only and have a separate PR for the changes around ConnectionCmd, and maybe another (as a short-term measure, for discussion) for the change which uses the PI extension in the SYCL RT.

@@ -296,7 +296,8 @@ typedef enum {
PI_DEVICE_INFO_MAX_MEM_BANDWIDTH = 0x10026,
PI_DEVICE_INFO_IMAGE_SRGB = 0x10027,
PI_DEVICE_INFO_ATOMIC_64 = 0x10110,
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111
PI_DEVICE_INFO_ATOMIC_MEMORY_ORDER_CAPABILITIES = 0x10111,
PI_DEVICE_INFO_P2P_READ_DEVICES = 0x10112
Contributor

Could you please add a description for this here?

@JackAKirk
Contributor Author

JackAKirk commented Oct 28, 2021

Could you please tell if there is any value in merging PI extension(cuda plugin impl) without doing changes in the SYCL RT? I do not mind if in scope of this PR we merge plugin changes only and have a separate PR for changes around ConnectionCmd and maybe for change which uses PI extension in SYCL RT(as a short term) for discussion.

Merging the PI CUDA part of this PR (minus PI_DEVICE_INFO_P2P_READ_DEVICES) will have no effect on DPC++ without the corresponding changes to the RT that make use of it, but would allow the non-contentious part of this PR to be merged, simplifying any future PR that enables P2P copy usage in the RT.

Contributor

againull left a comment

PI changes look good to me.

@AerialMantis
Contributor

This PR is on hold until multi-device context support is added (#4381), so it has been inactive for a while, but we will resume it once that is done.

@AerialMantis AerialMantis reopened this Jun 1, 2022
@github-actions github-actions bot removed the Stale label Jun 2, 2022
@JackAKirk JackAKirk marked this pull request as draft November 28, 2022 11:45
@JackAKirk JackAKirk closed this May 23, 2023
Labels
cuda CUDA back-end
8 participants