
[SYCL][Doc] extension proposal and Impl for ext_oneapi_P2P. #4332

Closed
wants to merge 4 commits into from

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Aug 13, 2021

This extension introduces a new platform info descriptor, 'ext_oneapi_P2P', which returns a boolean value indicating whether Peer to Peer memory copies are supported by the platform vendor.

The immediate motivation for this extension is that it is a prerequisite for enabling P2P memcpy across distinct devices for CUDA. See sycl...JackAKirk:P2P and for corresponding tests: intel/llvm-test-suite@intel...JackAKirk:P2P_cuda_tests. This extension was chosen to ensure that other backends (such as ROCm) could use P2P memory copies in the future, without needing to change the runtime code other than to update the ext_oneapi_P2P platform query.

The branch implementing cuda P2P copies in the PI (sycl...JackAKirk:P2P) uses this extension in two places:

  1. sycl/source/detail/scheduler/graph_builder.cpp line 945: the P2P platform descriptor is used to determine whether the following code is executed:
NeedMemMoveToHost = true;

This determines whether or not a memory move to host is necessary. When NeedMemMoveToHost is set to false, a direct Peer to Peer memory copy is performed.

  2. sycl/source/detail/scheduler/commands.cpp line 512: The P2P platform descriptor is used by a redundancy check to determine whether a connection command should be added in order to ensure that memory is located in the correct context. When P2P memcpy is enabled, it is no longer necessary to copy memory via the host for memory copies between SYCL contexts, provided that both source and destination devices correspond to the same vendor.

JackAKirk and others added 4 commits August 13, 2021 13:40
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
renamed P2P -> ext_oneapi_P2P.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested review from smaslov-intel and a team as code owners August 13, 2021 15:35
@JackAKirk JackAKirk changed the title [SYCL][EXT] extension proposal and Impl for ext_oneapi_P2P. [SYCL][Doc] extension proposal and Impl for ext_oneapi_P2P. Aug 13, 2021
@@ -126,7 +126,8 @@ typedef enum {
   PI_PLATFORM_INFO_NAME = CL_PLATFORM_NAME,
   PI_PLATFORM_INFO_PROFILE = CL_PLATFORM_PROFILE,
   PI_PLATFORM_INFO_VENDOR = CL_PLATFORM_VENDOR,
-  PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION
+  PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION,
+  PI_PLATFORM_INFO_ext_oneapi_P2P = 0x40110
Contributor Author

I'm not completely sure whether 'ext_oneapi_P2P' is the preferred name for such an extension. I imagine that 'P2P' is not favoured since it does not indicate that it is an extension. I can also change the capitalization, e.g. to 'EXT_ONEAPI_P2P', once the preferred name is confirmed.

Contributor

We've added many device info extensions already, see

// These are Intel-specific extensions.

There's no naming convention; just use all capital case. The important part is to pick a value that isn't already used by anything else.

Contributor

@steffenlarsen steffenlarsen left a comment

This proposal adds a feature to SYCL that is intended to be used by the runtime only, but the information will be available user-side. As such I think it would be better to not have the SYCL interface for it and just keep the new PI features, having the runtime use those directly.

That said, knowing whether or not efficient P2P copying is available can be useful information for a user, but I am not convinced that a single yes/no from the platform is enough to give sufficient information.
Say for example we have 3 GPUs in a system, 2 of which are connected with some high-speed interconnect interface. Even though they may all be able to transfer P2P, only 2 of them can communicate with a high-speed interconnect. A user might want to know exactly which two are interconnected.
For such a query I can think of two options:

  1. Add a new query function that takes two devices and returns whether or not they are interconnected.
  2. Have a platform and/or context info descriptor for querying all pairs of devices that are interconnected. Note that interconnects may not be bi-directional.

Additionally, as an alternative to a yes/no whether two devices are interconnected, it could carry a value representing the speed of the interconnect relative to device-host-device transfers or something along those lines.

Pinging @gmlueck

Comment on lines +55 to +59
std::string vendor_name =
get_platform_info<string_class, info::platform::vendor>::get(plt,
Plugin);
bool result = (vendor_name == "NVIDIA Corporation") ? true : false;
return result;
Contributor

@steffenlarsen steffenlarsen Aug 16, 2021

This behavior should be handled by the backends, i.e.

Suggested change
-std::string vendor_name =
-    get_platform_info<string_class, info::platform::vendor>::get(plt,
-                                                                 Plugin);
-bool result = (vendor_name == "NVIDIA Corporation") ? true : false;
-return result;
+bool SupportsP2P = false;
+Plugin.call<PiApiKind::piPlatformGetInfo>(
+    plt, pi::cast<pi_platform_info>(info::platform::ext_oneapi_P2P),
+    sizeof(bool), &SupportsP2P, nullptr);
+return SupportsP2P;

and then implement cases for PI_PLATFORM_INFO_ext_oneapi_P2P in piPlatformGetInfo.

@gmlueck
Contributor

gmlueck commented Aug 16, 2021

This proposal adds a feature to SYCL that is intended to be used by the runtime only, but the information will be available user-side. As such I think it would be better to not have the SYCL interface for it and just keep the new PI features, having the runtime use those directly.

I have the same concerns as @steffenlarsen. This API seems too simplistic for the general case. If the immediate need is only to have some query to use internally in the DPC++ runtime, then I think it would be better to extend the "PI" interface without adding a user-visible extension.

If we decide to only extend the PI interface, it's not clear that we need to extend "enum class platform". I presume there is a way to extend the PI layer without changing this user-visible enum. Tagging @smaslov-intel who knows better about how to extend the PI layer with a new query.

Also tagging @jbrodman because he may have thoughts on what a general user-facing API should look like to query whether two devices can communicate efficiently via P2P.

@JackAKirk
Contributor Author

JackAKirk commented Aug 17, 2021

Thanks for the comments. I see that the user API introduced here has limited value for the user and could also be confusing.

As I understand it, a formal extension would not be necessary provided that the query is only used internally in the runtime. I am therefore happy to close this PR and make the changes that don't require an extension directly in an upcoming PR for cuda P2P copies, if that is OK?

I have investigated further, and it is possible to call the plugin directly at the locations in scheduler.cpp and commands.cpp that I specified in my original comment, without exposing the user to a new get_info descriptor. I think that the cleanest way to do this requires switching from a platform info descriptor to a context info descriptor.

It also requires either implementing piContextGetInfo, including the new P2P case, in the other backends, or silently catching returned implementation errors at the locations in scheduler.cpp and commands.cpp where the plugin calls are made. Since I cannot test all backends on my local machine, I will need to check that the test coverage is sufficient to catch any potential bugs in new implementations of piContextGetInfo in other backends. Alternatively, since the only case for which P2P copies will initially be supported is the cuda backend, I could also check for the cuda backend directly in scheduler.cpp and commands.cpp.

[width="40%",frame="topbot",options="header,footer"]
|======================
|Platform descriptors |Return type |Description
|info::platform::ext_oneapi_P2P |bool |Returns whether the platform supports Peer to Peer memory copies
|======================
Contributor

Should we make this check more fine-grained, i.e. check whether 2 given devices have P2P capabilities? I think it may happen that devices from the same platform have no P2P link, or that devices from different platforms (of the same backend) do have P2P. Here is how a similar query is defined in the Level-Zero API: https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=zedevicegetp2pproperties#_CPPv424zeDeviceGetP2PProperties18ze_device_handle_t18ze_device_handle_tP26ze_device_p2p_properties_t

Contributor Author

I think that the corresponding cuda check would use cuDeviceCanAccessPeer(int* canAccessPeer, CUdevice dev, CUdevice peerDev) which is summarized as 'Queries if a device may directly access a peer device's memory. '

However, for our current purposes this fine-grained binary device check isn't what we need. A more complete description of the P2P descriptor, as it is intended to be used within the runtime, is:
'P2P' descriptor:
'Indicates whether the backend used by the platform/context has an API which can perform P2P memory copies if the device topology allows; in the case that a pair of devices using the backend are not directly connected via PCIe or otherwise the API is lowered to a pair of memory copies, the first of which copies from the first device to host, the second of which copies from the host to the second device'

For the cuda API comprising cuMemcpyPeerAsync/cuMemcpy3DPeerAsync, in the case that P2P is not available and the second route is taken, the memory copy can still be a few times faster using the API compared to the currently existing case, whereby two separate sycl events are created corresponding to two separate cuda memcpy function calls (for P2H and H2P) made directly. For small array sizes there is little difference in speed whether the P2H H2P copy is controlled by the P2P API or called directly by the runtime. Quantitatively the speedup appears to depend on whether there is a memory copy between separate devices (peers) or a move between contexts on a single device (which also requires calling cuMemcpyPeerAsync etc. or making an explicit D2H H2D copy pair), but qualitatively the behaviour is the same, so that for all use cases calling the cu Peer memcpy API is preferred.

Apologies, I should have given a more thorough description originally. In any case I agree that the descriptor 'P2P' as defined above is probably of limited use to the user and could be confusing.

Contributor

whether the backend used by the platform/context has an API which can perform P2P memory copies if the device topology allows

So, if the answer is "true, the backend has P2P capabilities", then you still need to check the P2P capabilities of each individual pair of devices involved. Why not just have the device-level query, which would return "false" for any devices under a backend without a P2P API?

Contributor Author

@romanovvlad @smaslov-intel @steffenlarsen @AerialMantis
So for the cuda backend I don't think it is necessary to explicitly check the P2P capabilities of each individual pair of devices involved. The way I have implemented it initially is to call cuCtxEnablePeerAccess, which, as described in the docs, will return an error that can be silently handled if cuDeviceCanAccessPeer returns false (meaning that the connection topology does not allow P2P access). Then, whether cuDeviceCanAccessPeer is true or false, the cuda driver API Peer copy functions can be called (if cuDeviceCanAccessPeer is false it will do P2H H2P, and if true there will be a single P2P copy). Since the HIP/ROCm API is very similar to the cuda API, we think that it may behave similarly, but we have not confirmed this yet. However, if Level Zero does require a device-level query then it makes sense to have one, even if the CUDA backend doesn't check at this granularity.
There could be a binary PI query which takes the two devices as arguments, returns true in the case that both devices use the cuda backend, and performs a per-pair device check (e.g. for Level Zero using zeDeviceGetP2PProperties or similar) if such a check is required. This wouldn't be a big problem, and it sounds like this is what will be needed. However, I think there is a small but non-trivial technical issue with such a PI query taking two devices as arguments that is worth mentioning:

The binary device P2P query would first be needed in graph_builder::addCG. In addition to the Source device, the query would need to access the Destination device, which I think can only be immediately accessed via the 'Record' variable. The destination queue/device can be found from Record in the same way as is done in graph_builder::insertMemoryMove. Perhaps there is a cleaner way, but I don't see it currently.
The bigger problem is that there is a kind of 'redundancy check' in Command::processDepEvent that adds a connection command if the memory is not in the correct context, and it is triggered when graph_builder::addCG is adjusted to make a direct P2P memory copy instead of taking the D2H H2D route. There would need to be another binary device query at this point in the code, and currently I don't believe there is any immediate way to access the source device via e.g. the DepEvent instance in Command::processDepEvent, which cannot uniquely identify the source device when there is more than one device associated with the 'source' context.
One question perhaps someone can answer is what use case this 'redundancy check' is for. Removing it does not lead to any failed tests using the cuda backend. My current understanding is that any case where memory is not in the correct context is already handled in graph_builder::addCG (the only case I know of is when a second queue submits a command group using memory located on a device of a primary queue that has a different context). The question is therefore whether there is any circumstance in which Command::processDepEvent is called with the following if clause returning true, then creating the connection command, without the Command instance being created from graph_builder::insertMemoryMove via graph_builder::addCG:

  if (DepEventContext != WorkerContext && !WorkerContext->is_host()) {
    Scheduler::GraphBuilder &GB = Scheduler::getInstance().MGraphBuilder;
    ConnectionCmd = GB.connectDepEvent(this, DepEvent, Dep);
  }

The other important point mentioned is that it is currently undesirable that only a single cuda (I believe also ROCm) device/context (cuda uses only one device per context) can be held by a single sycl context, whereas for OpenCL it is normal that a single SYCL context is used for all devices. See here for a full description and resolution discussion of this issue. It sounds like it is a sizeable technical challenge to change the runtime and cuda PI so that a single SYCL context can hold more than one cuda context/device, so at least in the short term the only means of using multiple cuda devices is via multiple SYCL contexts.

How would you recommend we proceed with this? I am assuming that inter-context memory copy would still be supported in the case that all backends use a single SYCL context for several devices (at least via the host): presumably it would be necessary when more than one backend is used simultaneously, for memory copies between devices with different backends (and different SYCL contexts). If inter-context memory copy would then still be supported for devices sharing the same backend, then I think it would make sense to make use of the faster Peer to Peer copies where this is supported.

Contributor

I think the best course of action would be to scrap the extension and focus on the PI/runtime side of this.

The problem of having multiple CUDA contexts in a single SYCL context is an enormous can of worms and though related I would argue it's not the main focus of the changes you are introducing.

If the plan is to implement this in tandem with a P2P copy operation in PI, I suggest you implement another PI function for checking if two devices can do P2P copy from each other. Having the finer granularity potentially helps other backends. For the CUDA backend this query could always return true as the P2P copy in the CUDA driver API is able to make P2P copies even when the devices aren't interconnected. For other backends you can just leave them returning false (with a note to implement them correctly in the future). If the new P2P support query returns true you can then utilize the associated P2P copy functions, and if the query returns false then make the copy as a round-trip through the host, which I believe is the current implementation.

In short, I suggest you close this PR and implement the P2P support query as a new PI function that checks if the backend can P2P copy between two devices (or contexts?). This does not need an extension document. Then open a new PR with this, potentially together with the P2P copy function (and it being used by the runtime, if you'd like).

If you have any questions along the way, feel free to reach out or open a discussion topic here on Github.

Contributor Author

OK, thanks. The new PR is here: #4401. I will now close this PR.

@romanovvlad
Contributor

@JackAKirk
Can we think of another approach to utilize P2P copies? Instead of allowing direct copies between devices in different contexts, could we allow a set of devices that support P2P copies to be in the same context?

@smaslov-intel
Contributor

allow a set of devices that support P2P copies to be in the same context

But this is already allowed today, the only requirement for devices to be in the same context is for them to be on the same platform. Did you mean we would disallow devices without P2P capabilities to be in the same context, instead?
