[SYCL][CUDA] Implement sycl_ext_oneapi_peer_access extension #8303

JackAKirk · 2023-02-10T15:43:44Z

This implements the current extension doc from #6104 in the CUDA backend only.

Fixes #7543.
Fixes #6749.

This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts differ from SYCL contexts in that they are tied to a single device and must be active on a thread for most calls to the CUDA driver API. As shown in intel#8124 and intel#7526, the current mapping of CUDA context to PI context causes issues for device-based entry points that still need to call the CUDA APIs. We have workarounds for that, but they are a bit hacky, inefficient, and have a lot of edge-case issues. The peer-to-peer interface proposal in intel#6104 is also device based, but enabling peer-to-peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

This patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices. It also changes the CUDA context management to always use the primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device-based entry points
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches
* Easier and likely more efficient interactions with CUDA runtime applications
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared with the previous approach:

* Drops support for `make_context` interop, since there is no sensible "native handle" to pass in (`get_native` is still supported).
* No opportunity for users to separate their work into different CUDA contexts. It's unclear whether there is any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit it if we run into an edge case that needs finer-grained CUDA context management.

Older versions of gcc struggle with attributes on namespaces

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

gmlueck · 2023-02-10T18:08:39Z

> This implements the current extension doc from #6104 (minus peer_access::access_enabled because it isn't natively supported by CUDA)

This should be resolved. We want our extensions to be fully implemented on all backends. If this part of the API cannot be implemented on CUDA, we should remove it from the extension spec. However, I thought it could be implemented by simply keeping track in software whether P2P has been enabled for each device. Don't we need that anyway in order to diagnose errors correctly?
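
The software-tracking idea above could be sketched roughly as follows. This is a hypothetical illustration in plain C++, not the plugin's actual code: `PeerAccessTracker`, its member functions, and the integer device IDs are all invented for the example, and it assumes that enabling access twice (or disabling access that was never enabled) should be reported as an error.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <utility>

// Hypothetical helper (not part of DPC++): remembers which
// (device, peer) pairs currently have peer access enabled.
class PeerAccessTracker {
public:
  // Record that `device` may now access memory on `peer`.
  // Assumption: enabling an already-enabled pair is an error.
  void enable(int device, int peer) {
    if (!enabled_.emplace(device, peer).second)
      throw std::invalid_argument("peer access already enabled");
  }

  // Forget the pair again.
  // Assumption: disabling a pair that is not enabled is an error.
  void disable(int device, int peer) {
    if (enabled_.erase(std::make_pair(device, peer)) == 0)
      throw std::invalid_argument("peer access not enabled");
  }

  // Pure software query; no call into the native driver is needed.
  bool access_enabled(int device, int peer) const {
    return enabled_.count(std::make_pair(device, peer)) != 0;
  }

private:
  std::set<std::pair<int, int>> enabled_;
};
```

Because each pair is ordered, enabling access from device 0 to device 1 says nothing about the reverse direction, which matches the directional nature of the enable call.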

zjin-lcf · 2023-02-21T21:49:46Z

@JackAKirk I reported one of the issues. Is there a test program I can run? Thanks.

JackAKirk · 2023-02-24T17:28:18Z

> @JackAKirk I reported one of the issues. Is there a test program I can run? Thanks.

Yes, here are the two main use cases that I have been testing:
intel/llvm-test-suite@intel...JackAKirk:llvm-test-suite:p2p_examples
I've just cleaned them up a bit, hopefully they still compile OK. I haven't explicitly checked.

You can turn the access on and off and observe how this affects the P2P usage with nsys, but obviously you need two devices. One interesting point is that, for the CUDA backend, kernel access is unidirectional but P2P copies are bidirectional: if I enable P2P access of device 1 from device 0, then I can do P2P copies both ways, but kernel P2P access works only in the direction I specified. I did not find any Nvidia documentation that explains this.

The P2P query function and how errors are handled is still subject to change in the specification.

zjin-lcf · 2023-02-27T14:53:56Z

Thanks! I tried to run the modified program, but the result of the P2P memory copy is not right after P2P copy is enabled. Not sure if this is reproducible.

#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <iterator>
#include <vector>
#include <sycl/sycl.hpp>

using namespace sycl;

int main() {

  std::vector<sycl::device> Devs;

  // Note that this code is temporary due to the temporary lack of multiple
  // devices per sycl context in the nvidia backend.
  ////////////////////////
  for (const auto &plt : sycl::platform::get_platforms()) {

    if (plt.get_backend() == sycl::backend::cuda)
      Devs.push_back(plt.get_devices()[0]);
  }
  ////////////////////////

  // Two CUDA devices are required; check before indexing Devs[1].
  if (Devs.size() < 2) {
    printf("Need at least two CUDA devices\n");
    return 1;
  }

  ///// Enable bi-directional peer copies
  Devs[0].ext_oneapi_enable_peer_access(Devs[1]);

  std::vector<sycl::queue> Queues;
  std::transform(Devs.begin(), Devs.end(), std::back_inserter(Queues),
      [](const sycl::device &D) { return sycl::queue{D}; });

  assert(Queues.size() > 1);

  int N = 100;
  int *input = (int *)malloc(sizeof(int) * N);
  for (int i = 0; i < N; i++) {
    input[i] = i;
  }

  // queue::memcpy takes (dest, src, numBytes).
  int *arr0 = malloc<int>(N, Queues[0], usm::alloc::device);
  Queues[0].memcpy(arr0, input, N * sizeof(int)).wait();

  int *arr1 = malloc<int>(N, Queues[1], usm::alloc::device);

  // Copy device usm allocated in devices/cuContexts
  // NOTE: queue::copy takes (src, dest, count), so as written this call
  // copies from arr1 into arr0 rather than from arr0 into arr1.
  //Queues[0].copy(arr1, arr0, N).wait();
  Queues[1].copy(arr1, arr0, N).wait();

  int *out;
  out = new int[N];
  // NOTE: same argument order as above; as written, `out` is the source.
  //Queues[0].copy(out, arr1, N).wait();
  Queues[1].copy(out, arr1, N).wait();

  sycl::free(arr0, Queues[0]);
  sycl::free(arr1, Queues[1]);

  bool ok = true;
  for (int i = 0; i < N; i++) {
    if (out[i] != input[i]) {
      printf("%d %d\n", out[i], input[i]);
      ok = false; //break;
    }
  }
  delete[] out;
  std::free(input);

  printf("%s\n", ok ? "PASS" : "FAIL");

  return 0;
}

JackAKirk · 2023-02-28T16:35:04Z

> Thanks! I tried to run the modified program, but the result of the P2P memory copy is not right after P2P copy is enabled. Not sure if this is reproducible.

I've fixed it here: https://github.com/intel/llvm-test-suite/compare/intel...JackAKirk:llvm-test-suite:p2p_examples?expand=1
I had swapped src and dst in the memcpy calls. For some reason the spec has the order of them swapped wrt copy.
Thanks

gmlueck · 2023-02-28T16:56:13Z

> I had swapped src and dst in the memcpy calls. For some reason the spec has the order of them swapped wrt copy.

Are you noting that memcpy has a different parameter order from copy? This is done on purpose to align with standard C++ functions. Standard C++ functions named "copy" have the source operand first and the destination operand second.

https://en.cppreference.com/w/cpp/algorithm/copy
https://en.cppreference.com/w/cpp/filesystem/copy
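
For illustration, here is a minimal sketch of the two conventions using only the standard library. The helper names `copy_with_std_copy` and `copy_with_memcpy` are made up for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstring>

// std::copy: the source range comes first, the destination last.
std::array<int, 4> copy_with_std_copy(const std::array<int, 4> &src) {
  std::array<int, 4> dst{};
  std::copy(src.begin(), src.end(), dst.begin());
  return dst;
}

// std::memcpy: the destination comes first, then the source,
// and the size is given in bytes rather than elements.
std::array<int, 4> copy_with_memcpy(const std::array<int, 4> &src) {
  std::array<int, 4> dst{};
  std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
  return dst;
}
```

SYCL 2020 mirrors both conventions: `queue::copy(src, dest, count)` counts elements with the source first, while `queue::memcpy(dest, src, numBytes)` counts bytes with the destination first.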

zjin-lcf · 2023-03-03T20:33:57Z

The updated p2p example in SYCL might be helpful for you.

https://github.com/zjin-lcf/HeCBench/tree/master/p2p-sycl

jandres742 · 2023-07-06T18:37:07Z

> @jandres742 @gmlueck @smaslov-intel are you OK with these latest changes?
>
> What is the implementation status of this extension as of this PR? Is it fully implemented on CUDA? Is it implemented on other backends too? (I see changes to the Level Zero backend, for example.)
>
> I see that the extension document is still in the "proposed" directory. Is it time to move it to "supported"?

@gmlueck: we will add the support to the L0 backend in a follow-up patch.

jandres742

+1 on L0 and UR common code.

JackAKirk · 2023-07-06T18:54:19Z

> @jandres742 @gmlueck @smaslov-intel are you OK with these latest changes?
>
> What is the implementation status of this extension as of this PR? Is it fully implemented on CUDA? Is it implemented on other backends too? (I see changes to the Level Zero backend, for example.)
>
> I see that the extension document is still in the "proposed" directory. Is it time to move it to "supported"?

It is only fully implemented on CUDA. For L0 and HIP, the P2P query function returns false, and the enable/disable functions return an error if they are called.
I can move it to supported at this point if you wish. I'm not sure when this normally happens.

gmlueck · 2023-07-06T21:07:40Z

> It is only fully implemented on CUDA. For L0 and HIP, the P2P query function returns false, and the enable/disable functions return an error if they are called.
> I can move it to supported at this point if you wish. I'm not sure when this normally happens.

Is someone scheduled to do the remaining work soon? If yes, we can delay moving the spec until that happens. If there are no immediate plans, we should move the document in this PR so that CUDA users know that the extension is available.

JackAKirk · 2023-07-07T08:50:08Z

> > It is only fully implemented on CUDA. For L0 and HIP, the P2P query function returns false, and the enable/disable functions return an error if they are called.
> > I can move it to supported at this point if you wish. I'm not sure when this normally happens.
>
> Is someone scheduled to do the remaining work soon? If yes, we can delay moving the spec until that happens. If there are no immediate plans, we should move the document in this PR so that CUDA users know that the extension is available.

I don't know whether someone is scheduled to add the implementation for L0 soon. It should be very simple, but it will require a system with two or more GPUs for verification. I can move the document in this PR.

gmlueck · 2023-03-22T21:52:43Z

sycl/plugins/level_zero/pi_level_zero.cpp

+  std::ignore = param_value_size_ret;
+
+  die("piextPeerAccessGetInfo not "
+      "implemented in L0");


Rather than die, shouldn't we return some sort of "false" status, indicating that P2P isn't available (yet)? That way we can document this extension as "supported", and we can enable end-to-end tests on all backends.

Same for the other backends.
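
A rough sketch of what "report false instead of dying" could look like is given below. It is deliberately written against invented names (`Status`, `peer_access_get_info_unsupported`) in plain C++ and does not reproduce the real PI entry point or its signature; it only assumes the usual size/value/size-ret query pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical status codes for this sketch only.
enum class Status { Success, InvalidValue };

// Hypothetical query for a backend without P2P support: instead of
// aborting, it reports "not supported" (0) through the output buffer
// so callers can test for availability and carry on.
Status peer_access_get_info_unsupported(std::size_t value_size, void *value,
                                        std::size_t *value_size_ret) {
  const int supported = 0; // P2P is never available on this backend

  // Tell the caller how many bytes the answer occupies, if asked.
  if (value_size_ret)
    *value_size_ret = sizeof(supported);

  // Write the answer only if a large enough buffer was provided.
  if (value) {
    if (value_size < sizeof(supported))
      return Status::InvalidValue;
    std::memcpy(value, &supported, sizeof(supported));
  }
  return Status::Success;
}
```

With a query shaped like this, an end-to-end test can run on every backend and simply skip the P2P path when the answer is 0.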

gmlueck · 2023-07-07T11:46:10Z

sycl/doc/extensions/supported/sycl_ext_oneapi_peer_access.asciidoc

Please make the following changes to the API specification:

Update the "Status" section using the wording in the template.

Add a section "Backend support status" noting that this extension is supported only for the CUDA backend. I'd suggest wording like:

> This extension is currently implemented in DPC++ for all devices and backends, however, only the CUDA backend allows peer to peer memory access. Other backends report false from the ext_oneapi_can_access_peer query.

JackAKirk · 2023-07-07

Thanks for the suggestion, I've made these changes now.

gmlueck · 2023-07-07T14:23:42Z

sycl/doc/extensions/supported/sycl_ext_oneapi_peer_access.asciidoc

+
+This extension is currently implemented in DPC++ for all GPU devices and
+backends, however, only the CUDA backend allows peer to peer memory access.
+Other backends report false from the ext_oneapi_can_access_peer query.


Suggested change

-Other backends report false from the ext_oneapi_can_access_peer query.
+Other backends report false from the `ext_oneapi_can_access_peer` query.

Code font is better here.

JackAKirk · 2023-07-10T09:27:34Z

Any more reviews for this?

JackAKirk · 2023-07-10T15:39:30Z

@smaslov-intel can this be merged?

smaslov-intel

LGTM, @intel/llvm-gatekeepers would merge

This implements the current extension doc from intel#6104 in the CUDA backend only.

Fixes intel#7543.
Fixes intel#6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>

npmiller and others added 7 commits February 6, 2023 14:30

[SYCL][CUDA] Move deprecation warning to class

8685475

Older versions of gcc struggle with attributes on namespaces

Initial P2P impl.

de16f88

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

added ext_oneapi_disable_peer_access and ext_oneapi_can_access_peer.

b5f9481

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Introduced pi_peer_attr.

64ecf25

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Format.

15d4bf6

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Merge branch 'sycl' into P2P-primary-ctxt

a35294f

JackAKirk requested review from jbrodman and gmlueck February 10, 2023 15:43

JackAKirk changed the title ~~[SYCL][CUDA] CUDA backend impl of ONEAPI P2P extension.~~ [SYCL][CUDA] CUDA backend impl of ONEAPI USM P2P extension. Feb 10, 2023

JackAKirk changed the title ~~[SYCL][CUDA] CUDA backend impl of ONEAPI USM P2P extension.~~ [SYCL][CUDA] backend impl of ONEAPI USM P2P extension. Mar 3, 2023

JackAKirk added 3 commits March 3, 2023 09:34

Format.

df55a69

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Format.

ddca3c3

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Format.

c3a2009

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk mentioned this pull request Mar 3, 2023

[SYCL][CUDA] Introduced USM P2P tests intel/llvm-test-suite#1631

Draft

Merge branch 'sycl' into P2P-primary-ctxt

f0f448d

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Corrected hip pi die function.

1855367

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk added 2 commits March 6, 2023 02:59

Added esimd p2p pi functions.

644c880

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

fix mistake in last commit.

e5b421e

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

jandres742 approved these changes Jul 6, 2023

Moved p2p ext doc to supported.

c389980

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk requested a review from a team as a code owner July 7, 2023 08:55

gmlueck reviewed Jul 7, 2023

Added Backend support status, updated status.

8bd6b60

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk added 2 commits July 7, 2023 14:11

Updated sycl 2020 revision version.

5e7d821

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Switch to Greg's suggested wording.

ab3ac25

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

gmlueck reviewed Jul 7, 2023

Use code font for function name.

47acd23

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

gmlueck approved these changes Jul 7, 2023

Merge branch 'sycl' into P2P-primary-ctxt

4ab6215

smaslov-intel approved these changes Jul 10, 2023

dm-vodopyanov changed the title ~~[SYCL][CUDA] backend impl of ONEAPI USM P2P extension.~~ [SYCL][CUDA] Implement sycl_ext_oneapi_peer_access extension Jul 10, 2023

dm-vodopyanov merged commit 62ecb84 into intel:sycl Jul 10, 2023
15 checks passed
