Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA] Implement sycl_ext_oneapi_peer_access extension #8303

Merged
merged 89 commits into from
Jul 10, 2023

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Feb 10, 2023

This implements the current extension doc from #6104 in the CUDA backend only.

Fixes #7543.
Fixes #6749.

npmiller and others added 7 commits February 6, 2023 14:30
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526 the current mapping of
CUDA context to PI context, causes issues for device based entry points
that still need to call the CUDA APIs, we have workarounds to solve that
but they're a bit hacky, inefficient, and have a lot of edge case
issues.

The peer to peer interface proposal in intel#6104, is also device
based, but enabling peer to peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context, and simply managing the CUDA contexts in the
devices, it also changes the CUDA context management to always use the
primary context.

This approach as a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with less opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks from the previous approach:

* Drops support for `make_context` interop, no sensible "native handle"
  to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this, it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more fine
grained CUDA context management.
Older versions of gcc struggle with attributes on namespaces
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title [SYCL][CUDA] CUDA backend impl of ONEAPI P2P extension. [SYCL][CUDA] CUDA backend impl of ONEAPI USM P2P extension. Feb 10, 2023
@gmlueck
Copy link
Contributor

gmlueck commented Feb 10, 2023

This implements the current extension doc from #6104 (minus peer_access::access_enabled because it isn't natively supported by CUDA)

This should be resolved. We want our extensions to be fully implemented on all backends. If this part of the API cannot be implemented on CUDA, we should remove it from the extension spec. However, I thought it could be implemented by simply keeping track in software whether P2P has been enabled for each device. Don't we need that anyway in order to diagnose errors correctly?

@zjin-lcf
Copy link
Contributor

@JackAKirk I reported one of the issues. Is there some test program for me to execute ? Thanks.

@JackAKirk
Copy link
Contributor Author

@JackAKirk I reported one of the issues. Is there some test program for me to execute ? Thanks.

Yes, here are the two main use cases that I have been testing:
intel/llvm-test-suite@intel...JackAKirk:llvm-test-suite:p2p_examples
I've just cleaned them up a bit, hopefully they still compile OK. I haven't explicitly checked.

You can turn the access on and off and observe how this affects the P2P usage with nsys, but obviously you need two devices. Note that one interesting thing is that for the cuda backend the kernel access is unidirectional, but the P2P copies are bidirectional: if I enable p2p access of device 1 from device 0 then I can do P2P copies both ways, but P2P access only the direction I specified. I did not find any Nvidia documentation that explains this.

The P2P query function and how errors are handled is still subject to change in the specification.

@zjin-lcf
Copy link
Contributor

Thanks! I tried to run the modified program, but the result of the P2P memory copy is not right after P2P copy is enabled. Not sure if this is reproducible.

#include <cassert>
#include <memory>
#include <sycl/sycl.hpp>

using namespace sycl;

int main() {

  std::vector<sycl::device> Devs;

  // Note that this code is temporary due to the temporary lack of multiple devices per sycl context in the nvidia backend.
  ////////////////////////
  for (const auto &plt : sycl::platform::get_platforms()) {

    if (plt.get_backend() == sycl::backend::cuda)
      Devs.push_back(plt.get_devices()[0]);
  }
  ////////////////////////

  ///// Enable bi-directional peer copies
  Devs[0].ext_oneapi_enable_peer_access(Devs[1]);

  std::vector<sycl::queue> Queues;
  std::transform(Devs.begin(), Devs.end(), std::back_inserter(Queues),
      [](const sycl::device &D) { return sycl::queue{D}; });

  assert(Queues.size() > 1);

  int N = 100;
  int *input = (int *)malloc(sizeof(int) * N);
  for (int i = 0; i < N; i++) {
    input[i] = i;
  }

  int *arr0 = malloc<int>(N, Queues[0], usm::alloc::device);
  Queues[0].memcpy(arr0, input, N * sizeof(int)).wait();

  int *arr1 = malloc<int>(N, Queues[1], usm::alloc::device);

  // Copy device usm allocated in devices/cuContexts
  //Queues[0].copy(arr1, arr0, N).wait();
  Queues[1].copy(arr1, arr0, N).wait();
                                                          
  int *out;
  out = new int[N];
  //Queues[0].copy(out, arr1, N).wait();
  Queues[1].copy(out, arr1, N).wait();

  sycl::free(arr0, Queues[0]);
  sycl::free(arr1, Queues[1]);

  bool ok = true;
  for (int i = 0; i < N; i++) {
    if (out[i] != input[i]) {
      printf("%d %d\n", out[i], input[i]);
      ok = false; //break;
    }
  }
  delete[] out;

  printf("%s\n", ok ? "PASS" : "FAIL");

  return 0;
}

@JackAKirk
Copy link
Contributor Author

Thanks! I tried to run the modified program, but the result of the P2P memory copy is not right after P2P copy is enabled. Not sure if this is reproducible.

I've fixed it here: https://github.com/intel/llvm-test-suite/compare/intel...JackAKirk:llvm-test-suite:p2p_examples?expand=1
I had swapped src and dst in the memcpy calls. For some reason the spec has the order of them swapped wrt copy.
Thanks

@gmlueck
Copy link
Contributor

gmlueck commented Feb 28, 2023

I had swapped src and dst in the memcpy calls. For some reason the spec has the order of them swapped wrt copy.

You are noting that memcpy has a different parameter order from copy? This done on purpose to align with standard C++ functions. Standard C++ functions named "copy" have the source operand first and the destination operand second.

https://en.cppreference.com/w/cpp/algorithm/copy
https://en.cppreference.com/w/cpp/filesystem/copy

@JackAKirk JackAKirk changed the title [SYCL][CUDA] CUDA backend impl of ONEAPI USM P2P extension. [SYCL][CUDA] backend impl of ONEAPI USM P2P extension. Mar 3, 2023
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk temporarily deployed to aws March 3, 2023 19:51 — with GitHub Actions Inactive
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@zjin-lcf
Copy link
Contributor

zjin-lcf commented Mar 3, 2023

The updated p2p example in SYCL might be helpful for you.

https://github.com/zjin-lcf/HeCBench/tree/master/p2p-sycl

@JackAKirk JackAKirk temporarily deployed to aws March 3, 2023 20:44 — with GitHub Actions Inactive
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk temporarily deployed to aws March 6, 2023 11:55 — with GitHub Actions Inactive
@jandres742
Copy link
Contributor

@jandres742 @gmlueck @smaslov-intel are you OK with these latest changes?

What is the implementation status of this extension as of this PR? Is it fully implemented on CUDA? Is it implemented on other backends too? (I see changes to the Level Zero backend, for example.)

I see that the extension document is still in the "proposed" directory. Is it time to move it to "supported"?

@gmlueck : we will add the support to the L0 backend in a follow-up patch.

Copy link
Contributor

@jandres742 jandres742 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on L0 and UR common code.

@JackAKirk
Copy link
Contributor Author

@jandres742 @gmlueck @smaslov-intel are you OK with these latest changes?

What is the implementation status of this extension as of this PR? Is it fully implemented on CUDA? Is it implemented on other backends too? (I see changes to the Level Zero backend, for example.)

I see that the extension document is still in the "proposed" directory. Is it time to move it to "supported"?

It is only fully implemented on cuda. For L0 and hip the p2p query function returns false, the enable/disable functions then return an error if they are called.
I can move it to supported at this point if you wish. I'm not sure when this normally happens.

@gmlueck
Copy link
Contributor

gmlueck commented Jul 6, 2023

It is only fully implemented on cuda. For L0 and hip the p2p query function returns false, the enable/disable functions then return an error if they are called.
I can move it to supported at this point if you wish. I'm not sure when this normally happens.

Is someone scheduled to do the remaining work soon? If yes, we can delay moving the spec until that happens. If there are no immediate plans, we should move the document in this PR so that CUDA users know that extension is available.

@JackAKirk
Copy link
Contributor Author

It is only fully implemented on cuda. For L0 and hip the p2p query function returns false, the enable/disable functions then return an error if they are called.
I can move it to supported at this point if you wish. I'm not sure when this normally happens.

Is someone scheduled to do the remaining work soon? If yes, we can delay moving the spec until that happens. If there are no immediate plans, we should move the document in this PR so that CUDA users know that extension is available.

I don't know if someone is scheduled to add the impl for l0 soon or not. It should be very simple but will require a system of two gpus or more for verification. I can move the document in this PR.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested a review from a team as a code owner July 7, 2023 08:55
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 09:11 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 10:21 — with GitHub Actions Inactive
std::ignore = param_value_size_ret;

die("piextPeerAccessGetInfo not "
"implemented in L0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than die, shouldn't we return some sort of "false" status, indicating that P2P isn't available (yet)? That way we can document this extension as "supported", and we can enable end-to-end tests on all backends.

Same for the other backends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the following changes to the API specification:

  • Update the "Status" section using the wording in the template.

  • Add a section "Backend support status" noting that this extension is supported only for the CUDA backend. I'd suggest wording like:

This extension is currently implemented in DPC++ for all devices and backends, however, only the CUDA backend allows peer to peer memory access. Other backends report false from the ext_oneapi_can_access_peer query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I've made these changes now.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 13:09 — with GitHub Actions Inactive
JackAKirk added 2 commits July 7, 2023 14:11
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 13:30 — with GitHub Actions Inactive

This extension is currently implemented in DPC++ for all GPU devices and
backends, however, only the CUDA backend allows peer to peer memory access.
Other backends report false from the ext_oneapi_can_access_peer query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Other backends report false from the ext_oneapi_can_access_peer query.
Other backends report false from the `ext_oneapi_can_access_peer query`.

Code font is better here.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 15:01 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws July 7, 2023 16:05 — with GitHub Actions Inactive
@JackAKirk
Copy link
Contributor Author

Any more reviews for this?

@JackAKirk JackAKirk temporarily deployed to aws July 10, 2023 09:36 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws July 10, 2023 10:15 — with GitHub Actions Inactive
@JackAKirk
Copy link
Contributor Author

@smaslov-intel can this be merged?

Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @intel/llvm-gatekeepers would merge

@dm-vodopyanov dm-vodopyanov changed the title [SYCL][CUDA] backend impl of ONEAPI USM P2P extension. [SYCL][CUDA] Implement sycl_ext_oneapi_peer_access extension Jul 10, 2023
@dm-vodopyanov dm-vodopyanov merged commit 62ecb84 into intel:sycl Jul 10, 2023
15 checks passed
veselypeta pushed a commit to veselypeta/llvm that referenced this pull request Sep 21, 2023
)

This implements the current extension doc from
intel#6104 in the CUDA backend only.

Fixes intel#7543.
Fixes intel#6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
fabiomestre pushed a commit to fabiomestre/llvm that referenced this pull request Sep 26, 2023
)

This implements the current extension doc from
intel#6104 in the CUDA backend only.

Fixes intel#7543.
Fixes intel#6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SYCL][CUDA] memory access across peer devices NVIDIA MultiGPU support for SYCL
10 participants