Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA][XPTI] Fix XPTI-based CUDA tracing capabilities #1173

Closed

Conversation

alexbatashev
Copy link
Contributor

This patch introduces four major changes:

  • Enable XPTI tracing for out-of-tree adapters by actually linking them with XPTI static library and define XPTI_ENABLE_INSTRUMENTATION macro (which is globally defined in SYCL Runtime builds)
  • Replace stream name prefix "sycl.experimental" with "ur.adapter"
  • Eliminate debug stream. The reasons for its existance are historical. In SYCL Runtime there was no mechanism to capture PI layer arguments. The first version of XPTI would only provide function name, which was enough for profiling, but not enough for debugging purposes. Once argument capturing was introduced in XPTI, SYCL Runtime started exporting a new stream. The reason for that is preformance: packaging arguments introduces unwanted overhead for profilers. CUDA plugin (as well as Level Zero) simply followed the same pattern without reflecting on the reasons behind it. However, having two streams doesn't make much sense since CUPTI always returns both function arguments and return value in its callbacks. Hence, the debug stream is now call stream, and the original call stream is gone for good.
  • Add documentation on the actual contents of the call stream for CUDA Adapter.

This patch introduces four major changes:
- Enable XPTI tracing for out-of-tree adapters by actually linking them with XPTI static library
  and define XPTI_ENABLE_INSTRUMENTATION macro (which is globally defined in SYCL Runtime builds)
- Replace stream name prefix "sycl.experimental" with "ur.adapter"
- Eliminate debug stream. The reasons for its existance are historical. In SYCL Runtime there was
  no mechanism to capture PI layer arguments. The first version of XPTI would only provide
  function name, which was enough for profiling, but not enough for debugging purposes. Once
  argument capturing was introduced in XPTI, SYCL Runtime started exporting a new stream. The
  reason for that is preformance: packaging arguments introduces unwanted overhead for profilers.
  CUDA plugin (as well as Level Zero) simply followed the same pattern without reflecting on the
  reasons behind it. However, having two streams doesn't make much sense since CUPTI always returns
  both function arguments and return value in its callbacks. Hence, the debug stream is now call
  stream, and the original call stream is gone for good.
- Add documentation on the actual contents of the call stream for CUDA Adapter.
@alexbatashev alexbatashev requested review from a team as code owners December 9, 2023 07:25
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e69ed21) 15.70% compared to head (942a85f) 15.70%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1173   +/-   ##
=======================================
  Coverage   15.70%   15.70%           
=======================================
  Files         223      223           
  Lines       31518    31518           
  Branches     3556     3556           
=======================================
  Hits         4951     4951           
  Misses      26516    26516           
  Partials       51       51           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -22,14 +22,11 @@
#include <iostream>

#ifdef XPTI_ENABLE_INSTRUMENTATION
constexpr auto CUDA_CALL_STREAM_NAME = "sycl.experimental.cuda.call";
constexpr auto CUDA_DEBUG_STREAM_NAME = "sycl.experimental.cuda.debug";
constexpr auto CUDA_CALL_STREAM_NAME = "ur.adapter.cuda.call";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case of UR, the stream name is just ur. For consistency, should we call this one ur.adapter.cuda or rename the ur one to ur.call ? I have no preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, ur.call is a better choice, because there's so much more that I'd like UR to export to XPTI other than calls. Device-side profiling info, diagnostic messages, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing existing stream name is also better done in a separate PR, and I guess, I'll have to prepare PRs for SYCL as well.

@alexbatashev
Copy link
Contributor Author

@pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I was able to track down a differently configured machine and reproduce that issue. Apparently, UR uses CMake 3.14 and find_package(CUDA), which has been deprecated for a very long time. The reason behind that in the original SYCL repository was compatibility with upstream LLVM, which in turn did require compatibility with Ubuntu 20.04. Now that both UR being a separate project and LLVM uplifting CMake requirements to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there's no objection, I can make a separate change for that, and then rebase this patch.

@pbalcer
Copy link
Contributor

pbalcer commented Dec 11, 2023

@pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I was able to track down a differently configured machine and reproduce that issue. Apparently, UR uses CMake 3.14 and find_package(CUDA), which has been deprecated for a very long time. The reason behind that in the original SYCL repository was compatibility with upstream LLVM, which in turn did require compatibility with Ubuntu 20.04. Now that both UR being a separate project and LLVM uplifting CMake requirements to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there's no objection, I can make a separate change for that, and then rebase this patch.

Sounds good. I think the reason we used 3.14 is because that's what sycl cmake file uses - so that will need updating as well.
@kbenzie ?

There's another patch, #1070, that uses dlopen for cupti. It should solve this problem, but I think your suggested fix is reasonable regardless.

@kbenzie
Copy link
Contributor

kbenzie commented Dec 11, 2023

@pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I was able to track down a differently configured machine and reproduce that issue. Apparently, UR uses CMake 3.14 and find_package(CUDA), which has been deprecated for a very long time. The reason behind that in the original SYCL repository was compatibility with upstream LLVM, which in turn did require compatibility with Ubuntu 20.04. Now that both UR being a separate project and LLVM uplifting CMake requirements to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there's no objection, I can make a separate change for that, and then rebase this patch.

Sounds good. I think the reason we used 3.14 is because that's what sycl cmake file uses - so that will need updating as well. @kbenzie ?

I think bumping the required CMake version should be reported as an issue so we can address it in our own time. No need to open an PR. There are a number of things to consider when doing a bump like this like making sure all our CI will continuing working with the change.

There's another patch, #1070, that uses dlopen for cupti. It should solve this problem, but I think your suggested fix is reasonable regardless.

This is necessary for other reasons so we will merge that in any case, if it also removes the need to bump the required CMake version that's a bonus.

@kbenzie
Copy link
Contributor

kbenzie commented Dec 11, 2023

Speaking of #1070, doesn't that fix this same issue?

@alexbatashev
Copy link
Contributor Author

Speaking of #1070, doesn't that fix this same issue?

I don't see it setting XPTI_ENABLE_INSTRUMENTATION (but that's a minor issue), and I'm not really a big fan of dlopen. Normally, cupti is part of a default CUDA installation, but if you really have problems locating it on the system, I could suggest link it statically instead. Having cupti path hardcoded in the adapter may create problems on other systems with a different location of a library.

@alexbatashev
Copy link
Contributor Author

I think bumping the required CMake version should be reported as an issue so we can address it in our own time. No need to open an PR. There are a number of things to consider when doing a bump like this like making sure all our CI will continuing working with the change.

As I said earlier, SYCL Runtime did have it aligned to LLVM version, and that was the very reason it didn't use CUDAToolkit. That's not the case now: https://github.com/llvm/llvm-project/blob/a4e67de96f0a9833756b6c79fff3cd6ee459fee0/llvm/CMakeLists.txt#L3 - SYCL is built with CMake 3.20 and whatever is in sycl/CMakeLists.txt is just ignored. As long as UR uses the same environment, the change should be safe.

@kbenzie
Copy link
Contributor

kbenzie commented Dec 12, 2023

@oneapi-src/unified-runtime-cuda-write and particularly @pasaulais should be involved in this discussion.

@npmiller
Copy link
Contributor

Speaking of #1070, doesn't that fix this same issue?

I don't see it setting XPTI_ENABLE_INSTRUMENTATION (but that's a minor issue), and I'm not really a big fan of dlopen. Normally, cupti is part of a default CUDA installation, but if you really have problems locating it on the system, I could suggest link it statically instead. Having cupti path hardcoded in the adapter may create problems on other systems with a different location of a library.

We've had a lot of problem with just linking cupti because even though it may be part of the CUDA installation it's usually not setup somewhere where the system linker can find it, that seems like more of a CUDA package problem to me, and not something we can resolve in general.

And this is a big issue because when it can't find the cupti library the SYCL runtime just fails to load the CUDA plugin and just silently reports no available CUDA devices.

This is why we've decided to switch it to dlopen, that way we can have the plugin work just fine even if cupti is missing or not somewhere the system linker can find it. And then cupti only becomes necessary when people are trying to use tracing which makes sense.

@kbenzie kbenzie added specification Changes or additions to the specification cuda CUDA adapter specific issues labels Apr 10, 2024
@kbenzie
Copy link
Contributor

kbenzie commented May 24, 2024

Closing due to going with the approach in #1070.

@kbenzie kbenzie closed this May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA adapter specific issues specification Changes or additions to the specification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants