[CUDA][XPTI] Fix XPTI-based CUDA tracing capabilities #1173

alexbatashev · 2023-12-09T07:25:47Z

This patch introduces four major changes:

- Enable XPTI tracing for out-of-tree adapters by actually linking them with the XPTI static library and defining the XPTI_ENABLE_INSTRUMENTATION macro (which is globally defined in SYCL Runtime builds).
- Replace the stream name prefix "sycl.experimental" with "ur.adapter".
- Eliminate the debug stream. The reasons for its existence are historical. In the SYCL Runtime there was no mechanism to capture PI layer arguments. The first version of XPTI provided only the function name, which was enough for profiling but not for debugging. Once argument capturing was introduced in XPTI, the SYCL Runtime started exporting a new stream. The reason for that is performance: packaging arguments introduces unwanted overhead for profilers. The CUDA plugin (as well as Level Zero) simply followed the same pattern without reflecting on the reasons behind it. However, having two streams doesn't make much sense, since CUPTI always returns both the function arguments and the return value in its callbacks. Hence, the debug stream is now the call stream, and the original call stream is gone for good.
- Add documentation on the actual contents of the call stream for the CUDA adapter.
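
To illustrate why the first item matters, here is a minimal sketch of how instrumentation guarded by XPTI_ENABLE_INSTRUMENTATION silently compiles away when the build does not define the macro. `TraceLog`, `traceCall` and `tracingCompiledIn` are hypothetical names for illustration only; they are not part of XPTI or the adapter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a tracing backend; this is NOT an XPTI type.
struct TraceLog {
  std::vector<std::string> events;
};

// Records the call only when the build defines XPTI_ENABLE_INSTRUMENTATION.
// Without the definition the body is a no-op, which is how an out-of-tree
// build can end up with tracing quietly disabled.
inline void traceCall(TraceLog &log, const std::string &functionName) {
#ifdef XPTI_ENABLE_INSTRUMENTATION
  log.events.push_back(functionName);
#else
  (void)log;
  (void)functionName;
#endif
}

// Reports at compile time whether the instrumentation path was built in.
constexpr bool tracingCompiledIn() {
#ifdef XPTI_ENABLE_INSTRUMENTATION
  return true;
#else
  return false;
#endif
}
```

Building the same file with and without `-DXPTI_ENABLE_INSTRUMENTATION` shows the difference: `traceCall` records an event only in the first case.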

codecov-commenter · 2023-12-11T08:32:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e69ed21) 15.70% compared to head (942a85f) 15.70%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1173   +/-   ##
=======================================
  Coverage   15.70%   15.70%           
=======================================
  Files         223      223           
  Lines       31518    31518           
  Branches     3556     3556           
=======================================
  Hits         4951     4951           
  Misses      26516    26516           
  Partials       51       51

pbalcer · 2023-12-11T11:00:15Z

source/adapters/cuda/tracing.cpp

@@ -22,14 +22,11 @@
 #include <iostream>

 #ifdef XPTI_ENABLE_INSTRUMENTATION
-constexpr auto CUDA_CALL_STREAM_NAME = "sycl.experimental.cuda.call";
-constexpr auto CUDA_DEBUG_STREAM_NAME = "sycl.experimental.cuda.debug";
+constexpr auto CUDA_CALL_STREAM_NAME = "ur.adapter.cuda.call";


In the case of UR, the stream name is just ur. For consistency, should we call this one ur.adapter.cuda, or rename the UR one to ur.call? I have no preference.

I think ur.call is the better choice, because there's much more that I'd like UR to export to XPTI than calls: device-side profiling info, diagnostic messages, etc.

Changing an existing stream name is also better done in a separate PR, and I guess I'll have to prepare PRs for SYCL as well.
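
As a rough sketch of what the naming scheme buys a subscriber, the snippet below tells adapter streams apart from the main UR stream by prefix. The "ur.adapter." prefix comes from this patch; `kAdapterPrefix` and `isAdapterStream` are hypothetical helpers, not XPTI or UR APIs, and the code assumes C++17.

```cpp
#include <cassert>
#include <string_view>

// Prefix introduced by this patch for adapter-specific streams.
constexpr std::string_view kAdapterPrefix = "ur.adapter.";

// Hypothetical helper: true when a stream name belongs to an adapter,
// e.g. "ur.adapter.cuda.call", and false for "ur" or "ur.call".
constexpr bool isAdapterStream(std::string_view name) {
  // substr clamps the length, so names shorter than the prefix are safe.
  return name.substr(0, kAdapterPrefix.size()) == kAdapterPrefix;
}
```

A subscriber could use a check like this when a stream is initialized to decide whether it cares about adapter-level events at all.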

alexbatashev · 2023-12-11T11:31:31Z

@pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I tracked down a differently configured machine and reproduced the issue there. Apparently, UR requires CMake 3.14 and uses find_package(CUDA), which has been deprecated for a very long time. The original SYCL repository did that for compatibility with upstream LLVM, which in turn required compatibility with Ubuntu 20.04. Now that UR is a separate project and LLVM has raised its CMake requirement to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there are no objections, I can make a separate change for that and then rebase this patch.

pbalcer · 2023-12-11T12:17:51Z

> @pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I tracked down a differently configured machine and reproduced the issue there. Apparently, UR requires CMake 3.14 and uses find_package(CUDA), which has been deprecated for a very long time. The original SYCL repository did that for compatibility with upstream LLVM, which in turn required compatibility with Ubuntu 20.04. Now that UR is a separate project and LLVM has raised its CMake requirement to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there are no objections, I can make a separate change for that and then rebase this patch.

Sounds good. I think the reason we used 3.14 is that it's what the SYCL CMake file uses, so that will need updating as well.
@kbenzie?

There's another patch, #1070, that uses dlopen for cupti. It should solve this problem, but I think your suggested fix is reasonable regardless.

kbenzie · 2023-12-11T15:39:02Z

> @pbalcer I see the configuration step failing due to a missing cupti library. I don't see that locally, but I tracked down a differently configured machine and reproduced the issue there. Apparently, UR requires CMake 3.14 and uses find_package(CUDA), which has been deprecated for a very long time. The original SYCL repository did that for compatibility with upstream LLVM, which in turn required compatibility with Ubuntu 20.04. Now that UR is a separate project and LLVM has raised its CMake requirement to 3.20, we can switch to find_package(CUDAToolkit), which apparently resolves the issue. If there are no objections, I can make a separate change for that and then rebase this patch.

> Sounds good. I think the reason we used 3.14 is that it's what the SYCL CMake file uses, so that will need updating as well. @kbenzie?

I think bumping the required CMake version should be reported as an issue so we can address it in our own time. No need to open a PR. There are a number of things to consider when doing a bump like this, such as making sure all our CI will continue working with the change.

> There's another patch, #1070, that uses dlopen for cupti. It should solve this problem, but I think your suggested fix is reasonable regardless.

This is necessary for other reasons, so we will merge that in any case. If it also removes the need to bump the required CMake version, that's a bonus.

kbenzie · 2023-12-11T15:40:10Z

Speaking of #1070, doesn't that fix this same issue?

alexbatashev · 2023-12-11T17:29:51Z

> Speaking of #1070, doesn't that fix this same issue?

I don't see it setting XPTI_ENABLE_INSTRUMENTATION (but that's a minor issue), and I'm not really a big fan of dlopen. Normally, cupti is part of a default CUDA installation, but if you really have problems locating it on the system, I would suggest linking it statically instead. Hardcoding the cupti path in the adapter may create problems on other systems where the library lives in a different location.

alexbatashev · 2023-12-11T17:41:09Z

> I think bumping the required CMake version should be reported as an issue so we can address it in our own time. No need to open a PR. There are a number of things to consider when doing a bump like this, such as making sure all our CI will continue working with the change.

As I said earlier, the SYCL Runtime kept its CMake requirement aligned with LLVM's, and that was the very reason it didn't use CUDAToolkit. That's no longer the case: https://github.com/llvm/llvm-project/blob/a4e67de96f0a9833756b6c79fff3cd6ee459fee0/llvm/CMakeLists.txt#L3 - SYCL is built with CMake 3.20, and whatever is in sycl/CMakeLists.txt is just ignored. As long as UR uses the same environment, the change should be safe.

kbenzie · 2023-12-12T10:28:02Z

@oneapi-src/unified-runtime-cuda-write and particularly @pasaulais should be involved in this discussion.

npmiller · 2023-12-13T11:59:48Z

> > Speaking of #1070, doesn't that fix this same issue?

> I don't see it setting XPTI_ENABLE_INSTRUMENTATION (but that's a minor issue), and I'm not really a big fan of dlopen. Normally, cupti is part of a default CUDA installation, but if you really have problems locating it on the system, I would suggest linking it statically instead. Hardcoding the cupti path in the adapter may create problems on other systems where the library lives in a different location.

We've had a lot of problems with just linking cupti: even though it may be part of the CUDA installation, it's usually not set up somewhere the system linker can find it. That seems like more of a CUDA packaging problem to me, and not something we can resolve in general.

And this is a big issue, because when it can't find the cupti library, the SYCL runtime fails to load the CUDA plugin and silently reports no available CUDA devices.

This is why we've decided to switch to dlopen: that way the plugin works just fine even if cupti is missing or not somewhere the system linker can find it. cupti then only becomes necessary when people are trying to use tracing, which makes sense.
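
A minimal sketch of the idea, assuming a POSIX system with `<dlfcn.h>`: the tracing library is opened at run time, and a missing library only disables tracing instead of preventing the plugin from loading. `OptionalLibrary` is a hypothetical wrapper, not the code from #1070, and any library name passed to it (such as "libcupti.so") is an assumption about the target system.

```cpp
#include <cassert>
#include <dlfcn.h>

// Hypothetical RAII wrapper around dlopen/dlclose (POSIX only).
// A failed load is not an error here: it just means the optional
// feature (tracing, in this discussion) is unavailable.
class OptionalLibrary {
public:
  explicit OptionalLibrary(const char *soname)
      : handle_(dlopen(soname, RTLD_LAZY | RTLD_LOCAL)) {}

  ~OptionalLibrary() {
    if (handle_ != nullptr)
      dlclose(handle_);
  }

  OptionalLibrary(const OptionalLibrary &) = delete;
  OptionalLibrary &operator=(const OptionalLibrary &) = delete;

  // True when the library was found by the dynamic loader.
  bool available() const { return handle_ != nullptr; }

  // Looks up a symbol; returns nullptr if the library or symbol is missing.
  void *symbol(const char *name) const {
    return handle_ != nullptr ? dlsym(handle_, name) : nullptr;
  }

private:
  void *handle_;
};
```

With a wrapper like this, the adapter could check `available()` before enabling tracing and carry on normally otherwise. On older glibc versions the program may need to be linked with `-ldl`.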

kbenzie · 2024-05-24T15:37:08Z

Closing due to going with the approach in #1070.

alexbatashev requested review from a team as code owners December 9, 2023 07:25

pbalcer approved these changes Dec 11, 2023

kbenzie added the specification (Changes or additions to the specification) and cuda (CUDA adapter specific issues) labels Apr 10, 2024

kbenzie closed this May 24, 2024
