rocThrust for thrust functionality in HIP #560

Merged: 10 commits merged into acts-project:main on May 18, 2024

Conversation

@StewMH (Contributor) commented Apr 30, 2024

Add a test of the rocThrust library so that we can use Thrust on AMD GPUs in the future. It is conditional on a new TRACCC_BUILD_HIP flag; since we don't have a plain HIP implementation of any algorithm yet, the flag is slight overkill for now.

Note that test_thrust.hip is just a copy of test_thrust.cu, so we could achieve the same thing via ifdefs and conditional CMake, although the directory test/cuda would then ideally be renamed.

Surprisingly, the two Thrust libraries can coexist, since only NVIDIA's Thrust defines THRUST_ENABLE_INSTALL_RULES, and therefore the CMake commands defined by the two do not clash. This is more or less by accident, but hopefully OK for now.
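
As a rough illustration of the wiring (a sketch only, not the actual diff; the option default and sub-directory names are assumptions based on the files touched by this PR):

# Sketch only; names follow the files in this PR but may not match the real CMake exactly.
option(TRACCC_BUILD_HIP "Build the HIP sources included in traccc" FALSE)

if(TRACCC_BUILD_HIP)
  # Set up rocThrust through an extern/ wrapper, like the other externals.
  add_subdirectory(extern/rocThrust)
  # Build the HIP unit tests, including the Thrust smoke test.
  add_subdirectory(tests/hip)
endif()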

@StewMH (Contributor Author) commented Apr 30, 2024

cc @CrossR; I can add this on top of #558 once the reviews for that are complete.

@StewMH (Contributor Author) commented Apr 30, 2024

Unclear to me why this is failing:

[ 12%] Creating directories for 'rocprim-download'
[ 25%] Performing download step (git clone) for 'rocprim-download'
-- rocprim-download download command succeeded.  See also /__w/traccc/traccc/build/rocprim-download/rocprim-download-prefix/src/rocprim-download-stamp/rocprim-download-download-*.log
[ 37%] No patch step for 'rocprim-download'
[ 50%] Performing configure step for 'rocprim-download'
CMake Error at /__w/traccc/traccc/build/rocprim-download/rocprim-download-prefix/src/rocprim-download-stamp/rocprim-download-configure-.cmake:49 (message):
  Command failed: 1

   '/usr/local/bin/cmake' '-DBUILD_TEST=OFF' '-DCMAKE_INSTALL_PREFIX=/__w/traccc/traccc/build/_deps/rocthrust-build/deps/rocprim' '-DCMAKE_PREFIX_PATH=/opt/rocm' '-GUnix Makefiles' '/__w/traccc/traccc/build/rocprim-src'

  See also

    /__w/traccc/traccc/build/rocprim-download/rocprim-download-prefix/src/rocprim-download-stamp/rocprim-download-configure-*.log


make[2]: *** [CMakeFiles/rocprim-download.dir/build.make:92: rocprim-download-prefix/src/rocprim-download-stamp/rocprim-download-configure] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/rocprim-download.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
CMake Error at build/_deps/rocthrust-src/cmake/DownloadProject.cmake:168 (message):
  Build step for rocprim failed: 2

Locally I have rocPRIM installed, but I can force the download with -DDOWNLOAD_ROCPRIM=TRUE. I can't replicate the issue; the download succeeds for me.

@StewMH (Contributor Author) commented May 8, 2024

OK, I ran the build locally with the Docker container. It complains that:

CMake Error at cmake/VerifyCompiler.cmake:29 (message):
  On ROCm platform 'hipcc' or HIP-aware Clang must be used as C++ compiler.

So we need a new container with ROCm + Clang. I'm trying it out with Ubuntu 22.04.
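
For reference, a minimal sketch of what satisfies that check, assuming the usual ROCm install location (the /opt/rocm path is an assumption about the container layout, not something taken from this PR):

# Sketch only: point CMake at the Clang bundled with ROCm, so that
# rocThrust's VerifyCompiler.cmake check passes. Paths are assumptions.
set(CMAKE_C_COMPILER   "/opt/rocm/llvm/bin/clang"   CACHE FILEPATH "C compiler")
set(CMAKE_CXX_COMPILER "/opt/rocm/llvm/bin/clang++" CACHE FILEPATH "C++ compiler")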

@StewMH (Contributor Author) commented May 8, 2024

This should now compile with acts-project/machines#96, so we need to wait until the ubuntu2204_rocm_clang image is available.

@krasznaa (Member) commented May 8, 2024

🤔 I'm not all too happy about this... 😦 If rocThrust can't deal with HIP not being the language used for the host code build, then we have a problem 😦, since they don't just want "any version" of Clang; 🤔 they want the Clang version shipped with HIP.

(It's a longer story, but AMD's "HIP language support in CMake" uses the underlying clang++ executable directly; it doesn't go through the hipcc frontend. That setup is also a bit wonky when combining GCC + Clang for the project build, though it does work at least somewhat.)

More than anything else, this looks like a shortcoming in their CMake configuration. I'm not at all convinced that the basic problem is on our side, even if an update of the Docker images may well be due.
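
For context, a rough sketch of that setup, assuming CMake's built-in HIP language support (CMake 3.21+); the compiler path is an assumption about a typical ROCm installation:

# Sketch only: keep GCC (or any host compiler) for C++, while CMake's HIP
# language support drives ROCm's clang++ directly for the device code.
cmake_minimum_required(VERSION 3.21)
project(traccc_hip_sketch LANGUAGES CXX)

# The path is an assumption about a typical ROCm install.
set(CMAKE_HIP_COMPILER "/opt/rocm/llvm/bin/clang++" CACHE FILEPATH "HIP compiler")
enable_language(HIP)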

@StewMH (Contributor Author) commented May 8, 2024

We could potentially get around this by installing rocPRIM and maybe some other developer tools in advance. That's the setup I have locally.
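
A sketch of the idea, assuming rocThrust keeps honouring its existing DOWNLOAD_ROCPRIM switch (the hint path is an assumption; this is not code from the PR):

# Sketch only: prefer a rocPRIM that is already installed in the container,
# and only ask rocThrust to download one as a fallback.
find_package(rocprim CONFIG QUIET HINTS "/opt/rocm")
if(NOT rocprim_FOUND)
  message(STATUS "rocPRIM not found, letting rocThrust download it")
  set(DOWNLOAD_ROCPRIM TRUE CACHE BOOL "Download rocPRIM during the build")
endif()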

@StewMH (Contributor Author) commented May 8, 2024

Actually, rocThrust demands Clang:

https://github.com/ROCm/rocThrust/blob/develop/cmake/VerifyCompiler.cmake

despite being perfectly happy to build without it. There is an issue, ROCm/rocThrust#238, which I can prod.

@StewMH (Contributor Author) commented May 15, 2024

No response from AMD on the issue above. However, there is probably some value in running a Clang build in the CI, even if we are doing it for the wrong reasons.

The CI looks OK now - @krasznaa, @stephenswat could I have a review?

@krasznaa (Member) left a comment:

A couple of things... 😉

@@ -30,6 +30,9 @@ jobs:
- name: CPU
container: ghcr.io/acts-project/ubuntu2004:v30
options: -DTRACCC_USE_ROOT=FALSE
- name: HIP
container: ghcr.io/acts-project/ubuntu2204_rocm_clang
@krasznaa (Member):

You need to put a particular version number here. We can't have the CI using the "latest" image. (That would allow the CI to break whenever somebody updates the image in some unforeseen way.)

@StewMH (Contributor Author):

My understanding is that we need a new tag of machines for this. I could change this to ubuntu2204_rocm_clang:sha-3a6b0b2 for now, which matches latest. Or we can wait for a new machines tag.

@krasznaa (Member):

Then we should make a new tag of it. Though again, I believe that once we turn off the unit test builds of rocThrust, we could possibly survive with something as old as ghcr.io/acts-project/ubuntu1804_rocm:v11. 🤔

@StewMH (Contributor Author):

Unfortunately not. It still insists on running its compiler checks with this change...

Further review threads on these files (all resolved):
CMakeLists.txt
cmake/traccc-compiler-options-hip.cmake (×2)
extern/rocThrust/CMakeLists.txt (×2)
tests/hip/CMakeLists.txt
@StewMH (Contributor Author) commented May 16, 2024

I followed up on the CMake warnings in ROCm/rocm-cmake#154.

@krasznaa (Member) commented:

Stewart, how would you feel about me intervening here? 🤔 I'd like to put krasznaa@1c26d46 on top of your developments. Technically I could do it myself, but I didn't want to modify your branch before running it by you.

Long story short, I spent some time yesterday and today forcing rocThrust into submission, i.e. forcing its CMake configuration to work with our existing ROCm/HIP setups. Mind you, it now works for me with both HIP_PLATFORM=amd and HIP_PLATFORM=nvidia. 😄

Though there is still one glaring issue with all of this: we absolutely must depend on rocThrust privately only, as I don't see yet how we could export rocThrust properly with the rest of our code. (Their CMake kung-fu is not amazing.) 😦
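
On the "private dependency" point, a minimal sketch of what that means in practice (the target names are illustrative; roc::rocthrust is assumed to be the imported target provided by rocThrust's CMake config):

# Sketch only: keep rocThrust out of the exported interface of the traccc
# targets by linking it PRIVATE; only stable dependencies stay PUBLIC.
add_library(traccc_hip ${TRACCC_HIP_SOURCES})
target_link_libraries(traccc_hip
  PRIVATE roc::rocthrust   # must not leak into the installed/exported config
  PUBLIC  vecmem::core)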

@StewMH (Contributor Author) commented May 17, 2024

Looks good to me. I did consider patching it, but thought that might be getting a bit too complicated.

@krasznaa (Member) commented:

So... We're on a good track, but there is still one issue to sort out... 🤔

This setup now works with older versions of ROCm/HIP, and with HIP's NVIDIA backend (even when using a newer version of HIP). But at least with HIP 6.0.2, which I tested just now on my home PC that has an AMD card in it, the current setup fails to build with:

[build] [ 68%] Linking HIP executable ../../bin/traccc_test_hip
[build] /usr/bin/ld: /opt/rocm-6.0.2/lib/libamdhip64.so: undefined reference to `hsa_amd_memory_copy_engine_status@ROCR_1'
[build] /usr/bin/ld: /opt/rocm-6.0.2/lib/libamdhip64.so: undefined reference to `hsa_amd_memory_async_copy_on_engine@ROCR_1'
[build] clang++: error: linker command failed with exit code 1 (use -v to see invocation)
[build] gmake[2]: *** [tests/hip/CMakeFiles/traccc_test_hip.dir/build.make:105: bin/traccc_test_hip] Error 1
[build] gmake[1]: *** [CMakeFiles/Makefile2:5021: tests/hip/CMakeFiles/traccc_test_hip.dir/all] Error 2
[build] gmake[1]: *** Waiting for unfinished jobs....

This is an interesting issue. The missing symbols come from libhsa-runtime64.so, which is absolutely part of my ROCm/HIP installation. But so far, for the naive HIP tests that we did in these projects, I never needed that library. 🤔 So

https://github.com/acts-project/vecmem/blob/main/cmake/FindHIPToolkit.cmake

doesn't look for it; it only sets up HIP::hiprt (which is the only library that we have needed to link against explicitly so far).

Long story short, we'll need a new VecMem tag as well before this PR is in a really good state. 😦
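
If the library did turn out to be needed, one way to extend such a find module could look roughly like this (a sketch only; the HIP::hsaruntime target name and the search paths are assumptions, not existing VecMem code):

# Hypothetical addition to a FindHIPToolkit.cmake-style module: locate the
# HSA runtime inside the ROCm installation and expose it as an imported
# target that HIP::hiprt (or the user) could link against.
find_library(HIPToolkit_hsa_LIBRARY
  NAMES hsa-runtime64
  HINTS "${ROCM_PATH}" "/opt/rocm"
  PATH_SUFFIXES lib lib64)
if(HIPToolkit_hsa_LIBRARY AND NOT TARGET HIP::hsaruntime)
  add_library(HIP::hsaruntime UNKNOWN IMPORTED)
  set_target_properties(HIP::hsaruntime PROPERTIES
    IMPORTED_LOCATION "${HIPToolkit_hsa_LIBRARY}")
endif()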

@krasznaa (Member) left a comment:

Turns out: no. 🤔 It's rather a misconfiguration on my machine.

I have multiple different versions of ROCm installed here, with the "default" being version 5.4 (since oneAPI's AMD plugin supports that version officially). But in this build I was trying to use 6.0 instead, and my environment setup clearly breaks down with that. 😦

As soon as I use version 5.4 (which has the same library setup), things work fine. So there is probably some need to make the VecMem code smarter later on, but this PR does not need to wait for that.

@krasznaa merged commit 1441363 into acts-project:main on May 18, 2024. 19 of 20 checks passed.