
Fix GPU library conflict issue #3017

Merged

Conversation

@JieyangChen7 (Contributor)

This PR fixes a known GPU library conflict issue. When ADIOS2 and a third-party library (e.g., MGARD) are linked together and both use the Thrust/CUB library, calling Thrust/CUB APIs in the third-party library can cause runtime errors or unexpected behavior. Since ADIOS2 only depends on limited functionality in Thrust/CUB (just calculating min/max), we choose to fix this issue by providing our own implementation of kernels for calculating min/max values on the GPU. This removes the dependency on the Thrust/CUB library.
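(Editor's note: for illustration only, a minimal sketch of the kind of hand-written min/max reduction that can replace a Thrust/CUB call. The kernel name, element type, and launch configuration below are assumptions for the example, not the actual kernels added by this PR.)

```cuda
// Sketch: per-block min/max reduction in plain CUDA, final step on the host.
// Illustrative stand-in for the PR's kernels, not the actual ADIOS2 code.
#include <algorithm>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

__global__ void KMinMaxPartial(const float *in, size_t n, float *blockMin, float *blockMax)
{
    extern __shared__ float smem[]; // blockDim.x mins followed by blockDim.x maxs
    float *smin = smem;
    float *smax = smem + blockDim.x;

    // Grid-stride loop: each thread folds its elements into a private min/max.
    float lo = FLT_MAX, hi = -FLT_MAX;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        lo = fminf(lo, in[i]);
        hi = fmaxf(hi, in[i]);
    }
    smin[threadIdx.x] = lo;
    smax[threadIdx.x] = hi;
    __syncthreads();

    // Tree reduction in shared memory; thread 0 writes the block's result.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
        {
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        blockMin[blockIdx.x] = smin[0];
        blockMax[blockIdx.x] = smax[0];
    }
}

void CudaMinMax(const float *dValues, size_t n, float &min, float &max)
{
    const int threads = 256, blocks = 64;
    float *dMin, *dMax;
    cudaMalloc(&dMin, blocks * sizeof(float));
    cudaMalloc(&dMax, blocks * sizeof(float));
    KMinMaxPartial<<<blocks, threads, 2 * threads * sizeof(float)>>>(dValues, n, dMin, dMax);

    // Copy the small per-block arrays back and finish the reduction on the host.
    std::vector<float> hMin(blocks), hMax(blocks);
    cudaMemcpy(hMin.data(), dMin, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hMax.data(), dMax, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    min = *std::min_element(hMin.begin(), hMin.end());
    max = *std::max_element(hMax.begin(), hMax.end());
    cudaFree(dMin);
    cudaFree(dMax);
}
```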

@JasonRuonanWang (Member) left a comment

This seems to be the best solution for now. It will also avoid potential conflicts with other GPU-based lossy compressors.

@anagainaru (Contributor) left a comment

I would add all these changes in a different file and include them here to keep things clean.

```cpp
}
};

struct MimOp
```
Review comment (Contributor):

typo MinOp

```cpp
cudaMemcpyDeviceToHost);
cudaMemcpy(&max, thrust::raw_pointer_cast(res.second), sizeof(T),
cudaMemcpyDeviceToHost);
min = reduce<T, MimOp>(size, 1024, 64, 1, values);
```
Review comment (Contributor):

Same typo again
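(Editor's note: with the typo fixed, the operator passed to reduce would presumably read as follows. Only the call site is visible in the diff above, so the functor body here is an assumption about intent; the rename itself is what the review comments ask for.)

```cpp
// Sketch of the renamed operator (MimOp -> MinOp); the body is assumed.
struct MinOp
{
    template <typename T>
    __host__ __device__ T operator()(const T &a, const T &b) const
    {
        return a < b ? a : b;
    }
};

// Call site from the diff, with the corrected name:
// min = reduce<T, MinOp>(size, 1024, 64, 1, values);
```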

@JasonRuonanWang merged commit ebc4c8d into ornladios:master on Jan 26, 2022
@vicentebolea (Collaborator)

I am seeing regressions when running ctest --verbose -R CUDA.

(screenshot: failing CUDA test output, 2022-01-26)

And a warning

(screenshot: warning message, 2022-01-26)

Can you replicate this on your machine?

@vicentebolea (Collaborator)

On another topic: I am late to this discussion, but I am against taking this approach, since we are rewriting what is already in Thrust (adding an internal dependency and increasing code complexity). I also question whether this is really a problem for us to solve. The same issue can happen with the standard library/glibc, yet you do not see libraries re-implementing stdlib functionality. At some point we have to draw a line and say that if you want ADIOS2 with MGARD and CUDA, you have to build both of them with the same CUDA SDK version.

Another consideration is the case where we also ship the library in question internally.

vicentebolea added a commit to vicentebolea/ADIOS2 that referenced this pull request Jan 26, 2022
…ibrary-conflict-issue"

This reverts commit ebc4c8d, reversing changes made to f6cf3d8.
@germasch (Contributor)

> On another topic: I am late to this discussion, but I am against taking this approach, since we are rewriting what is already in Thrust (adding an internal dependency and increasing code complexity). I also question whether this is really a problem for us to solve. The same issue can happen with the standard library/glibc, yet you do not see libraries re-implementing stdlib functionality. At some point we have to draw a line and say that if you want ADIOS2 with MGARD and CUDA, you have to build both of them with the same CUDA SDK version.

Is it clear what exactly causes the problem? I fundamentally agree that fixing the problem is preferable to working around it. But I don't really understand where the issue comes from in the first place -- Thrust is a header-only library as far as I know, so it's not that ADIOS2 and MGARD end up linking against actual (possibly incompatible) libraries.
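(Editor's note: for context, judging from the diff excerpt quoted in the review above, the Thrust usage being removed was presumably along these lines. This is a reconstruction with assumed names, not the exact ADIOS2 code; it illustrates the header-only behavior under discussion, i.e., the Thrust/CUB device code is compiled into the including translation unit rather than linked from a separate library.)

```cpp
// Reconstruction of the kind of Thrust-based MinMax the PR removes (names assumed).
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

template <typename T>
void CudaMinMaxThrust(const T *dValues, size_t size, T &min, T &max)
{
    thrust::device_ptr<const T> begin(dValues);
    // minmax_element instantiates Thrust/CUB kernels directly in this
    // translation unit; there is no separate Thrust library to link against.
    auto res = thrust::minmax_element(begin, begin + size);
    cudaMemcpy(&min, thrust::raw_pointer_cast(res.first), sizeof(T), cudaMemcpyDeviceToHost);
    cudaMemcpy(&max, thrust::raw_pointer_cast(res.second), sizeof(T), cudaMemcpyDeviceToHost);
}
```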

@JasonRuonanWang (Member)

@vicentebolea @germasch

We have spent quite a few days on this, and @JieyangChen7 found others complaining about this CUDA bug too. So far the ONLY solution, whether you call it a fix or a workaround, is to not use Thrust in ADIOS2, and we have verified that this works. If anyone can come up with a smarter solution, you are more than welcome to propose it or open a pull request. If not, then we probably have to go this way, as making ADIOS2+MGARD GPU work in production is one of our top-priority goals at the moment, and we have many critical activities pending on it. In the meantime, I don't think the GPU MinMax calculation is used in any production workflows so far, so we have plenty of time to fix it if an issue turns up.

@germasch I don't understand it either, but unfortunately I don't have time to understand everything. The best I can do is to ensure important things work, even if sometimes I don't understand.

@germasch (Contributor)

Do you have some pointer to a discussion of the problem and how it occurs? In particular, is there a way to reproduce it? Does it involve different versions of CUDA, or is it just the two libraries both using Thrust from the same CUDA toolkit?

@JasonRuonanWang (Member) commented on Jan 27, 2022

> Do you have some pointer to a discussion of the problem and how it occurs? In particular, is there a way to reproduce it? Does it involve different versions of CUDA, or is it just the two libraries both using Thrust from the same CUDA toolkit?

@JieyangChen7 can provide details about the issue. It's easy to reproduce: compile ADIOS2 with CUDA enabled, compile the mgard-x branch of https://github.com/JieyangChen7/MGARD.git also with CUDA enabled, and link ADIOS2 with this MGARD installation. You will then be able to reproduce it by running ctest -VV -R MGARD. As soon as you remove everything Thrust-related from ADIOS2, the problem disappears. Have a try if you have time. You will be our hero if you find a more decent solution :)

@JasonRuonanWang (Member)

@germasch P.S. The testbeds we used only have one CUDA Toolkit installation, so it's impossible to mess up CUDA versions.
