
Fix GPU library conflict issue #3017

Merged

Conversation

@JieyangChen7 (Contributor)

This PR fixes a known GPU library conflict issue. When ADIOS2 and a third-party library (e.g., MGARD) are linked together and both use the Thrust/CUB library, calling Thrust/CUB APIs in the third-party library can cause runtime errors or unexpected behavior. Since ADIOS2 only depends on limited functionality in Thrust/CUB (just calculating min/max), we choose to fix this issue by providing our own implementation of kernels for calculating min/max values on the GPU. This removes the dependency on the Thrust/CUB library.
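(Editor's note: for illustration only, a minimal sketch of the kind of hand-written min/max reduction that can replace a Thrust/CUB call. The kernel name, element type, and launch configuration below are assumptions for the example, not the actual kernels added by this PR.)

```cuda
// Sketch: per-block min/max reduction in plain CUDA, final step on the host.
// Illustrative stand-in for the PR's kernels, not the actual ADIOS2 code.
#include <algorithm>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

__global__ void KMinMaxPartial(const float *in, size_t n, float *blockMin, float *blockMax)
{
    extern __shared__ float smem[]; // blockDim.x mins followed by blockDim.x maxs
    float *smin = smem;
    float *smax = smem + blockDim.x;

    // Grid-stride loop: each thread folds its elements into a private min/max.
    float lo = FLT_MAX, hi = -FLT_MAX;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        lo = fminf(lo, in[i]);
        hi = fmaxf(hi, in[i]);
    }
    smin[threadIdx.x] = lo;
    smax[threadIdx.x] = hi;
    __syncthreads();

    // Tree reduction in shared memory; thread 0 writes the block's result.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
        {
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        blockMin[blockIdx.x] = smin[0];
        blockMax[blockIdx.x] = smax[0];
    }
}

void CudaMinMax(const float *dValues, size_t n, float &min, float &max)
{
    const int threads = 256, blocks = 64;
    float *dMin, *dMax;
    cudaMalloc(&dMin, blocks * sizeof(float));
    cudaMalloc(&dMax, blocks * sizeof(float));
    KMinMaxPartial<<<blocks, threads, 2 * threads * sizeof(float)>>>(dValues, n, dMin, dMax);

    // Copy the small per-block arrays back and finish the reduction on the host.
    std::vector<float> hMin(blocks), hMax(blocks);
    cudaMemcpy(hMin.data(), dMin, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hMax.data(), dMax, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    min = *std::min_element(hMin.begin(), hMin.end());
    max = *std::max_element(hMax.begin(), hMax.end());
    cudaFree(dMin);
    cudaFree(dMax);
}
```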

@JasonRuonanWang (Member) left a comment

This seems to be the best solution for now. It will also avoid potential conflicts with other GPU-based lossy compressors.

@anagainaru (Contributor) left a comment

I would add all these changes in a different file and include them here to keep things clean.

```cpp
}
};

struct MimOp
```
Review comment (Contributor):

typo MinOp

```cpp
cudaMemcpyDeviceToHost);
cudaMemcpy(&max, thrust::raw_pointer_cast(res.second), sizeof(T),
cudaMemcpyDeviceToHost);
min = reduce<T, MimOp>(size, 1024, 64, 1, values);
```
Review comment (Contributor):

Same typo again
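(Editor's note: with the typo fixed, the operator passed to reduce would presumably read as follows. Only the call site is visible in the diff above, so the functor body here is an assumption about intent; the rename itself is what the review comments ask for.)

```cpp
// Sketch of the renamed operator (MimOp -> MinOp); the body is assumed.
struct MinOp
{
    template <typename T>
    __host__ __device__ T operator()(const T &a, const T &b) const
    {
        return a < b ? a : b;
    }
};

// Call site from the diff, with the corrected name:
// min = reduce<T, MinOp>(size, 1024, 64, 1, values);
```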

@JasonRuonanWang merged commit ebc4c8d into ornladios:master on Jan 26, 2022
@vicentebolea (Collaborator)

I am seeing regressions when running ctest --verbose -R CUDA.

(screenshot: failing CUDA test output, 2022-01-26)

And a warning

(screenshot: warning message, 2022-01-26)

Can you replicate this on your machine?

@vicentebolea (Collaborator)

On another topic: I am late to this discussion, but I am against taking this approach, since we are rewriting what is already in Thrust (adding an internal dependency and increasing code complexity). I also question whether this is really a problem for us to solve. The same issue can happen with the standard library/glibc, yet you do not see libraries re-implementing stdlib functionality. At some point we have to draw a line and say that if you want ADIOS2 with MGARD and CUDA, you have to build both of them with the same CUDA SDK version.

Another consideration is the case where we also ship the library in question internally.

vicentebolea added a commit to vicentebolea/ADIOS2 that referenced this pull request Jan 26, 2022
…ibrary-conflict-issue"

This reverts commit ebc4c8d, reversing changes made to f6cf3d8.
@germasch (Contributor)

> On another topic: I am late to this discussion, but I am against taking this approach, since we are rewriting what is already in Thrust (adding an internal dependency and increasing code complexity). I also question whether this is really a problem for us to solve. The same issue can happen with the standard library/glibc, yet you do not see libraries re-implementing stdlib functionality. At some point we have to draw a line and say that if you want ADIOS2 with MGARD and CUDA, you have to build both of them with the same CUDA SDK version.

Is it clear what exactly causes the problem? I fundamentally agree that fixing the problem is preferable to working around it. But I don't really understand where the issue comes from in the first place -- Thrust is a header-only library as far as I know, so it's not that ADIOS2 and MGARD end up linking against actual (possibly incompatible) libraries.
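(Editor's note: for context, judging from the diff excerpt quoted in the review above, the Thrust usage being removed was presumably along these lines. This is a reconstruction with assumed names, not the exact ADIOS2 code; it illustrates the header-only behavior under discussion, i.e., the Thrust/CUB device code is compiled into the including translation unit rather than linked from a separate library.)

```cpp
// Reconstruction of the kind of Thrust-based MinMax the PR removes (names assumed).
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

template <typename T>
void CudaMinMaxThrust(const T *dValues, size_t size, T &min, T &max)
{
    thrust::device_ptr<const T> begin(dValues);
    // minmax_element instantiates Thrust/CUB kernels directly in this
    // translation unit; there is no separate Thrust library to link against.
    auto res = thrust::minmax_element(begin, begin + size);
    cudaMemcpy(&min, thrust::raw_pointer_cast(res.first), sizeof(T), cudaMemcpyDeviceToHost);
    cudaMemcpy(&max, thrust::raw_pointer_cast(res.second), sizeof(T), cudaMemcpyDeviceToHost);
}
```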

@JasonRuonanWang (Member)

@vicentebolea @germasch

We have spent quite a few days on this, and @JieyangChen7 found others complaining about this CUDA bug too. So far the ONLY solution, whether you call it a fix or a workaround, is to not use Thrust in ADIOS2, and we have verified that this works. If anyone can come up with a smarter solution, you are more than welcome to propose it or open a pull request. If not, then we probably have to go this way, as making ADIOS2+MGARD GPU work in production is one of our top-priority goals at the moment, and we have many critical activities pending on it. In the meantime, I don't think the GPU MinMax calculation is used in any production workflows so far, so we have plenty of time to fix it if an issue turns up.

@germasch I don't understand it either, but unfortunately I don't have time to understand everything. The best I can do is to ensure important things work, even if sometimes I don't understand.

@germasch (Contributor)

Do you have some pointer to a discussion of the problem and how it occurs? In particular, is there a way to reproduce it? Does it involve different versions of CUDA, or is it just the two libraries both using Thrust from the same CUDA toolkit?

@JasonRuonanWang (Member) commented on Jan 27, 2022

> Do you have some pointer to a discussion of the problem and how it occurs? In particular, is there a way to reproduce it? Does it involve different versions of CUDA, or is it just the two libraries both using Thrust from the same CUDA toolkit?

@JieyangChen7 can provide details about the issue. It's easy to reproduce: compile ADIOS2 with CUDA enabled, compile the mgard-x branch of https://github.com/JieyangChen7/MGARD.git also with CUDA enabled, and link ADIOS2 with this MGARD installation. You will then be able to reproduce it by running ctest -VV -R MGARD. As soon as you remove everything Thrust-related from ADIOS2, the problem disappears. Have a try if you have time. You will be our hero if you find a more decent solution :)

@JasonRuonanWang (Member)

@germasch P.S. The testbeds we used only have one CUDA Toolkit installation, so it's impossible to mess up CUDA versions.
