Volume error When running with cuda and mpi #11

Open
koparasy opened this issue Mar 21, 2019 · 4 comments

Comments

@koparasy
Member

I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2).
I am using openmpi-3.0.0 with CUDA 9.1. I execute the following command:
mpirun -n 27 ./lulesh -s 60
and I get the following error:
Rank 22: Volume Error in cell 211619 at iteration 14
The error appears at a different iteration on each execution.
Any idea what is causing this error?

@ikarlin
Collaborator

ikarlin commented Mar 27, 2019

@koparasy do you always see the same cell as the problem, or does that change from run to run? Some of the CUDA versions have an unidentified race condition. Since no one was able to find it, I believe the fix was to synchronize after each kernel.

Note this code was developed by Nvidia and is not officially maintained. I will reach out to them and see what the fix was and if they can provide anything.
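
The "synchronize after each kernel" workaround mentioned above could look roughly like the sketch below. This is only an illustration of the general pattern, not the actual Nvidia fix, which is not shown in this thread. The kernel `scale_volumes`, the helper `sync_after_kernel`, and the macro `CUDA_CHECK` are hypothetical names that do not come from the LULESH source; only the CUDA runtime calls (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaMalloc`, `cudaMemset`, `cudaFree`) are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Hypothetical stand-in kernel (assumption: not part of LULESH).
__global__ void scale_volumes(double* vol, double factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) vol[i] *= factor;
}

// Report any launch error, then block the host until all previously
// issued device work has finished.
static void sync_after_kernel() {
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaDeviceSynchronize());
}

int main() {
  const int n = 1 << 20;
  double* d_vol = nullptr;
  CUDA_CHECK(cudaMalloc(&d_vol, n * sizeof(double)));
  CUDA_CHECK(cudaMemset(d_vol, 0, n * sizeof(double)));

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Each launch is followed by a full device synchronization.
  scale_volumes<<<blocks, threads>>>(d_vol, 2.0, n);
  sync_after_kernel();
  scale_volumes<<<blocks, threads>>>(d_vol, 0.5, n);
  sync_after_kernel();

  CUDA_CHECK(cudaFree(d_vol));
  std::printf("done\n");
  return 0;
}
```

Synchronizing after every launch serializes host and device, so it costs performance. It is a blunt way to rule out ordering problems between kernels in different streams, or between kernels and host-side work such as MPI communication that touches device buffers. Whether either of those is the source of the race in this code is not established here.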

@koparasy
Member Author

@ikarlin, no, both the cell ID and the iteration number change between executions.

@ikarlin
Collaborator

ikarlin commented Mar 28, 2019

@koparasy thanks. I have confirmed with Nvidia that this is the known race condition. We are discussing the best way to get the fix into the code. Do you have a deadline you need this done by? That might influence our choice.

@HenryYihengXu

I'm having the same issue. Is the race condition solved now?


3 participants