Volume error when running with CUDA and MPI #11
Comments
@koparasy do you always see the same cell as the problem, or does that change from run to run? Some of the CUDA versions have an unidentified race condition. I believe the fix, since no one was able to track it down, was to synchronize after each kernel. Note that this code was developed by NVIDIA and is not officially maintained. I will reach out to them to see what the fix was and whether they can provide anything.
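For illustration, here is a minimal sketch of the workaround described above, i.e. forcing a device-wide synchronization after every kernel launch so no later kernel can race with an unfinished one. The kernel name, launch configuration, and helper below are hypothetical and are not taken from the LULESH CUDA source.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical element-update kernel standing in for a real LULESH kernel.
__global__ void UpdateVolumesKernel(double *vol, int numElem)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElem)
        vol[i] *= 1.0;  // placeholder for the real per-element update
}

// Wait for all preceding device work to finish and abort on any error.
static void CheckedSync(const char *where)
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error after %s: %s\n", where, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

void UpdateVolumes(double *d_vol, int numElem)
{
    const int block = 128;
    const int grid  = (numElem + block - 1) / block;
    UpdateVolumesKernel<<<grid, block>>>(d_vol, numElem);
    CheckedSync("UpdateVolumesKernel");  // workaround: synchronize after each kernel
}
```

The extra synchronization serializes kernel launches and costs some performance, but it removes the ordering hazard until a proper fix is found.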
@ikarlin, no, the cell id as well as the iteration number change on different executions.
@koparasy thanks. I have confirmed with NVIDIA that this is the known race condition. We are discussing the best way to get the fix into the code. Do you have a timeline you need this done on? That might influence our choice.
I'm having the same issue. Is the race condition solved now? |
I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2).
I am using OpenMPI 3.0.0 with CUDA 9.1. I execute the following command:
mpirun -n 27 ./lulesh -s 60
and I get the following error:
Rank 22: Volume Error in cell 211619 at iteration 14
The error appears at a different iteration on each execution.
Any idea what is causing this error?
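For context, the message in the output comes from an element-volume sanity check: when any element's volume becomes non-positive, the run aborts with a volume-error code. The sketch below is a simplified paraphrase of that check, not a verbatim copy of the CUDA port; the function and variable names are illustrative.

```cuda
#include <cstdio>
#include <mpi.h>

enum { VolumeError = -1 };  // illustrative error code

// Simplified paraphrase: abort the run if any element volume is non-positive.
// With the race condition discussed above, corrupted data can make this fire
// at an unpredictable cell and iteration.
void CheckVolumes(const double *vol, int numElem, int rank, int iteration)
{
    for (int k = 0; k < numElem; ++k) {
        if (vol[k] <= 0.0) {
            fprintf(stderr, "Rank %d: Volume Error in cell %d at iteration %d\n",
                    rank, k, iteration);
            MPI_Abort(MPI_COMM_WORLD, VolumeError);
        }
    }
}
```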