MPI bug when multiple GPUs are used per calculation #1130
Comments
I'm going to do a few more runs this week to check reproducibility. I'll post the results below.
I ran the same system as above on shorter runs (500 iterations instead of 1500) with 1 GPU or 2 GPUs. The 2-GPU run does not appear to be mixing properly. See the data in the attached files.
I can confirm I can reproduce this. I'll dig into what could be happening as soon as I have time.
Great! This should be a high-priority bug.
I haven't had much time to work on this due to school, but one thing I've noticed is that the log files report approximately the same amount of mixing for the single-GPU and multi-GPU runs.
Thanks! Apologies, I haven't been able to look into this yet; other projects are keeping me busy. This is my next YANK-related thing to look at as soon as I can spare some time.
Since we moved the
We believe this is addressed (see choderalab/openmmtools#449 (comment)), though we are unclear what resolved the issue.
When running a repex job for 2 systems on separate nodes with 2 GPUs per system, the states appear to mix unevenly depending on which GPU a replica runs on. All of the even-numbered replicas (GPU 0) appear to be mixing decently, but the odd-numbered replicas (GPU 1) do not appear to be mixing properly. The input files are below. The output files are at /data/chodera/hgupta/repex_flatbottom_nal_md2/no_restraint/experiments. I've also attached a script that reads the total number of states visited by each replica, and a sample output for the data at /data/chodera/hgupta/repex_flatbottom_nal_md2/no_restraint/experiments/experiment-neg-MD2.
input_files.zip
count_states.zip
environment.yml.zip
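For readers without access to the attachments, here is a minimal sketch of the kind of per-replica count described above. It is not the attached count_states script. It assumes the replica-to-state assignments have already been extracted into a plain table where row i, column r holds the state index of replica r at iteration i. The function names and the toy data are hypothetical, and the even/odd split reflects only the GPU assignment reported in this issue.

```python
from typing import List, Sequence, Tuple


def count_states_visited(replica_states: Sequence[Sequence[int]]) -> List[int]:
    """Return how many distinct states each replica visited.

    replica_states[i][r] is the state index held by replica r at iteration i
    (an assumed layout, not necessarily the one used by the real output files).
    """
    if not replica_states:
        return []
    n_replicas = len(replica_states[0])
    visited = [set() for _ in range(n_replicas)]
    for row in replica_states:
        for replica, state in enumerate(row):
            visited[replica].add(state)
    return [len(states) for states in visited]


def mean_by_parity(counts: Sequence[int]) -> Tuple[float, float]:
    """Average the counts over even-numbered and odd-numbered replicas.

    With 2 GPUs, this issue reports even replicas on GPU 0 and odd on GPU 1.
    """
    even = counts[0::2]
    odd = counts[1::2]
    mean_even = sum(even) / len(even) if even else 0.0
    mean_odd = sum(odd) / len(odd) if odd else 0.0
    return mean_even, mean_odd


# Hypothetical toy data: replicas 0 and 2 swap states, replicas 1 and 3 never move.
toy_states = [
    [0, 1, 2, 3],
    [2, 1, 0, 3],
    [0, 1, 2, 3],
    [2, 1, 0, 3],
]
counts = count_states_visited(toy_states)
print("states visited per replica:", counts)
print("mean (even, odd):", mean_by_parity(counts))
```

A large gap between the even and odd averages on real data would be consistent with the GPU-dependent mixing described here, though it does not by itself identify the cause.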