-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor sync_chunk_connections to remove barrier during fields::step #1635
Conversation
I ran some benchmarks comparing this branch and master for a test problem based on a 3d OLED simulation using a single-machine with Intel Xeon 2.0 GHz processors with 10, 20, and 40 MPI processes. The benchmark involved time stepping the fields for some number of timesteps after the pulsed source had turned off to ensure that the fields arrays contain non-zero values. The timers were reset after the first time step to remove the time for the array initialization, etc. from the timing statistics. The results shown below are for the test case involving 10 processes and the results for 20 and 40 processes are similar. There are two things to note from these results:
As a result, there does not seem to be any noticeable performance gains using this branch which is unexpected. |
It could be that to see a significant timing impact from a barrier you need to be on a large distributed-memory cluster? It could be that the only reason the all-to-all communications (the barrier) are taking 12 seconds is due to imperfect load-balancing, i.e. waiting for the slowest process to catch up. In that case, it makes total sense that the barrier time is shifted to the point-to-point communications, because ultimately all the processors are synchronized by this anyway. (I thought you guys did some profiling which was why you were complaining about this barrier?) |
We do see an improvement in our profiles. With the changes in this PR, cycles spent on connecting chunks effectively disappear. |
Are you sure that those cycles aren't simply shifting to someplace else, as in Ardavan's test? |
Given the noise in these benchmarks and the small fraction of <1% of cycles before the PR, I cannot determine that with a reasonable investment of effort. |
I'll go ahead and merge this, since it can't hurt to eliminate an |
Fixes #1632.