
refactor sync_chunk_connections to remove barrier during fields::step #1635

Merged: 1 commit into master from sync_chunk_connections on Jun 30, 2021

Conversation

stevengj
Collaborator

Fixes #1632.
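For context, here is a minimal sketch of the kind of per-timestep check this PR removes and the flag-based alternative. This is not the actual Meep source; `fields_sketch`, `chunk_connections_valid`, and the helper methods are illustrative placeholders, not Meep's real API.

```cpp
// Sketch only -- not the actual Meep code.  Names such as
// chunk_connections_valid and connect_chunks() are placeholders used to
// illustrate the idea behind this PR.
#include <mpi.h>

struct fields_sketch {
  bool chunk_connections_valid = false; // purely local flag

  // BEFORE: every timestep asks "did *any* process invalidate its chunk
  // connections?" via an Allreduce, which also acts as a global barrier.
  void step_with_barrier() {
    int local_changed = chunk_connections_valid ? 0 : 1;
    int any_changed = 0;
    MPI_Allreduce(&local_changed, &any_changed, 1, MPI_INT, MPI_LOR,
                  MPI_COMM_WORLD);
    if (any_changed) connect_chunks(); // collective
    step_fields();
  }

  // AFTER: the operations that can invalidate the connections (changing the
  // structure, boundary conditions, etc.) are already collective, so every
  // process clears the flag at the same logical time.  step() can then
  // consult the local flag and skip the per-timestep Allreduce entirely.
  void step_without_barrier() {
    if (!chunk_connections_valid) connect_chunks(); // still collective
    step_fields();
  }

  void invalidate_chunk_connections() { chunk_connections_valid = false; }
  void connect_chunks() { chunk_connections_valid = true; /* ... */ }
  void step_fields() { /* ... time-step the fields ... */ }
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  fields_sketch f;
  f.invalidate_chunk_connections();
  for (int t = 0; t < 10; ++t) f.step_without_barrier();
  MPI_Finalize();
}
```

The key point is that the Allreduce in the first variant runs on every timestep even when nothing has changed, and that is the implicit barrier this PR eliminates.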

@stevengj requested a review from oskooi on June 24, 2021 at 21:48
@oskooi
Collaborator

oskooi commented Jun 25, 2021

I ran some benchmarks comparing this branch against master for a test problem based on a 3d OLED simulation, on a single machine with Intel Xeon 2.0 GHz processors using 10, 20, and 40 MPI processes. The benchmark time-stepped the fields for a fixed number of timesteps after the pulsed source had turned off, to ensure that the field arrays contain non-zero values. The timers were reset after the first timestep to exclude array initialization, etc. from the timing statistics. The results shown below are for the 10-process case; the results for 20 and 40 processes are similar.

There are two things to note from these results:

  1. first figure: there is almost no change in the time-stepping rate or in the total runtime of the simulation.
  2. second figure: the timing statistics for this branch (labelled "PR #1635") show that the time for all-to-all communication is zero for all processes, but the time for point-to-point communication has increased. This is probably just an artifact of how the timing statistics are attributed, but the key point is that the total MPI time does not change relative to master.

As a result, there do not appear to be any noticeable performance gains from this branch, which is unexpected.

[Figure: seconds per timestep vs. elapsed time, 10 processes (np10_sec_per_timestep_vs_time)]

[Figure: bar plot of MPI timing statistics per process, 10 processes (barplot_np10)]
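A measurement of this kind might look roughly like the following sketch. This is not the actual benchmark script: `step()` stands in for a Meep timestep, and the OLED geometry setup is omitted.

```cpp
// Rough sketch of a seconds-per-timestep measurement (not the actual
// benchmark).  step() is a placeholder for one Meep timestep.
#include <cstdio>
#include <mpi.h>

void step() { /* ... one Meep timestep ... */ }

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  const int nsteps = 1000;

  step(); // the first step pays for array initialization, connecting chunks, etc.

  // "Reset the timers" after the first step by starting the clock here.
  MPI_Barrier(MPI_COMM_WORLD); // align processes before timing starts
  double t0 = MPI_Wtime();
  for (int i = 0; i < nsteps; ++i) step();
  double elapsed = MPI_Wtime() - t0;

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    std::printf("%.6f s/timestep\n", elapsed / nsteps);

  MPI_Finalize();
}
```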

@stevengj
Collaborator Author

stevengj commented Jun 25, 2021

It could be that to see a significant timing impact from a barrier you need to be on a large distributed-memory cluster?

It could be that the only reason the all-to-all communication (the barrier) takes 12 seconds is imperfect load balancing, i.e. waiting for the slowest process to catch up. In that case, it makes total sense that the barrier time simply shifts to the point-to-point communications, because ultimately all of the processes are synchronized by those anyway.
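A toy example of this accounting argument (not Meep code; just two ranks with an artificial imbalance, where the fast rank's wait time shows up either in the barrier or in the point-to-point exchange, but the total is the same):

```cpp
// Toy illustration: with imbalanced work, removing the barrier moves the
// fast rank's waiting time into the point-to-point exchange.  Run with
// 2 MPI processes; compile with -DWITH_BARRIER to enable the barrier.
#include <cstdio>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Imbalanced "work": rank 1 takes longer than rank 0.
  usleep(rank == 0 ? 100000 : 300000); // 0.1 s vs 0.3 s

#ifdef WITH_BARRIER
  // With a barrier, rank 0 spends ~0.2 s waiting here.
  double t0 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  double waited_in_barrier = MPI_Wtime() - t0;
#else
  double waited_in_barrier = 0.0;
#endif

  // Point-to-point boundary exchange: without the barrier, rank 0 now
  // spends the same ~0.2 s waiting inside MPI_Sendrecv instead.
  double t1 = MPI_Wtime();
  char sendbuf = 0, recvbuf = 0;
  MPI_Sendrecv(&sendbuf, 1, MPI_CHAR, 1 - rank, 0,
               &recvbuf, 1, MPI_CHAR, 1 - rank, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  double waited_p2p = MPI_Wtime() - t1;

  std::printf("rank %d: barrier wait %.2f s, p2p wait %.2f s\n",
              rank, waited_in_barrier, waited_p2p);
  MPI_Finalize();
}
```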

(I thought you guys did some profiling which was why you were complaining about this barrier?)

@ahoenselaar
Contributor

We do see an improvement in our profiles. With the changes in this PR, cycles spent on connecting chunks effectively disappear.

@stevengj
Collaborator Author

> We do see an improvement in our profiles. With the changes in this PR, cycles spent on connecting chunks effectively disappear.

Are you sure that those cycles aren't simply shifting to someplace else, as in Ardavan's test?

@ahoenselaar
Contributor

Given the noise in these benchmarks and the small fraction of cycles involved (<1% before this PR), I cannot determine that with a reasonable investment of effort.

@stevengj
Collaborator Author

I'll go ahead and merge this, since it can't hurt to eliminate an Allreduce on every timestep. I'm not convinced that we're actually seeing a net performance benefit (as opposed to an accounting change), however.

@stevengj merged commit ff1b75f into master on Jun 30, 2021
@stevengj deleted the sync_chunk_connections branch on June 30, 2021 at 14:06
Linked issue (closed by this merge): remove barrier from fields::connect_chunks (#1632)