In a parallel MPI run with N nodes and M cores per node, for a total of NM processes, we ideally want to divide a simulation so that adjacent chunks are assigned to a single node as much as possible (so that we exploit fast intra-node communication for the chunk boundary conditions).
MPI provides us only limited information about this (with some very limited facilities for virtual "process topologies"). However, `mpirun` can easily be configured to ensure that consecutive process ranks are assigned within each node. If we do this, then our problem becomes: divide the chunks so that adjacent chunks have nearby ranks as much as possible.
Stated this way, the problem becomes very similar to maximizing "cache locality" with an unknown cache size M (and indeed, the intra-node memory can be thought of as a kind of "cache"). A classic approach to this problem is a cache-oblivious algorithm. In our case, that should essentially boil down to partitioning the simulation grid recursively, assigning MPI ranks in depth-first order.
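For concreteness, here is a minimal sketch (not the actual `split_by_cost` implementation) of what such a recursive partition with depth-first rank numbering could look like; the `Box` struct and `split_recursive` function are hypothetical stand-ins for the real grid/chunk data structures:

```cpp
// Sketch only: recursively bisect a 2D box of grid cells into nchunks pieces
// and record the pieces in the order the recursion visits them (depth-first).
#include <cstdio>
#include <vector>

struct Box { int x0, x1, y0, y1; };  // hypothetical chunk bounds (cell indices)

static void split_recursive(const Box &b, int nchunks, std::vector<Box> &out) {
    if (nchunks == 1) { out.push_back(b); return; }
    int nleft = nchunks / 2, nright = nchunks - nleft;
    int wx = b.x1 - b.x0, wy = b.y1 - b.y0;
    Box left = b, right = b;
    if (wx >= wy) {                              // cut the longer axis,
        int cut = b.x0 + wx * nleft / nchunks;   // proportionally to the split
        left.x1 = cut; right.x0 = cut;
    } else {
        int cut = b.y0 + wy * nleft / nchunks;
        left.y1 = cut; right.y0 = cut;
    }
    split_recursive(left, nleft, out);           // depth-first: finish the left
    split_recursive(right, nright, out);         // subtree before the right one
}

int main() {
    std::vector<Box> chunks;
    split_recursive(Box{0, 64, 0, 64}, 12, chunks);
    // Rank i simply owns chunks[i], i.e. ranks are assigned in depth-first order.
    for (size_t rank = 0; rank < chunks.size(); ++rank)
        std::printf("rank %zu: [%d,%d) x [%d,%d)\n", rank,
                    chunks[rank].x0, chunks[rank].x1,
                    chunks[rank].y0, chunks[rank].y1);
}
```

Because each recursive half receives a contiguous block of ranks, spatially adjacent chunks tend to get nearby ranks at every scale, regardless of the (unknown) number of cores per node, which is exactly the cache-oblivious property we want.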
Fortunately, we already do such recursive partitioning in the `split_by_cost` algorithm, and the `split_by_effort` algorithm is similarly recursive, although its logic is a bit more convoluted.
It would be good to go through this more carefully and determine whether there is anything to improve here, or whether we should simply document that MPI runs should be set up with consecutive ranks within shared-memory nodes.
Note that MPI ranks are assigned to chunks here: first we split the grid volume into an array of chunks, and then assign processes to those chunks consecutively. So what matters is the ordering of the chunks returned by `choose_chunkdivision`.
Of course, it's not clear how much this matters, since the performance should presumably be limited by the slowest link, i.e. by the inter-node communications (which will always be present to some extent no matter how we order the chunks).
We could check this by randomly shuffling the chunk ordering and seeing how performance varies.
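As a rough first check that doesn't require an actual MPI cluster, one could compare a simple locality proxy — how many face-adjacent chunk pairs end up on the same node for a given cores-per-node M — between the natural ordering and a few random shuffles. This is only a sketch under assumed parameters (a hypothetical 8×8 grid of chunks, 8 ranks per node), not a substitute for real wall-clock timings:

```cpp
// Locality proxy (sketch, not a real benchmark): given a rank assignment for
// each chunk and a node size M, count how many face-adjacent chunk pairs land
// on the same node (i.e. rank / M is equal).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Adjacency on a hypothetical nx-by-ny grid of chunks, chunk id = iy*nx + ix.
static int same_node_pairs(const std::vector<int> &rank_of_chunk,
                           int nx, int ny, int M) {
    int count = 0;
    for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix) {
            int c = iy * nx + ix;
            if (ix + 1 < nx && rank_of_chunk[c] / M == rank_of_chunk[c + 1] / M)
                ++count;   // horizontal neighbor on the same node
            if (iy + 1 < ny && rank_of_chunk[c] / M == rank_of_chunk[c + nx] / M)
                ++count;   // vertical neighbor on the same node
        }
    return count;
}

int main() {
    const int nx = 8, ny = 8, M = 8;            // 64 chunks, 8 ranks per node
    std::vector<int> order(nx * ny);
    std::iota(order.begin(), order.end(), 0);   // natural (row-major) ordering
    std::printf("row-major: %d same-node faces\n",
                same_node_pairs(order, nx, ny, M));
    std::mt19937 rng(0);
    for (int trial = 0; trial < 3; ++trial) {
        std::shuffle(order.begin(), order.end(), rng);  // random chunk ordering
        std::printf("shuffle %d: %d same-node faces\n", trial,
                    same_node_pairs(order, nx, ny, M));
    }
}
```

On this toy example the row-major ordering keeps every within-row neighbor pair on the same node, while random shuffles keep far fewer; actual timings of shuffled runs would show how much of that difference survives once inter-node bandwidth is the bottleneck.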