In a parallel MPI run with N nodes and M cores per node, for a total of NM processes, we ideally want to divide a simulation so that adjacent chunks are assigned to a single node as much as possible (so that we exploit fast intra-node communication for the chunk boundary conditions).
MPI provides us only limited information about this (with some very limited facilities for virtual "process topologies"). However, `mpirun` can easily be configured to ensure that consecutive process ranks are assigned within each node. If we do this, then our problem becomes: divide the chunks so that adjacent chunks have nearby ranks as much as possible.
Stated this way, the problem becomes very similar to maximizing "cache locality" with an unknown cache size M (and indeed, the intra-node memory can be thought of as a kind of "cache"). A classic approach to this problem is a cache-oblivious algorithm. In our case, that should essentially boil down to partitioning the simulation grid recursively, assigning MPI ranks in depth-first order.
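For concreteness, here is a minimal sketch (not the actual `split_by_cost` implementation) of what such a recursive partition with depth-first rank numbering could look like; the `Box` struct and `split_recursive` function are hypothetical stand-ins for the real grid/chunk data structures:

```cpp
// Sketch only: recursively bisect a 2D box of grid cells into nchunks pieces
// and record the pieces in the order the recursion visits them (depth-first).
#include <cstdio>
#include <vector>

struct Box { int x0, x1, y0, y1; };  // hypothetical chunk bounds (cell indices)

static void split_recursive(const Box &b, int nchunks, std::vector<Box> &out) {
    if (nchunks == 1) { out.push_back(b); return; }
    int nleft = nchunks / 2, nright = nchunks - nleft;
    int wx = b.x1 - b.x0, wy = b.y1 - b.y0;
    Box left = b, right = b;
    if (wx >= wy) {                              // cut the longer axis,
        int cut = b.x0 + wx * nleft / nchunks;   // proportionally to the split
        left.x1 = cut; right.x0 = cut;
    } else {
        int cut = b.y0 + wy * nleft / nchunks;
        left.y1 = cut; right.y0 = cut;
    }
    split_recursive(left, nleft, out);           // depth-first: finish the left
    split_recursive(right, nright, out);         // subtree before the right one
}

int main() {
    std::vector<Box> chunks;
    split_recursive(Box{0, 64, 0, 64}, 12, chunks);
    // Rank i simply owns chunks[i], i.e. ranks are assigned in depth-first order.
    for (size_t rank = 0; rank < chunks.size(); ++rank)
        std::printf("rank %zu: [%d,%d) x [%d,%d)\n", rank,
                    chunks[rank].x0, chunks[rank].x1,
                    chunks[rank].y0, chunks[rank].y1);
}
```

Because each recursive half receives a contiguous block of ranks, spatially adjacent chunks tend to get nearby ranks at every scale, regardless of the (unknown) number of cores per node, which is exactly the cache-oblivious property we want.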
Fortunately, we already do such recursive partitioning in the `split_by_cost` algorithm, and the `split_by_effort` algorithm is similarly recursive, although its logic is a bit more convoluted.
It would be good to go through this more carefully and determine whether there is anything to improve here, or whether we should simply document that MPI runs should be set up with consecutive ranks within shared-memory nodes.
Note that MPI ranks are assigned to chunks here: first we split the grid volume into an array of chunks, and then assign processes to those chunks consecutively. So what matters is the ordering of the chunks returned by `choose_chunkdivision`.
Of course, it's not clear how much this matters, since the performance should presumably be limited by the slowest link, i.e. by the inter-node communications (which will always be present to some extent no matter how we order the chunks).
We could check this by randomly shuffling the chunk ordering and seeing how performance varies.
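As a rough first check that doesn't require an actual MPI cluster, one could compare a simple locality proxy — how many face-adjacent chunk pairs end up on the same node for a given cores-per-node M — between the natural ordering and a few random shuffles. This is only a sketch under assumed parameters (a hypothetical 8×8 grid of chunks, 8 ranks per node), not a substitute for real wall-clock timings:

```cpp
// Locality proxy (sketch, not a real benchmark): given a rank assignment for
// each chunk and a node size M, count how many face-adjacent chunk pairs land
// on the same node (i.e. rank / M is equal).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Adjacency on a hypothetical nx-by-ny grid of chunks, chunk id = iy*nx + ix.
static int same_node_pairs(const std::vector<int> &rank_of_chunk,
                           int nx, int ny, int M) {
    int count = 0;
    for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix) {
            int c = iy * nx + ix;
            if (ix + 1 < nx && rank_of_chunk[c] / M == rank_of_chunk[c + 1] / M)
                ++count;   // horizontal neighbor on the same node
            if (iy + 1 < ny && rank_of_chunk[c] / M == rank_of_chunk[c + nx] / M)
                ++count;   // vertical neighbor on the same node
        }
    return count;
}

int main() {
    const int nx = 8, ny = 8, M = 8;            // 64 chunks, 8 ranks per node
    std::vector<int> order(nx * ny);
    std::iota(order.begin(), order.end(), 0);   // natural (row-major) ordering
    std::printf("row-major: %d same-node faces\n",
                same_node_pairs(order, nx, ny, M));
    std::mt19937 rng(0);
    for (int trial = 0; trial < 3; ++trial) {
        std::shuffle(order.begin(), order.end(), rng);  // random chunk ordering
        std::printf("shuffle %d: %d same-node faces\n", trial,
                    same_node_pairs(order, nx, ny, M));
    }
}
```

On this toy example the row-major ordering keeps every within-row neighbor pair on the same node, while random shuffles keep far fewer; actual timings of shuffled runs would show how much of that difference survives once inter-node bandwidth is the bottleneck.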