"cache-oblivious" MPI chunk subdivision #1492

Closed
stevengj opened this issue Feb 3, 2021 · 3 comments

stevengj commented Feb 3, 2021

In a parallel MPI run with N nodes and M cores per node, for a total of NM processes, we ideally want to divide a simulation so that adjacent chunks are assigned to the same node as much as possible (so that we exploit fast intra-node communication for the chunk boundary conditions).

MPI itself gives us only limited information about the node topology (its facilities for virtual "process topologies" don't expose much). However, mpirun can easily be configured so that consecutive process ranks are assigned within each node. If we do this, then our problem becomes: divide the chunks so that adjacent chunks have nearby ranks as much as possible.
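
As a sanity check on the launcher configuration, one can verify the rank-to-node mapping at runtime with MPI_Comm_split_type (MPI-3), which groups ranks by shared-memory node. The following standalone sketch (not part of Meep; the output format is purely illustrative) reports whether the world ranks on each node form a contiguous block:

```cpp
// node_locality_check.cpp: report whether the MPI ranks sharing each node
// are consecutive in MPI_COMM_WORLD (illustrative standalone sketch).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Split MPI_COMM_WORLD into one communicator per shared-memory node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);

  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  // Gather the world ranks living on this node; if they span a contiguous
  // range, consecutive ranks were indeed packed onto the same node.
  std::vector<int> world_ranks(node_size);
  MPI_Allgather(&world_rank, 1, MPI_INT, world_ranks.data(), 1, MPI_INT,
                node_comm);

  int lo = world_ranks[0], hi = world_ranks[0];
  for (int r : world_ranks) {
    if (r < lo) lo = r;
    if (r > hi) hi = r;
  }

  if (node_rank == 0)
    printf("node holding world ranks [%d, %d]: %s\n", lo, hi,
           (hi - lo + 1 == node_size) ? "consecutive" : "NOT consecutive");

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```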

Stated this way, the problem becomes very similar to maximizing "cache locality" with an unknown cache size M (indeed, the intra-node memory can be thought of as a kind of "cache"). A classic approach to such problems is a cache-oblivious algorithm, which in our case should essentially boil down to partitioning the simulation grid recursively and assigning MPI ranks in depth-first order.
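
Here is a rough 1d sketch of that idea (this is not Meep's actual split_by_cost/split_by_effort code, which splits in 3d and weights the split point by estimated cost): recursively bisect the grid, hand the first half of the ranks to the first half of the grid, and number the leaves depth-first, so that chunks with nearby ranks are spatially adjacent.

```cpp
// Illustrative sketch of cache-oblivious chunk numbering: recursively bisect
// a 1d index range [lo, hi) of grid slices and assign chunk owners in
// depth-first order, so chunks with nearby ranks are spatially adjacent.
#include <cstdio>
#include <vector>

struct Chunk {
  int lo, hi; // half-open slice range covered by this chunk
  int rank;   // position in the consecutive MPI rank ordering
};

static void split_recursive(int lo, int hi, int nprocs, int rank0,
                            std::vector<Chunk> &out) {
  if (nprocs == 1) {
    out.push_back({lo, hi, rank0});
    return;
  }
  // Bisect the processes (and, proportionally, the grid): the left half of
  // the grid gets the first half of the ranks, depth-first.
  int nleft = nprocs / 2;
  int mid = lo + (int)((long long)(hi - lo) * nleft / nprocs);
  split_recursive(lo, mid, nleft, rank0, out);
  split_recursive(mid, hi, nprocs - nleft, rank0 + nleft, out);
}

int main() {
  std::vector<Chunk> chunks;
  split_recursive(0, 100, 8, 0, chunks); // 100 slices over 8 processes
  for (const Chunk &c : chunks)
    printf("rank %d owns slices [%d, %d)\n", c.rank, c.lo, c.hi);
  return 0;
}
```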

Fortunately, we already do such recursive partitioning in the split_by_cost algorithm, and the split_by_effort algorithm is similarly recursive, although it is a bit weirder.

It would be good to go through this more carefully and determine whether there is anything to improve here, or whether we should simply document that MPI runs should be set up with consecutive ranks within shared-memory nodes.

stevengj commented Feb 3, 2021

Note how MPI ranks are assigned to chunks: first we split the grid volume into an array of chunks, and then we assign processes to those chunks consecutively. So what matters is the ordering of the chunks returned by choose_chunkdivision.
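
Schematically, the consecutive assignment looks something like the sketch below (the chunk_owner helper is hypothetical, purely to illustrate the ownership rule, not Meep's exact code): consecutive chunks map to the same or adjacent ranks, which is exactly why the ordering returned by choose_chunkdivision determines the locality.

```cpp
// Schematic of "assign processes consecutively to chunks": chunk i, in the
// order returned by the chunk division, is owned by a block of consecutive
// ranks (one rank per chunk when num_chunks == num_procs).
#include <cstdio>

// Hypothetical helper: owner rank of chunk i when num_chunks chunks are
// distributed over num_procs consecutive ranks.
static int chunk_owner(int i, int num_chunks, int num_procs) {
  return (int)((long long)i * num_procs / num_chunks);
}

int main() {
  const int num_chunks = 8, num_procs = 4;
  for (int i = 0; i < num_chunks; ++i)
    printf("chunk %d -> rank %d\n", i, chunk_owner(i, num_chunks, num_procs));
  return 0;
}
```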

stevengj commented Feb 3, 2021

Of course, it's not clear how much this matters, since the performance should presumably be limited by the slowest link, i.e. by the inter-node communications (which will always be present to some extent no matter how we order the nodes).

We could check this by randomly shuffling the chunk ordering and seeing how performance varies.
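
For instance, the experiment could apply a random permutation to the chunk list before ranks are assigned; a sketch of that bookkeeping is below (Meep has no built-in shuffle option, so in practice this means permuting the list produced by choose_chunkdivision, and timing the actual simulation is the part that matters):

```cpp
// Sketch of the proposed experiment: compare the default consecutive chunk
// ordering against a random permutation of the same chunks.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const int num_chunks = 8;

  // Default ordering: chunk i keeps position i (ranks assigned consecutively).
  std::vector<int> order(num_chunks);
  std::iota(order.begin(), order.end(), 0);

  // Shuffled ordering: destroys the locality of the depth-first numbering.
  std::vector<int> shuffled = order;
  std::mt19937 rng(12345); // fixed seed so the comparison is reproducible
  std::shuffle(shuffled.begin(), shuffled.end(), rng);

  for (int i = 0; i < num_chunks; ++i)
    printf("position %d: default chunk %d, shuffled chunk %d\n", i, order[i],
           shuffled[i]);
  return 0;
}
```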

stevengj commented

Looking closely at split_by_effort, it is exactly a depth-first algorithm, so it already corresponds to the cache-oblivious approach.
