More options on substreams #1561

Closed
jychoi-hpc opened this issue Jun 27, 2019 · 7 comments

@jychoi-hpc
Member

This is a feature request on aggregation (substreams).

I am wondering if we can add an option to set a stride for aggregation (say, streamstride=Y).

Currently, if I run N processes and set substreams to X, ranks 0 through N/X-1 are aggregated together, and so on in consecutive blocks. I would like to have an option to aggregate on every Y-th process instead (i.e., ranks 0, Y, 2*Y, etc.).

This will be helpful on Summit (and with SSDs). In particular, XGC reorders ranks, so it is currently impossible to write XGC restart data using one aggregator per node.

Any comments or suggestions would be appreciated.

@williamfgc
Contributor

williamfgc commented Jun 27, 2019

@jychoi-hpc correct me if I have misunderstood the request: would substreams=N/stride be what you're looking for?

@chuckatkins
Contributor

I think it would be super useful if we could even generalize this a little more and have different grouping strategies. Something similar to the process distribution option that job schedulers have. We could allow something like:

  • block - consecutive grouping like is done now (0,1,2,3)(4,5,6,7)(8,9,10,11)
  • cyclic - round robin style (0,3,6,9)(1,4,7,10)(2,5,8,11)
  • plane=N - a mix of the two, so say N=2 then (0,1,6,7)(2,3,8,9)(4,5,10,11)
  • random - self explanatory

Implementing one extra strategy would likely not be much different from implementing all of them. Worth considering, I think.
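
For illustration, the grouping strategies listed above can be sketched in plain Python. This is only a sketch under stated assumptions: `group_ranks` and its parameters are hypothetical names, not part of ADIOS2, and the sketch assumes the number of ranks divides evenly into the number of groups.

```python
import random


def group_ranks(nranks, ngroups, strategy="block", plane=1, seed=None):
    """Assign ranks 0..nranks-1 to ngroups groups.

    Hypothetical helper for illustration only; not an ADIOS2 function.
    Returns a list of tuples, one tuple of ranks per group.
    """
    if nranks % ngroups != 0:
        raise ValueError("this sketch assumes nranks is a multiple of ngroups")
    size = nranks // ngroups

    if strategy == "block":
        # Consecutive ranks share a group.
        color = lambda r: r // size
    elif strategy == "cyclic":
        # Round robin: rank r goes to group r mod ngroups.
        color = lambda r: r % ngroups
    elif strategy == "plane":
        # Chunks of `plane` consecutive ranks are dealt out round robin.
        color = lambda r: (r // plane) % ngroups
    elif strategy == "random":
        # Shuffle the ranks, then cut the shuffled order into equal blocks.
        order = list(range(nranks))
        random.Random(seed).shuffle(order)
        position = {rank: i for i, rank in enumerate(order)}
        color = lambda r: position[r] // size
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    groups = [[] for _ in range(ngroups)]
    for r in range(nranks):
        groups[color(r)].append(r)
    return [tuple(g) for g in groups]


if __name__ == "__main__":
    for name in ("block", "cyclic", "plane"):
        print(name, group_ranks(12, 3, name, plane=2))
```

With 12 ranks and 3 groups, the block, cyclic, and plane=2 cases reproduce the groupings given in the list above. Each strategy reduces to a different per-rank "color", which is why supporting several of them costs little more than supporting one.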

@jychoi-hpc
Member Author

I like what @chuckatkins suggested.

Here is a simple example of when to use block and cyclic. Each case below shows a layout of ranks over 3 nodes (6 MPI processes per node).

Case A
node1:  0  1  2  3  4  5
node2:  6  7  8  9 10 11
node3: 12 13 14 15 16 17

Case B
node1:  0  3  6  9 12 15
node2:  1  4  7 10 13 16
node3:  2  5  8 11 14 17

To use one aggregator per node, one can use the block approach for Case A and the cyclic approach for Case B.
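
The claim for Case A and Case B can be checked with a short Python sketch. The function names here are hypothetical and only model the block and cyclic rules described in this thread. The check confirms that every node's ranks fall into exactly one aggregation group and that no two nodes share a group.

```python
NODES = 3
RANKS_PER_NODE = 6
NRANKS = NODES * RANKS_PER_NODE

# Rank layouts copied from Case A and Case B above (one tuple per node).
case_a = [tuple(range(n * RANKS_PER_NODE, (n + 1) * RANKS_PER_NODE))
          for n in range(NODES)]
case_b = [tuple(range(n, NRANKS, NODES)) for n in range(NODES)]


def block_color(rank, nranks, ngroups):
    """Consecutive ranks share a group (hypothetical helper)."""
    return rank // (nranks // ngroups)


def cyclic_color(rank, nranks, ngroups):
    """Ranks are dealt to groups round robin (hypothetical helper)."""
    return rank % ngroups


def groups_match_nodes(layout, color_fn):
    """True if each node maps to exactly one group and groups are distinct."""
    nranks = sum(len(node) for node in layout)
    ngroups = len(layout)
    node_colors = []
    for node in layout:
        colors = {color_fn(r, nranks, ngroups) for r in node}
        if len(colors) != 1:
            return False  # this node's ranks are spread over several groups
        node_colors.append(colors.pop())
    return len(set(node_colors)) == ngroups


if __name__ == "__main__":
    print("Case A, block: ", groups_match_nodes(case_a, block_color))
    print("Case A, cyclic:", groups_match_nodes(case_a, cyclic_color))
    print("Case B, block: ", groups_match_nodes(case_b, block_color))
    print("Case B, cyclic:", groups_match_nodes(case_b, cyclic_color))
```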

@germasch
Contributor

One potential option that would be very flexible might be to allow the application to pass in the split communicator itself, in addition to the default behavior that does the MPI_Comm_split internally in MPIAggregator.
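
To illustrate what an application-supplied split would express, here is a pure-Python model of the grouping semantics of MPI_Comm_split: ranks that pass the same color end up in the same new communicator, ordered by key, with ties broken by the old rank. `simulate_comm_split` is a hypothetical name, no MPI library is used, and nothing here reflects how MPIAggregator is actually implemented.

```python
def simulate_comm_split(colors, keys=None):
    """Model the grouping done by MPI_Comm_split (illustration only).

    colors[r] and keys[r] are the values old rank r would pass.
    Returns {color: [old ranks in new-rank order]}.
    """
    if keys is None:
        keys = [0] * len(colors)
    if len(keys) != len(colors):
        raise ValueError("colors and keys must have one entry per rank")

    groups = {}
    for old_rank, color in enumerate(colors):
        groups.setdefault(color, []).append(old_rank)

    # New ranks are ordered by key; equal keys fall back to the old rank.
    for members in groups.values():
        members.sort(key=lambda r: (keys[r], r))
    return groups


if __name__ == "__main__":
    # Assumption: 18 ranks placed cyclically over 3 nodes, so rank % 3
    # identifies the node and serves as the color.
    node_colors = [rank % 3 for rank in range(18)]
    for color, members in sorted(simulate_comm_split(node_colors).items()):
        # The first member (new rank 0) would be a natural aggregator.
        print(f"node {color}: ranks {members}, aggregator {members[0]}")
```

In a real MPI program the application would compute only its own color and key and make the split call itself, for example `MPI.COMM_WORLD.Split(color, key)` in mpi4py. How the resulting communicator would be handed to the library is exactly the open question in this thread.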

@chuckatkins
Contributor

Even cooler would be a way to also have a user-defined callback at each step to adjust the aggregation. That is of particular interest for viz, where something like an isosurface is sparse but the nodes where the data is dense change across steps. That would obviously be much more work, but the initial support for different fixed aggregation strategies could lay the groundwork for it later on.

Initially, though, just having a few fixed strategies would be a great addition.

@williamfgc
Contributor

@jychoi-hpc thanks for the example, let's start (after the release) with the case needed by XGC. Thanks!

williamfgc self-assigned this on Jun 28, 2019

@pnorbert
Contributor

This sounds okay to me, but the most generic option is to allow passing a communicator in Open(), so that an application can order the processes for I/O in any way, e.g. to have consecutive ranks on one node.


5 participants