Avoid more comm/compute overlap in ugni #15193
Merged
In #12964 we eliminated comm/compute overlap for FMA operations until a
task has done ~100 FMA operations. Here we extend that to RDMA and
chained FMA operations as well. This is motivated by aggregation work,
where we do on-stmts followed by large RDMA PUTs/GETs and don't want
too many tasks started at once, since that increases memory pressure.
This also adds an option to completely disable comm/compute overlap in
ugni. It's not enabled by default, since it causes a 2x performance hit
for SSCA.
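
For illustration only, the heuristic described above can be sketched roughly as follows. This is not the runtime's actual code: `task_comm_state`, `should_yield_while_waiting`, `overlap_disabled`, and `YIELD_THRESHOLD` are hypothetical names, the threshold of 100 is taken from the "~100" in the description, and a single counter shared by all operation types is an assumption.

```c
#include <stdbool.h>

/* Hypothetical threshold; the description only says "~100". */
#define YIELD_THRESHOLD 100

/* Hypothetical per-task bookkeeping for this sketch. */
typedef struct {
    unsigned ops_done; /* comm operations this task has waited on so far */
} task_comm_state;

/* Hypothetical switch modeling the option that disables overlap entirely. */
bool overlap_disabled = false;

/*
 * Decide whether a task should yield to other tasks while waiting for a
 * comm operation (FMA, chained FMA, or RDMA) to complete. Until the task
 * has waited on YIELD_THRESHOLD operations it does not yield, so fewer
 * new tasks get started and memory pressure stays lower.
 */
bool should_yield_while_waiting(task_comm_state *t) {
    if (overlap_disabled) {
        return false;        /* never overlap comm with compute */
    }
    if (t->ops_done < YIELD_THRESHOLD) {
        t->ops_done++;       /* still in the no-overlap warm-up phase */
        return false;
    }
    return true;             /* past the threshold: allow overlap */
}
```

In this sketch a task that waits on 150 operations would skip yielding for the first 100 and yield on the remaining 50; with `overlap_disabled` set it would never yield.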