
Add multi-tensor hvd.grouped_allreduce API. #2453

Merged · 4 commits · Nov 19, 2020

Conversation

romerojosh (Collaborator)

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR introduces a new API to Horovod, hvd.grouped_allreduce. The purpose of this API is to give users explicit control over how Horovod fuses (or "groups") tensors for allreduce. Specifically, a list of tensors provided to hvd.grouped_allreduce is treated logically as a single request and is only processed by the backend when all tensors in the list are available. This is in contrast to Horovod's normal behavior, which greedily fuses any available tensors during a cycle. While greedy fusing is appropriate in many situations, circumstances can arise where users want greater control over how this fusion is done.
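A minimal usage sketch of the new call, assuming the PyTorch binding and that hvd.grouped_allreduce accepts a list of tensors and returns the list of reduced tensors (the tensor shapes here are illustrative only):

```python
import torch
import horovod.torch as hvd

hvd.init()

# Four tensors that should be negotiated and reduced together.
grads = [torch.rand(1024) for _ in range(4)]

# The whole list is treated as a single logical request: the backend waits
# until every tensor in the group is available on all ranks, then reduces them.
averaged = hvd.grouped_allreduce(grads)
```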

One situation is where a user wants to reduce the latency of Horovod's coordination/negotiation by lowering HOROVOD_CYCLE_TIME, but also wants to ensure that fused allreduce messages do not become too small. This is not currently possible, as the fusion message sizes and cycle time are tightly coupled. By defining explicit groups, the user is free to reduce the cycle time to as low a value as required for faster negotiation/coordination, while maintaining reasonable message sizes for network efficiency.

A second situation is when a user wants deterministic operation from Horovod. As has been established previously, the dynamic packing of the fusion buffer can cause allreduce results to be non-deterministic, because the location of tensors in the fusion buffer can affect summation order. The only way to get deterministic results from Horovod today is to disable fusion completely (by setting HOROVOD_FUSION_THRESHOLD=0), with an associated loss in performance. By explicitly defining the fusion groups via this API, deterministic fusion can be achieved, since there is now a mechanism to guarantee that the fusion buffers are packed in a deterministic order (both iteration to iteration and run to run). Note that by default this feature still allows groups to fuse into larger groups, which reintroduces non-determinism. To disable this and run with deterministic groups, the environment variable HOROVOD_DISABLE_GROUP_FUSION has been added.
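A sketch of how that deterministic mode might be enabled, assuming the environment variable is read at Horovod startup; the group layout shown is purely illustrative:

```python
import os

# Prevent Horovod from fusing explicitly defined groups into larger ones,
# so the fusion-buffer packing order stays fixed from run to run.
os.environ["HOROVOD_DISABLE_GROUP_FUSION"] = "1"

import torch
import horovod.torch as hvd

hvd.init()

# Each explicitly defined group is reduced as its own request, keeping the
# summation order deterministic across iterations and runs.
group_a = [torch.rand(256) for _ in range(2)]
group_b = [torch.rand(512) for _ in range(2)]
reduced_a = hvd.grouped_allreduce(group_a)
reduced_b = hvd.grouped_allreduce(group_b)
```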

This feature is available both as a direct Horovod operation, hvd.grouped_allreduce, and through an additional argument (num_groups) to hvd.DistributedOptimizer/hvd.DistributedTrainer. When num_groups is set, Horovod splits the list of gradient tensors into the requested number of groups and uses the appropriate hvd.grouped_allreduce calls to perform the gradient averaging.
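A sketch of the optimizer-level entry point, assuming the PyTorch hvd.DistributedOptimizer accepts the new num_groups keyword alongside its usual arguments (the model and optimizer setup are illustrative):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# With num_groups set, Horovod splits the gradient tensors into 4 groups and
# issues one grouped allreduce per group when averaging gradients.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    num_groups=4,
)
```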

This PR supersedes #1130, moving to a cleaner API while also completing support for MXNet and PyTorch alongside TensorFlow.


@tgaddair (Collaborator) left a comment:

LGTM! No blockers from me, just a few nits. Feel free to land when ready.

horovod/torch/functions.py (review thread resolved)
# To ensure parameter order and group formation is consistent, broadcast p_list order
# from rank 0 and use for every worker
p_list_names = [self._parameter_names.get(p) for p in p_list]
p_list_names = broadcast_object(p_list_names, root_rank=0)

Collaborator:
This helper function has turned out to be really useful.

Collaborator (Author):
Yes, I agree. 👍


# Form groups: split p_list into self._num_groups contiguous chunks whose
# sizes differ by at most one element (the first r chunks get one extra).
d, r = divmod(len(p_list), self._num_groups)
p_groups = [tuple(p_list[n * d + min(n, r):(n + 1) * d + min(n + 1, r)]) for n in range(self._num_groups)]

Collaborator:
I noticed this code in a few other places, would it make sense to abstract into a utility function in common?

Collaborator (Author):
Sure, I can do that.
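One possible shape for that shared helper; the name split_list and its docstring are assumptions for illustration, not code from this PR:

```python
def split_list(items, num_chunks):
    """Split items into num_chunks contiguous chunks whose sizes differ by at most one."""
    d, r = divmod(len(items), num_chunks)
    return [tuple(items[n * d + min(n, r):(n + 1) * d + min(n + 1, r)])
            for n in range(num_chunks)]

# Example: splitting 10 parameters into 3 groups yields chunks of sizes 4, 3, 3.
# split_list(list(range(10)), 3)
```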

return tensor_name_to_id_.empty();
}

int32_t GroupTable::RegisterGroup(const std::vector<std::string>& tensor_names) {

Collaborator:
Not a huge deal, but would be nice to consolidate a lot of duplication between these two functions.

Collaborator (Author):
Actually, I can just delete the duplicate functions as only one variant of the register and deregister group functions are being used anyway.

horovod/_keras/__init__.py (review thread resolved)

@github-actions

Unit Test Results

528 files (+10), 528 suites (+10), 4h 33m 44s ⏱️ (+12m 1s)
533 tests (+12): 504 passed ✔️ (+9), 28 skipped 💤 (+2), 1 failed ❌ (+1)
10 624 runs (+602): 8 506 passed ✔️ (+465), 2 117 skipped 💤 (+136), 1 failed ❌ (+1)

For more details on these failures, see this check.

Results for commit 9b3ce49. ± Comparison against base commit c7848ca.
