Add multi-tensor hvd.grouped_allreduce API. #2453
Conversation
LGTM! No blockers from me, just a few nits. Feel free to land when ready.
# To ensure parameter order and group formation is consistent, broadcast p_list order
# from rank 0 and use for every worker
p_list_names = [self._parameter_names.get(p) for p in p_list]
p_list_names = broadcast_object(p_list_names, root_rank=0)
This helper function has turned out to be really useful.
Yes, I agree. 👍
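For context, a minimal sketch of how the `broadcast_object` helper keeps a Python object consistent across ranks, as in the diff above (the parameter names below are purely illustrative):

```python
import horovod.torch as hvd

hvd.init()

# Per-rank list of parameter names; in principle the ordering could differ
# between workers.
local_names = ["conv1.weight", "conv1.bias", "fc.weight"]

# broadcast_object serializes the object on root_rank and sends it to every
# worker, so all ranks proceed with rank 0's ordering when forming groups.
names = hvd.broadcast_object(local_names, root_rank=0)
print(hvd.rank(), names)
```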
horovod/torch/optimizer.py
# Form groups
d, r = divmod(len(p_list), self._num_groups)
p_groups = [tuple(p_list[n * d + min(n, r):(n + 1) * d + min(n + 1, r)]) for n in range(self._num_groups)]
I noticed this code in a few other places; would it make sense to abstract it into a utility function in `common`?
Sure, I can do that.
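For reference, the suggested shared utility could look roughly like the sketch below; the name `split_list` and its exact placement in `common` are assumptions, not something confirmed in this thread:

```python
def split_list(l, n):
    """Split list l into n contiguous chunks whose sizes differ by at most one."""
    d, r = divmod(len(l), n)
    return [l[i * d + min(i, r):(i + 1) * d + min(i + 1, r)] for i in range(n)]

# Example: 7 items split into 3 groups of sizes 3, 2, 2.
assert split_list(list(range(7)), 3) == [[0, 1, 2], [3, 4], [5, 6]]
```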
horovod/common/group_table.cc
return tensor_name_to_id_.empty();
}

int32_t GroupTable::RegisterGroup(const std::vector<std::string>& tensor_names) {
Not a huge deal, but it would be nice to consolidate the duplication between these two functions.
Actually, I can just delete the duplicate functions, as only one variant of the register and deregister group functions is being used anyway.
Unit Test Results: 528 files (+10), 528 suites (+10), 4h 33m 44s ⏱️ (+12m 1s). For more details on these failures, see this check. Results for commit 9b3ce49. ± Comparison against base commit c7848ca.
Checklist before submitting
Description
This PR introduces a new API to Horovod, `hvd.grouped_allreduce`. The purpose of this API is to give users explicit control over how Horovod fuses (or "groups") tensors for allreduce. Specifically, a list of tensors provided to `hvd.grouped_allreduce` will be treated logically as a single request and will only be processed by the backend when all tensors in the list are available. This is in contrast to Horovod's normal process, which greedily fuses any available tensors during a cycle. While this greedy fusing is appropriate in many situations, a number of circumstances can arise where users may want greater control over how the fusion is done.

One situation is where a user wants to reduce the latency of Horovod coordination/negotiation by lowering `HOROVOD_CYCLE_TIME`, but also wants to ensure that fused allreduce messages do not become too small. This is not currently possible, as the fusion message sizes and cycle time are tightly coupled. By defining explicit groups, the user is free to reduce the cycle time to as low a value as required for faster negotiation/coordination, while maintaining reasonable message sizes for network efficiency.

A second situation is when a user wants deterministic operation from Horovod. As has been established previously, the dynamic packing of the fusion buffer can cause allreduce results to be non-deterministic, since the location of tensors in the fusion buffer can affect summation order. The only way to get deterministic results from Horovod right now is to disable fusion completely (by setting `HOROVOD_FUSION_THRESHOLD=0`), with an associated loss in performance. By explicitly defining the fusion groups via this API, deterministic fusion can be achieved, as there is now a mechanism to guarantee that the fusion buffers will be packed with a deterministic ordering (both iteration to iteration and run to run). Note that by default, this feature still allows groups to fuse into larger groups, which reintroduces non-determinism. To disable this and run with deterministic groups, the environment variable `HOROVOD_DISABLE_GROUP_FUSION` has been added.
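As an illustration of the deterministic-grouping workflow described above, a rough sketch assuming the PyTorch variant of the API (the tensors and group contents here are placeholders, not code from this PR):

```python
import os

# Keep explicitly defined groups from being fused into larger ones, so fusion
# buffer packing stays deterministic run to run. Must be set before hvd.init()
# starts the Horovod background thread.
os.environ["HOROVOD_DISABLE_GROUP_FUSION"] = "1"

import torch
import horovod.torch as hvd

hvd.init()

tensors = [torch.ones(4), torch.ones(8)]
# The whole list is negotiated as a single request and is only processed by
# the backend once every tensor in the list is available.
reduced = hvd.grouped_allreduce(tensors)
```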
This feature is available both as a direct Horovod operation, `hvd.grouped_allreduce`, and through an additional argument (`num_groups`) to `hvd.DistributedOptimizer`/`hvd.DistributedTrainer`. When `num_groups` is set, Horovod splits the list of gradient tensors into the requested number of groups and uses the appropriate `hvd.grouped_allreduce` calls to perform the gradient averaging operation.

This PR supersedes #1130 and updates to a cleaner API, while also completing support for MXNet and PyTorch along with TensorFlow.
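A rough usage sketch of the `num_groups` optimizer path described above, assuming the PyTorch API (the model, learning rate, and group count are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Ask Horovod to split the gradient tensors into 2 explicit groups; each group
# is averaged with a single grouped allreduce request instead of relying on
# greedy fusion of whatever tensors happen to be ready in a cycle.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    num_groups=2)
```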