Add multi-tensor hvd.grouped_allreduce API. #2453
Conversation
LGTM! No blockers from me, just a few nits. Feel free to land when ready.
# To ensure parameter order and group formation is consistent, broadcast p_list order
# from rank 0 and use for every worker
p_list_names = [self._parameter_names.get(p) for p in p_list]
p_list_names = broadcast_object(p_list_names, root_rank=0)
This helper function has turned out to be really useful.
Yes, I agree. 👍
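For context, a minimal sketch of how the `broadcast_object` helper keeps a Python object consistent across ranks, as in the diff above (the parameter names below are purely illustrative):

```python
import horovod.torch as hvd

hvd.init()

# Per-rank list of parameter names; in principle the ordering could differ
# between workers.
local_names = ["conv1.weight", "conv1.bias", "fc.weight"]

# broadcast_object serializes the object on root_rank and sends it to every
# worker, so all ranks proceed with rank 0's ordering when forming groups.
names = hvd.broadcast_object(local_names, root_rank=0)
print(hvd.rank(), names)
```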
horovod/torch/optimizer.py
# Form groups
d, r = divmod(len(p_list), self._num_groups)
p_groups = [tuple(p_list[n * d + min(n, r):(n + 1) * d + min(n + 1, r)]) for n in range(self._num_groups)]
I noticed this code in a few other places; would it make sense to abstract it into a utility function in `common`?
Sure, I can do that.
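For reference, the suggested shared utility could look roughly like the sketch below; the name `split_list` and its exact placement in `common` are assumptions, not something confirmed in this thread:

```python
def split_list(l, n):
    """Split list l into n contiguous chunks whose sizes differ by at most one."""
    d, r = divmod(len(l), n)
    return [l[i * d + min(i, r):(i + 1) * d + min(i + 1, r)] for i in range(n)]

# Example: 7 items split into 3 groups of sizes 3, 2, 2.
assert split_list(list(range(7)), 3) == [[0, 1, 2], [3, 4], [5, 6]]
```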
horovod/common/group_table.cc
return tensor_name_to_id_.empty();
}

int32_t GroupTable::RegisterGroup(const std::vector<std::string>& tensor_names) {
Not a huge deal, but it would be nice to consolidate the duplication between these two functions.
Actually, I can just delete the duplicate functions, as only one variant of the register and deregister group functions is being used anyway.
Unit Test Results: 528 files (+10), 528 suites (+10), 4h 33m 44s ⏱️ (+12m 1s). For more details on these failures, see this check. Results for commit 9b3ce49. ± Comparison against base commit c7848ca.
Checklist before submitting
Description
This PR introduces a new API to Horovod, `hvd.grouped_allreduce`. The purpose of this API is to give users explicit control over how Horovod fuses (or "groups") tensors for allreduce. Specifically, a list of tensors provided to `hvd.grouped_allreduce` will be treated logically as a single request and will only be processed by the backend when all tensors in the list are available. This is in contrast to Horovod's normal process, which greedily fuses any available tensors during a cycle. While this greedy fusing is appropriate in many situations, a number of circumstances can arise where users may want greater control over how the fusion is done.

One situation is where a user wants to reduce the latency of Horovod coordination/negotiation by lowering `HOROVOD_CYCLE_TIME`, but also wants to ensure that fused allreduce messages do not become too small. This is not currently possible, as the fusion message sizes and cycle time are tightly coupled. By defining explicit groups, the user is free to reduce the cycle time to as low a value as required for faster negotiation/coordination, while maintaining reasonable message sizes for network efficiency.

A second situation is when a user wants deterministic operation from Horovod. As has been established previously, the dynamic packing of the fusion buffer can cause allreduce results to be non-deterministic, since the location of tensors in the fusion buffer can affect summation order. The only way to get deterministic results from Horovod right now is to disable fusion completely (by setting `HOROVOD_FUSION_THRESHOLD=0`), with an associated loss in performance. By explicitly defining the fusion groups via this API, deterministic fusion can be achieved, as there is now a mechanism to guarantee that the fusion buffers will be packed with a deterministic ordering (both iteration to iteration and run to run). Note that by default, this feature still allows groups to fuse into larger groups, which reintroduces non-determinism. To disable this and run with deterministic groups, the environment variable `HOROVOD_DISABLE_GROUP_FUSION` has been added.
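As an illustration of the deterministic-grouping workflow described above, a rough sketch assuming the PyTorch variant of the API (the tensors and group contents here are placeholders, not code from this PR):

```python
import os

# Keep explicitly defined groups from being fused into larger ones, so fusion
# buffer packing stays deterministic run to run. Must be set before hvd.init()
# starts the Horovod background thread.
os.environ["HOROVOD_DISABLE_GROUP_FUSION"] = "1"

import torch
import horovod.torch as hvd

hvd.init()

tensors = [torch.ones(4), torch.ones(8)]
# The whole list is negotiated as a single request and is only processed by
# the backend once every tensor in the list is available.
reduced = hvd.grouped_allreduce(tensors)
```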
This feature is available both as a direct Horovod operation, `hvd.grouped_allreduce`, and through an additional argument (`num_groups`) to `hvd.DistributedOptimizer`/`hvd.DistributedTrainer`. When `num_groups` is set, Horovod splits the list of gradient tensors into the requested number of groups and uses the appropriate `hvd.grouped_allreduce` calls to perform the gradient averaging operation.

This PR supersedes #1130 and updates to a cleaner API, while also completing support for MXNet and PyTorch along with TensorFlow.
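A rough usage sketch of the `num_groups` optimizer path described above, assuming the PyTorch API (the model, learning rate, and group count are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Ask Horovod to split the gradient tensors into 2 explicit groups; each group
# is averaged with a single grouped allreduce request instead of relying on
# greedy fusion of whatever tensors happen to be ready in a cycle.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    num_groups=2)
```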