-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-331] Single machine All Reduce Topology-aware Communication (Updated) #11591
[MXNET-331] Single machine All Reduce Topology-aware Communication (Updated) #11591
Conversation
…ight when num_gpus <= 8
…bator-mxnet into feature_multirootv9
…to feature_multirootv9
…oblem, add header guard
…se PCI-E as fallback for GPUs that are not linked by NVLink
Yeah, I'm blocked by the test. I can't replicate it on local machine, but I can replicate it on Docker image. |
@haojin2 @eric-haibin-lin @rahul003 Trying to get this into 1.3 release |
src/kvstore/comm_tree.h
Outdated
if (dest_id != topo_id) { | ||
CopyFromTo(buf_from.merged[merged_row], | ||
&(buf_dest.copy_buf[merged_row][is_dest-1]), | ||
priority); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align the lines like:
CopyFromTo(buf_...,
&(buf_...,
priority);
|
||
// ComputeTreesTest with backtracking | ||
// TODO(carlyang): comment out test for now | ||
/*TEST(GpuTopology, TestComputeTrees1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with these tests? Do they not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They used to segfault only on CI, but now they should be fine. I fixed an off-by-1 bug.
tests/python/gpu/test_nccl.py
Outdated
def __exit__(self, ptype, value, trace): | ||
os.environ[self._key] = self._prev_val | ||
|
||
def test_device_pushpull(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this test in the file test_nccl?
src/kvstore/comm_tree.h
Outdated
CommDeviceTree() { | ||
inited_ = false; | ||
gpuarray_bound_ = dmlc::GetEnv("MXNET_KVSTORE_GPUARRAY_BOUND", 10000000); | ||
backtrack_ = dmlc::GetEnv("MXNET_KVSTORE_BACKTRACK", 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document these environment variables in https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation.
src/kvstore/comm_tree.h
Outdated
std::vector<float> link_matrix(devs_.size()*devs_.size()); | ||
GetP2PWeight(devs_, &link_matrix); | ||
if (backtrack_) | ||
LOG(WARNING) << "Using Backtracking to generate trees"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be LOG(INFO)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many other places as well where this change ought to be done
src/kvstore/gpu_topology.h
Outdated
} | ||
} | ||
|
||
// Performs partition on each existing partition in graph W if partition has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great comments for the functions, but could you fix comment style to how it's standard in the codebase. Here's an example https://github.com/apache/incubator-mxnet/blob/b4156da26cfe741619227ae726872b1255194900/src/kvstore/kvstore_utils.h#L37
src/kvstore/comm_tree.h
Outdated
std::vector<Context> devs_; | ||
|
||
/// \brief Highest numbered device | ||
int max_dev_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to see where this variable is used to ensure that cases when gpus '1,5,3,7' are given work. But it looks like this variable is not used? Please remove this then
src/kvstore/comm_tree.h
Outdated
// dev_id: 4 2 3 1 7 5 0 | ||
// and generated an n_gpus x n_gpus link topology matrix: | ||
// | ||
// 1) The reduction trees are saved as indices on 0, 1, ..., n_gpus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify how many are generated
@Roshrini For keeping track of PR |
docs/faq/env_var.md
Outdated
- Values: 0(false) or 1(true) ```(default=0)``` | ||
- If true and MXNET_KVSTORE_USETREE is set to 1, MXNet will log the reduction trees that have been generated. | ||
|
||
* MXNET_KVSTORE_GPUARRAY_BOUND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize it says multiple trees, but could you call out that this is for tree kvstore? especially because we have a similar variable MXNET_KVSTORE_BIGARRAY_BOUND
docs/faq/env_var.md
Outdated
- When the array size is bigger than this threshold and MXNET_KVSTORE_USETREE is set to 1, multiple trees are used to load balance the big gradient being communicated in order to better saturate link bandwidth. | ||
|
||
* MXNET_KVSTORE_BACKTRACK | ||
- Values: 0(false) or 1(true) ```(Default=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting issue
…pdated) (apache#11591) * add multiroot all-reduce communication pattern * fix bug with UpdateWeight * fix PCI-E links appearing in weight matrix bug * optimization to skip CopyFromTo in ReduceInner gains a bit of throughput * remove unnecessary if statement * Add tests * add more tests, 6 tests left to add * get rid of some dead code * Add comments * Add randomized tests for backtrack and kernighan-lin * Fix Postprocess * Add switch for first valid tree when num_gpus > 8, and for maximum weight when num_gpus <= 8 * Kernighan-Lin seems to find better trees * get rid of printfs * change defaults * inherit from CommDevice instead of Comm * Fix lint errors * Add Python test using MXNET_KVSTORE_USETREE, fix CMake compilation problem, add header guard * fix lint errors * better header guard that works for tests * get rid of unused variable warning * retrigger jenkins * resolve 2 comments * address comment using Class to do test, get rid of extraneous test, use PCI-E as fallback for GPUs that are not linked by NVLink * address comments * fix a few bugs * get rid of printfs * get rid of print * Comment out test for now * fix 2 more bugs * fix segfault * change PrintVector, PrintTopo, PrintMatrix to LOG(INFO) instead of stdout * Fix code alignment * get rid of todo * Make changes to env variable names to indicate they are TREE-related * Add note saying when ARRAY_BOUND env var takes effect
Description
Single machine All Reduce Topology-aware Communication
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
MXNET_KVSTORE_USETREE=1
, default is 0.Comments