[MXNET-331] Single machine All Reduce Topology-aware Communication #11357
Conversation
src/kvstore/comm.h
Outdated
@@ -767,11 +780,13 @@ class CommDevice : public Comm {
    return sparse_merged;
  }

 private:
 private:
nit: should have no extra space here.
Thanks, I fixed lint errors.
src/kvstore/kvstore_local.h
Outdated
@@ -56,7 +57,12 @@ class KVStoreLocal : public KVStore {
   */
  explicit KVStoreLocal(bool use_device_comm) : KVStore() {
    if (use_device_comm) {
      comm_ = new CommDevice();
      bool tree = dmlc::GetEnv("MXNET_KVSTORE_USETREE", 0);
Can we also have a Python GPU kvstore test with MXNET_KVSTORE_USETREE set?
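For illustration, a minimal sketch of what such a test could look like, assuming at least two GPUs; the key, shapes, and helper name here are hypothetical and not the test actually added in this PR:

```python
import os
import mxnet as mx

def check_dense_pushpull_with_tree(shape=(4, 4), ngpus=2):
    # The env var must be set before the kvstore is created, since it is read
    # in the KVStoreLocal constructor.
    os.environ["MXNET_KVSTORE_USETREE"] = "1"
    try:
        kv = mx.kv.create('device')
        kv.init(3, mx.nd.zeros(shape))
        grads = [mx.nd.ones(shape, ctx=mx.gpu(i)) for i in range(ngpus)]
        outs = [mx.nd.zeros(shape, ctx=mx.gpu(i)) for i in range(ngpus)]
        kv.push(3, grads)
        kv.pull(3, out=outs)
        # Every device should receive an identical reduced copy.
        for out in outs[1:]:
            assert (out.asnumpy() == outs[0].asnumpy()).all()
    finally:
        del os.environ["MXNET_KVSTORE_USETREE"]
```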
src/kvstore/comm.h
Outdated
@@ -750,6 +762,8 @@ class CommDevice : public Comm {
  std::vector<NDArray> compressed_send_buf;
  /// \brief the small buffer for compressed data in receiver
  std::vector<NDArray> compressed_recv_buf;
  /// \brief size of allocation in case we do not actually allocate merged
  TShape merged_size;
Is this being used?
src/kvstore/gpu_topology.h
Outdated
// w = w * alpha*u
template <typename T>
inline void ewisemult(const std::vector<int>& u,
    T alpha,
indentation is weird
kvstore = mx.kv.create('device')
copy = mx.nd.random_normal(shape=(4,4), ctx=mx.gpu(0))
grad = copy.tostype("row_sparse")
envs = ["","1"]
minor suggestion: we could add a util class like https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L93-L119 which manages the scope of such an env var. It sets the env var to the given value when entering the scope, and restores the previous value when exiting the scope.
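For reference, a rough sketch of such a scope manager along the lines of the linked autograd scope classes; the name and details are hypothetical, not the helper actually added in the PR:

```python
import os
from contextlib import contextmanager

@contextmanager
def environment(name, value):
    """Temporarily set an environment variable, restoring the old value on exit."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

# Example: the env var is only visible inside the with-block.
# with environment('MXNET_KVSTORE_USETREE', '1'):
#     kvstore = mx.kv.create('device')
```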
Thanks, I added it in bd926bf.
…se PCI-E as fallback for GPUs that are not linked by NVLink
Recreated repo, which unfortunately has detached the branch from this PR. The code is at: https://github.com/ctcyang/incubator-mxnet/tree/feature_multirootv9
@@ -658,6 +671,42 @@ class CommDevice : public Comm {
    }
  }

  using KeyAttrs = std::tuple<int, TShape, int>;
  // try to allocate buff on device evenly
  void InitMergeBuffer(const std::vector<Context>& devs) {
Just to confirm, did you make any changes to this function? Asking because the move makes it hard to see the diff.
Nope. I made no changes to the existing --kv-store device.
// track of each key's shape within BufferEntry
// -this information is required for inherited Reduce- and
// BroadcastRowSparse
InitMergeBuffer(devs_);
Why do we need the regular merge buffer too?
ReduceRowSparse and BroadcastRowSparse will be implemented using topology-aware communication in the future. For now, the regular merge buffer is needed so we can fall back to the existing --kv-store device behaviour for ReduceRowSparse and BroadcastRowSparse. This fallback is tested in the changed unittest tests/python/gpu/test_kvstore_gpu.py. Due to the delay_alloc functionality, this does not cost any actual memory allocation if we don't end up using the InitMergeBuffer temporary buffers.
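As a rough illustration of that fallback check, the env var can be toggled around a row_sparse push/pull; the key and shapes below are hypothetical, not the exact contents of test_kvstore_gpu.py:

```python
import os
import mxnet as mx

def check_rowsparse_fallback(shape=(4, 4)):
    row_ids = mx.nd.array(list(range(shape[0])), dtype='int64')
    for env in ["", "1"]:
        # With "1" the tree kvstore is requested; row_sparse push/pull should
        # still work via the regular CommDevice merge buffer.
        os.environ["MXNET_KVSTORE_USETREE"] = env
        kv = mx.kv.create('device')
        kv.init('a', mx.nd.zeros(shape).tostype('row_sparse'))
        grad = mx.nd.random_normal(shape=shape, ctx=mx.gpu(0)).tostype('row_sparse')
        kv.push('a', grad)
        out = mx.nd.zeros(shape, ctx=mx.gpu(0), stype='row_sparse')
        kv.row_sparse_pull('a', out=out, row_ids=row_ids)
    del os.environ["MXNET_KVSTORE_USETREE"]
```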
Closed this PR. I deleted my old repo, and made a new one. See new PR with code here: #11591
Description
Single machine All Reduce Topology-aware Communication
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments