
[MXNET-331] Single machine All Reduce Topology-aware Communication #11357

Closed
wants to merge 27 commits into from

Conversation

Contributor

@ctcyang ctcyang commented Jun 22, 2018

Description

Single machine All Reduce Topology-aware Communication

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • The proposed communication method shows a speed-up over both existing methods (parameter server and NCCL) at small batch sizes for ResNet-50, VGG-16, Inception-v3, and AlexNet.
  • The communication method queries the single-machine multi-GPU link topology and determines a suitable communication pattern to use (a usage sketch follows this list).
  • In the future, an auto-tuner will be added to automatically choose among the single-machine communication protocols (parameter server, NCCL, and the method proposed here).
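
A minimal usage sketch (not part of the original PR text), assuming the new path is gated by the MXNET_KVSTORE_USETREE environment variable read in the KVStoreLocal diff further down, and that the variable must be set before the kvstore is created:

import os
import mxnet as mx

# Opt in to the topology-aware tree all-reduce; leaving the variable unset
# (or "0") keeps the existing CommDevice behaviour. The value is read when
# the 'device' kvstore is constructed.
os.environ['MXNET_KVSTORE_USETREE'] = '1'
kv = mx.kv.create('device')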

Comments

Contributor Author

ctcyang commented Jun 22, 2018

@rahul003 @eric-haibin-lin

@@ -767,11 +780,13 @@ class CommDevice : public Comm {
return sparse_merged;
}

private:
private:
Contributor

@haojin2 haojin2 Jun 22, 2018

nit: should have no extra space here.

Contributor Author

Thanks, I fixed lint errors.

@@ -56,7 +57,12 @@ class KVStoreLocal : public KVStore {
*/
explicit KVStoreLocal(bool use_device_comm) : KVStore() {
if (use_device_comm) {
comm_ = new CommDevice();
bool tree = dmlc::GetEnv("MXNET_KVSTORE_USETREE", 0);
Member

Can we also have a Python GPU kvstore test with MXNET_KVSTORE_USETREE set?
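
A hedged sketch of what such a test could look like (illustrative only, not the test actually added in this PR; it assumes at least two GPUs and toggles the environment variable around kvstore creation):

import os
import mxnet as mx
import numpy as np

def check_dense_push_pull(usetree):
    # Toggle the topology-aware path; "" restores the default device comm.
    os.environ['MXNET_KVSTORE_USETREE'] = usetree
    kv = mx.kv.create('device')
    shape = (4, 4)
    gpus = [mx.gpu(i) for i in range(2)]
    kv.init(3, mx.nd.zeros(shape, ctx=gpus[0]))
    kv.push(3, [mx.nd.ones(shape, ctx=g) for g in gpus])
    out = [mx.nd.zeros(shape, ctx=g) for g in gpus]
    kv.pull(3, out=out)
    for o in out:
        # With no updater set, the stored value is the sum of the pushed
        # gradients (i.e. the number of GPUs) on both code paths.
        np.testing.assert_allclose(o.asnumpy(), np.full(shape, len(gpus)))

for env in ["", "1"]:
    check_dense_push_pull(env)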

@@ -750,6 +762,8 @@ class CommDevice : public Comm {
std::vector<NDArray> compressed_send_buf;
/// \brief the small buffer for compressed data in receiver
std::vector<NDArray> compressed_recv_buf;
/// \brief size of allocation in case we do not actually allocate merged
TShape merged_size;
Member

Is this being used?

// w = w * alpha*u
template <typename T>
inline void ewisemult(const std::vector<int>& u,
T alpha,
Member

indentation is weird

kvstore = mx.kv.create('device')
copy = mx.nd.random_normal(shape=(4,4), ctx=mx.gpu(0))
grad = copy.tostype("row_sparse")
envs = ["","1"]
Member

Minor suggestion: we could add a util class like https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L93-L119 which manages the scope of such an env var. It sets the env var to a given value when entering the scope and resets it when exiting the scope.
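
A hedged sketch of such a scope helper (names are illustrative, not those added in commit bd926bf; the linked autograd utility is class-based, but a contextmanager gives the same set-on-enter, restore-on-exit behaviour):

import os
from contextlib import contextmanager

@contextmanager
def env_var_scope(name, value):
    # Set the env var on entering the scope and restore its previous value
    # (or remove it) on exit, even if the body raises.
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

# e.g. with env_var_scope('MXNET_KVSTORE_USETREE', '1'):
#          kvstore = mx.kv.create('device')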

Contributor Author

Thanks, I added it to bd926bf

Carl Yang added 2 commits July 2, 2018 22:18
Contributor Author

ctcyang commented Jul 5, 2018

I recreated the repo, which unfortunately detached the branch from this PR. The code is at: https://github.com/ctcyang/incubator-mxnet/tree/feature_multirootv9

@@ -658,6 +671,42 @@ class CommDevice : public Comm {
}
}

using KeyAttrs = std::tuple<int, TShape, int>;
// try to allocate buff on device evenly
void InitMergeBuffer(const std::vector<Context>& devs) {
Member

Just to confirm, did you make any changes to this function? Asking because the move makes it hard to see the diff for this.

Contributor Author

Nope. I made no changes to the existing --kv-store device.

// track of each key's shape within BufferEntry
// -this information is required for inherited Reduce- and
// BroadcastRowSparse
InitMergeBuffer(devs_);
Member

Why do we need the regular merge buffer too?

Contributor Author

@ctcyang ctcyang Jul 6, 2018

ReduceRowSparse and BroadcastRowSparse will be implemented using topology-aware communication in the future. For now, the regular merge buffer is needed so that we can fall back to the existing --kv-store device behaviour for ReduceRowSparse and BroadcastRowSparse. This fallback is tested in the updated unit test tests/python/gpu/test_kvstore_gpu.py.

Due to the delay_alloc functionality, this does not cost any actual memory allocation if we don't end up using the InitMergeBuffer temporary buffers.

Contributor Author

ctcyang commented Jul 6, 2018

Closed this PR. I deleted my old repo, and made a new one.

See new PR with code here: #11591

@ctcyang ctcyang closed this Jul 6, 2018