
[MXNET-331] Single machine All Reduce Topology-aware Communication #11357

Closed
wants to merge 27 commits into from

Conversation

Contributor

@ctcyang ctcyang commented Jun 22, 2018

Description

Single machine All Reduce Topology-aware Communication

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • The proposed communication method shows a speed-up over both existing methods (parameter server and NCCL) at small batch sizes for ResNet-50, VGG-16, Inception-v3, and AlexNet.
  • The communication method queries the single-machine multi-GPU link topology and determines a suitable communication pattern to use (a usage sketch follows this list).
  • In the future, an auto-tuner will be added to automatically choose among the single-machine communication protocols (parameter server, NCCL, and the method proposed here).
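
A minimal usage sketch (not part of the original PR text), assuming the new path is gated by the MXNET_KVSTORE_USETREE environment variable read in the KVStoreLocal diff further down, and that the variable must be set before the kvstore is created:

import os
import mxnet as mx

# Opt in to the topology-aware tree all-reduce; leaving the variable unset
# (or "0") keeps the existing CommDevice behaviour. The value is read when
# the 'device' kvstore is constructed.
os.environ['MXNET_KVSTORE_USETREE'] = '1'
kv = mx.kv.create('device')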

Comments

Contributor Author

ctcyang commented Jun 22, 2018

@rahul003 @eric-haibin-lin

@@ -767,11 +780,13 @@ class CommDevice : public Comm {
return sparse_merged;
}

private:
private:
Contributor

@haojin2 haojin2 Jun 22, 2018

nit: should have no extra space here.

Contributor Author

Thanks, I fixed lint errors.

@@ -56,7 +57,12 @@ class KVStoreLocal : public KVStore {
*/
explicit KVStoreLocal(bool use_device_comm) : KVStore() {
if (use_device_comm) {
comm_ = new CommDevice();
bool tree = dmlc::GetEnv("MXNET_KVSTORE_USETREE", 0);
Member

Can we also have a Python GPU kvstore test with MXNET_KVSTORE_USETREE set?
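
A hedged sketch of what such a test could look like (illustrative only, not the test actually added in this PR; it assumes at least two GPUs and toggles the environment variable around kvstore creation):

import os
import mxnet as mx
import numpy as np

def check_dense_push_pull(usetree):
    # Toggle the topology-aware path; "" restores the default device comm.
    os.environ['MXNET_KVSTORE_USETREE'] = usetree
    kv = mx.kv.create('device')
    shape = (4, 4)
    gpus = [mx.gpu(i) for i in range(2)]
    kv.init(3, mx.nd.zeros(shape, ctx=gpus[0]))
    kv.push(3, [mx.nd.ones(shape, ctx=g) for g in gpus])
    out = [mx.nd.zeros(shape, ctx=g) for g in gpus]
    kv.pull(3, out=out)
    for o in out:
        # With no updater set, the stored value is the sum of the pushed
        # gradients (i.e. the number of GPUs) on both code paths.
        np.testing.assert_allclose(o.asnumpy(), np.full(shape, len(gpus)))

for env in ["", "1"]:
    check_dense_push_pull(env)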

@@ -750,6 +762,8 @@ class CommDevice : public Comm {
std::vector<NDArray> compressed_send_buf;
/// \brief the small buffer for compressed data in receiver
std::vector<NDArray> compressed_recv_buf;
/// \brief size of allocation in case we do not actually allocate merged
TShape merged_size;
Member

Is this being used?

// w = w * alpha*u
template <typename T>
inline void ewisemult(const std::vector<int>& u,
T alpha,
Member

indentation is weird

kvstore = mx.kv.create('device')
copy = mx.nd.random_normal(shape=(4,4), ctx=mx.gpu(0))
grad = copy.tostype("row_sparse")
envs = ["","1"]
Member

Minor suggestion: we could add a util class like https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L93-L119 which manages the scope of such an env var. It sets the env var to a given value when entering the scope and resets it when exiting the scope.
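
A hedged sketch of such a scope helper (names are illustrative, not those added in commit bd926bf; the linked autograd utility is class-based, but a contextmanager gives the same set-on-enter, restore-on-exit behaviour):

import os
from contextlib import contextmanager

@contextmanager
def env_var_scope(name, value):
    # Set the env var on entering the scope and restore its previous value
    # (or remove it) on exit, even if the body raises.
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

# e.g. with env_var_scope('MXNET_KVSTORE_USETREE', '1'):
#          kvstore = mx.kv.create('device')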

Contributor Author

Thanks, I added it to bd926bf

Carl Yang added 2 commits July 2, 2018 22:18
Contributor Author

ctcyang commented Jul 5, 2018

I recreated the repo, which unfortunately detached the branch from this PR. The code is at: https://github.com/ctcyang/incubator-mxnet/tree/feature_multirootv9

@@ -658,6 +671,42 @@ class CommDevice : public Comm {
}
}

using KeyAttrs = std::tuple<int, TShape, int>;
// try to allocate buff on device evenly
void InitMergeBuffer(const std::vector<Context>& devs) {
Member

Just to confirm, did you make any changes to this function? Asking because the move makes it hard to see the diff for this.

Contributor Author

Nope. I made no changes to the existing --kv-store device.

// track of each key's shape within BufferEntry
// -this information is required for inherited Reduce- and
// BroadcastRowSparse
InitMergeBuffer(devs_);
Member

Why do we need the regular merge buffer too?

Contributor Author

@ctcyang ctcyang Jul 6, 2018

ReduceRowSparse and BroadcastRowSparse will be implemented using topology-aware communication in the future. For now, the regular merge buffer is needed so that we can fall back to the existing --kv-store device behaviour for ReduceRowSparse and BroadcastRowSparse. This fallback is tested in the updated unit test tests/python/gpu/test_kvstore_gpu.py.

Due to the delay_alloc functionality, this does not cost any actual memory allocation if we don't end up using the InitMergeBuffer temporary buffers.

Contributor Author

ctcyang commented Jul 6, 2018

Closed this PR. I deleted my old repo, and made a new one.

See new PR with code here: #11591

@ctcyang ctcyang closed this Jul 6, 2018