nd boxing use nccl send/recv #7936
Conversation
  hierarchy_index_helper, in_nd_sbp, visit);
} else {
  // If Split or PartialSum, go through all the ranks along the depth-dimension.
  for (int64_t i = 0; i < parallel_hierarchy.dim_vec().at(depth); i++) {
Here you can just use parallel_hierarchy.At(depth) directly.
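A minimal sketch of the suggested change (assuming parallel_hierarchy is a Shape that exposes At(), as elsewhere in OneFlow):

// Read the extent of the depth-dimension directly instead of going through dim_vec().
for (int64_t i = 0; i < parallel_hierarchy.At(depth); i++) {
  // ... iterate over the ranks along the depth-dimension as before ...
}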
CHECK_EQ(out_id, parallel_id);
const TensorSliceView& in_slice = in_slices.at(in_id);
const TensorSliceView& intersection = cur_rank_out_slice.Intersect(in_slice);
dst_recv_intersections->at(in_id) = intersection;
If the intersection is empty, the two slices have no overlap, so can we skip the assignment here? When the current dimension is broadcast, dst_recv_intersections already has "holes", so when the dimension is Split or PartialSum, is it enough to update only the in_ids that actually have an intersection?
Yes, it can indeed be removed. I missed this line:
if (intersection.IsEmpty()) { return; }
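A sketch of the patched receive-side visitor, with the names taken from the quoted diff and the suggested early return added:

CHECK_EQ(out_id, parallel_id);
const TensorSliceView& in_slice = in_slices.at(in_id);
const TensorSliceView& intersection = cur_rank_out_slice.Intersect(in_slice);
// Skip in_ids whose slice does not overlap the current rank's output slice.
if (intersection.IsEmpty()) { return; }
dst_recv_intersections->at(in_id) = intersection;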
if (in_id != parallel_id) { return; }
const TensorSliceView& out_slice = out_slices.at(out_id);
const TensorSliceView& intersection = out_slice.Intersect(cur_rank_in_slice);
src_send_intersections->at(out_id) = intersection;
Same as above: only update when the intersection is not empty.
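The corresponding sketch for the send side, again using the names from the quoted diff:

if (in_id != parallel_id) { return; }
const TensorSliceView& out_slice = out_slices.at(out_id);
const TensorSliceView& intersection = out_slice.Intersect(cur_rank_in_slice);
// Skip out_ids whose slice does not overlap the current rank's input slice.
if (intersection.IsEmpty()) { return; }
src_send_intersections->at(out_id) = intersection;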
bool NdSbpNoPartialParallel(const NdSbp& nd_sbp) {
  CHECK_GT(nd_sbp.sbp_parallel_size(), 0);
  FOR_RANGE(int64_t, i, 0, nd_sbp.sbp_parallel_size()) {
    if (nd_sbp.sbp_parallel(i).has_partial_sum_parallel()) { return false; }
  }
  return true;
}
This duplicates NdSbpHasPartialParallel in oneflow/core/job/nd_sbp_util.h; delete it?
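A minimal sketch of the suggested reuse (assuming NdSbpHasPartialParallel in oneflow/core/job/nd_sbp_util.h returns true when any axis is PartialSum):

#include "oneflow/core/job/nd_sbp_util.h"

// Drop the local duplicate and negate the existing helper instead.
bool NdSbpNoPartialParallel(const NdSbp& nd_sbp) {
  return !NdSbpHasPartialParallel(nd_sbp);
}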
OF_NCCL_CHECK(ncclGroupStart());
for (int64_t i = 0; i < parallel_num; ++i) {
  if (send_elem_cnts.at(i) != 0) {
    LOG(INFO) << parallel_id << " send " << send_elem_cnts.at(i) << " to " << i;
Use VLOG(3) here; this doesn't need to be printed on every run, otherwise the log gets far too long.
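Suggested form, as a sketch:

VLOG(3) << parallel_id << " send " << send_elem_cnts.at(i) << " to " << i;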
                              comm, cuda_stream));
  }
  if (recv_elem_cnts.at(i) != 0) {
    LOG(INFO) << parallel_id << " recv " << recv_elem_cnts.at(i) << " from " << i;
Same as above.
    }
  }
} else {
  std::unique_ptr<ep::primitive::Add> primitive =
Rename primitive to add_primitive?
void* out_buf = reinterpret_cast<void*>(buf_ptr + offset);
memset_primitive->Launch(ctx->stream(), out_buf, 0,
                         out->shape().elem_cnt() * GetSizeOfDataType(data_type));
out_tensor_slice_copier_vec.at(i)->Copy(ctx->stream(), out_buf, recv_out_ptr.at(i));
primitive->Launch(ctx->stream(), out->dptr(), out_buf, out->mut_dptr(),
                  out->shape().elem_cnt());
This doesn't seem necessary. Can the if ... else ... branches be merged like this, so there is no need to prepare a tmp buf for the output?

primitive->Launch(ctx->stream(), out->dptr(), recv_out_ptr.at(i), out->mut_dptr(),
                  recv_elem_cnts.at(i));
If the memset at line 159 is hoisted out of the for loop, can the branch where src_nd_sbp_no_partial_parallel_ is false be made simpler? That way there is also no need to prepare a tmp buf for the output:
std::unique_ptr<ep::primitive::Add> primitive =
ep::primitive::NewPrimitive<ep::primitive::AddFactory>(ctx->stream()->device_type(),
out->data_type());
CHECK(primitive);
std::unique_ptr<ep::primitive::Memset> memset_primitive =
ep::primitive::NewPrimitive<ep::primitive::MemsetFactory>(ctx->stream()->device_type());
CHECK(memset_primitive);
memset_primitive->Launch(ctx->stream(), out->mut_dptr(), 0,
out->shape().elem_cnt() * GetSizeOfDataType(data_type));
for (int64_t i = 0; i < parallel_num; ++i) {
if (out_tensor_slice_copier_vec.at(i)) {
primitive->Launch(ctx->stream(), out->dptr(), recv_out_ptr.at(i), out->mut_dptr(),
recv_elem_cnts.at(i));
}
}
When the shapes differ, the offset has to be taken into account; the data is not necessarily copied to the start of the pointer, so the original code is fine.
const int64_t machine_id = CHECK_JUST(out_parallel_desc.MachineId4ParallelId(out_id));
int64_t device_index = CHECK_JUST(out_parallel_desc.DeviceId4ParallelId(out_id));
int64_t thrd_id = EncodeStreamIdToInt64(GenerateNamedTaskStreamId(
    machine_id, out_parallel_desc.device_type(), device_index, "NCCL_SEND_RECV_BOXING"));
The same stream can't be used here; the ordering can't be guaranteed.
"NCCL_SEND_RECV_BOXING" + NewUniqueId()
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7936/
Speed stats:
visit(hierarchy_index_helper.NdIndexToOffset(out_parallel_ids.data(),
                                             parallel_hierarchy.NumAxes()),
      hierarchy_index_helper.NdIndexToOffset(in_parallel_ids.data(),
                                             parallel_hierarchy.NumAxes()));
return;
visit doesn't need to take two arguments; passing a single in_id is enough.
Each rank's out_id is fixed. What actually happens here is that an out_id is converted to an NdIndex, then on several branches the unchanged NdIndex is converted back to an out_id, and the code below merely checks that the out_id is still equal after the two conversions. Dropping this saves many out_id conversions.
There is another reason to remove the out_id conversion: when different placements are implemented, you will find that this out_id is only meaningful as an NdIndex derived under in_parallel_desc.
For example, with [0, 1, 2, 3] -> [1, 2, 3, 4],
out_id 0 is device 1, and device 1's NdIndex under out_parallel_desc is (0, 0), whose corresponding input device is device 0. Making device 1 transfer from device 0 first is clearly a loss.
Consider [0, 1, 2, 3]: B -> [1, 2, 3, 4]: S(0).
The transfer volume should be 1/4 T, but guided by the out_parallel_desc NdIndex it rises to T.
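A sketch of what a single-argument visitor could look like (hypothetical; the variable names follow the quoted receive-side diff, and the current rank's out_id is assumed to be fixed outside the lambda):

const auto visit = [&](int64_t in_id) {
  const TensorSliceView& in_slice = in_slices.at(in_id);
  const TensorSliceView& intersection = cur_rank_out_slice.Intersect(in_slice);
  if (intersection.IsEmpty()) { return; }
  dst_recv_intersections->at(in_id) = intersection;
};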
This has been tested and works fine.
sbp=src_nd_sbp,
placement=placement,
)
graph = TestGraph()
Would this test be better placed under test/graph?
Speed stats:
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
…flow-Inc/oneflow into dev_nd_nccl_send_recv_boxing
…flow-Inc/oneflow into dev_nd_nccl_send_recv_boxing
Has been merged into master in #8437
Use nccl send/recv to support boxing for any case where src_parallel_desc == dst_parallel_desc, device=kCUDA, and dst contains no P.