
Dev support tensor pin memory #8073

Merged: 70 commits into master on May 9, 2022
Conversation


@Flowingsun007 Flowingsun007 commented Apr 21, 2022

Related issue: https://github.com/Oneflow-Inc/OneTeam/issues/1180

  • Support Tensor.pin_memory() in eager mode
  • Support the pin_memory argument for flow.empty in eager mode
  • Add test cases and API docs
  • Make TensorSetItem support scalar tensor + ellipsis indexing
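The semantics these bullets describe can be sketched as a toy model (plain Python, for illustration only; `ToyTensor`, its `pinned` field, and this `empty` helper are hypothetical stand-ins, not OneFlow's implementation — only the "dense CPU tensors only" pinning rule is taken from the PR's code):

```python
class ToyTensor:
    """Toy stand-in for a tensor that records whether its buffer is pinned."""

    def __init__(self, shape, device="cpu", pinned=False):
        self.shape = shape
        self.device = device
        self.pinned = pinned

    def pin_memory(self):
        """Return a tensor backed by page-locked (pinned) host memory.

        Mirrors the PR's constraint: only dense CPU tensors can be pinned.
        """
        if self.device != "cpu":
            raise RuntimeError(
                f"cannot pin tensor with device: {self.device}, "
                "only dense CPU tensors can be pinned."
            )
        if self.pinned:  # already pinned: nothing to do
            return self
        return ToyTensor(self.shape, self.device, pinned=True)


def empty(*shape, device="cpu", pin_memory=False):
    """Toy flow.empty: the pin_memory flag selects the allocator up front."""
    if pin_memory and device != "cpu":
        raise RuntimeError("pin_memory is only supported for CPU tensors")
    return ToyTensor(shape, device, pinned=pin_memory)
```

With this model, `empty(2, 3, pin_memory=True)` yields an already-pinned tensor, while `ToyTensor((4,)).pin_memory()` pins after the fact — the two entry points the bullets add.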


@zhongshsh zhongshsh self-requested a review April 21, 2022 08:38
Review threads marked resolved (outdated code) in:

  • oneflow/api/python/functional/tensor_api.cpp (5 threads)
  • oneflow/core/functional/impl/array_functor.cpp
  • oneflow/core/framework/tensor_impl.cpp
  • oneflow/core/eager/eager_blob_object.h
  • oneflow/core/device/device_context.h
@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 13:44
github-actions bot commented May 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.9ms (= 12890.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 148.6ms (= 14857.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.15 (= 148.6ms / 128.9ms)

OneFlow resnet50 time: 78.7ms (= 7869.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.0ms (= 8498.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 85.0ms / 78.7ms)

OneFlow resnet50 time: 55.5ms (= 11090.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.3ms (= 12065.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 60.3ms / 55.5ms)

OneFlow resnet50 time: 42.5ms (= 8491.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 41.9ms (= 8372.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 0.99 (= 41.9ms / 42.5ms)

OneFlow resnet50 time: 35.1ms (= 7025.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 41.1ms (= 8212.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.17 (= 41.1ms / 35.1ms)

OneFlow swin dataloader time: 0.259s (= 51.753s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.230s / 200, num_workers=1)
Relative speed: 0.584 (= 0.151s / 0.259s)

OneFlow swin dataloader time: 0.065s (= 13.049s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.285s / 200, num_workers=4)
Relative speed: 0.635 (= 0.041s / 0.065s)

OneFlow swin dataloader time: 0.037s (= 7.460s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.580s / 200, num_workers=8)
Relative speed: 0.614 (= 0.023s / 0.037s)

❌ OneFlow resnet50 time: 146.8ms (= 14678.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.3ms (= 16529.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 165.3ms / 146.8ms)

OneFlow resnet50 time: 96.7ms (= 9674.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 122.7ms (= 12271.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 122.7ms / 96.7ms)

OneFlow resnet50 time: 74.3ms (= 14851.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 85.4ms (= 17087.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.15 (= 85.4ms / 74.3ms)

OneFlow resnet50 time: 63.0ms (= 12599.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15029.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 75.1ms / 63.0ms)

OneFlow resnet50 time: 55.7ms (= 11141.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.5ms (= 14909.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 74.5ms / 55.7ms)

@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 15:03
github-actions bot commented May 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8073/

@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 22:19
github-actions bot commented May 9, 2022

Speed stats:

github-actions bot commented May 9, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8073/

github-actions bot commented May 9, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.2ms (= 12918.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.5ms (= 14054.3ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 140.5ms / 129.2ms)

OneFlow resnet50 time: 80.3ms (= 8029.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.3ms (= 8434.7ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 84.3ms / 80.3ms)

OneFlow resnet50 time: 52.4ms (= 10477.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.9ms (= 11579.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.11 (= 57.9ms / 52.4ms)

OneFlow resnet50 time: 41.5ms (= 8297.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.2ms (= 9642.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.16 (= 48.2ms / 41.5ms)

OneFlow resnet50 time: 36.3ms (= 7252.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 45.5ms (= 9095.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.25 (= 45.5ms / 36.3ms)

OneFlow swin dataloader time: 0.421s (= 84.248s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.165s / 200, num_workers=1)
Relative speed: 0.358 (= 0.151s / 0.421s)

OneFlow swin dataloader time: 0.066s (= 13.182s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.571s / 200, num_workers=4)
Relative speed: 0.650 (= 0.043s / 0.066s)

OneFlow swin dataloader time: 0.036s (= 7.197s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.426s / 200, num_workers=8)
Relative speed: 0.615 (= 0.022s / 0.036s)

❌ OneFlow resnet50 time: 145.6ms (= 14562.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.3ms (= 16731.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 167.3ms / 145.6ms)

OneFlow resnet50 time: 96.4ms (= 9642.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 110.4ms (= 11036.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 110.4ms / 96.4ms)

OneFlow resnet50 time: 76.1ms (= 15221.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 85.0ms (= 17008.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.12 (= 85.0ms / 76.1ms)

OneFlow resnet50 time: 64.4ms (= 12870.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.1ms (= 14810.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.15 (= 74.1ms / 64.4ms)

OneFlow resnet50 time: 56.8ms (= 11358.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15018.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 75.1ms / 56.8ms)

github-actions bot commented May 9, 2022

CI failed when running job: cuda-benchmark. The automerge label has been removed from the PR.

@github-actions github-actions bot removed the automerge label May 9, 2022
@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 9, 2022 03:22
@Flowingsun007 Flowingsun007 merged commit b826253 into master May 9, 2022
@Flowingsun007 Flowingsun007 deleted the dev_support_tensor_pin_memory branch May 9, 2022 05:10
@Flowingsun007 Flowingsun007 mentioned this pull request Jun 20, 2022
3 tasks
lixinqi commented Jun 24, 2022

EagerBlobObject::pin_memory_ is extremely invasive: it shows up all the way from the Functor through the op_interpreter to EagerBlobObject, which leaves a lot of abrupt, out-of-place code.

Could we do this more simply? For example, introduce a special StreamRole::kComputeOnPinMemory, so that the allocator of such a stream is the pinned-memory allocator.

Comment on lines +54 to +71
vm::Allocator* allocator = nullptr;
if (pin_memory) {
  CHECK_EQ_OR_RETURN(device_ctx->device_type(), DeviceType::kCPU)
      << Error::RuntimeError() << "cannot pin tensor with device: " << device_ctx->device_type()
      << ", only dense CPU tensors can be pinned.";
  allocator = dynamic_cast<CpuDeviceCtx*>(device_ctx)->mut_pin_memory_allocator();
  if (allocator == nullptr) {
    // The pin-memory allocator may fail to be created, e.g. when there is no CUDA
    // library support and oneflow can only run in CPU-only mode.
    return Error::RuntimeError()
           << "create pin_memory allocator failed for some reason. mostly, this error has "
              "occurred because you are trying to use some CUDA functionality, but the CUDA "
              "library has not been loaded by the dynamic linker for some reason.";
  }
} else {
  allocator = device_ctx->mut_allocator();
}
CHECK_NOTNULL_OR_RETURN(allocator) << Error::RuntimeError() << "allocator created failed!";
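The branch above boils down to allocator selection. A minimal Python paraphrase of that logic (function and parameter names here are hypothetical, chosen for illustration):

```python
def select_allocator(device_type, pin_memory, pinned_allocator, default_allocator):
    """Pick the allocator the way the C++ snippet above does.

    pinned_allocator may be None, e.g. when the build has no CUDA support,
    in which case asking for pinned memory is an error.
    """
    if not pin_memory:
        return default_allocator
    if device_type != "cpu":
        raise RuntimeError(
            f"cannot pin tensor with device: {device_type}, "
            "only dense CPU tensors can be pinned."
        )
    if pinned_allocator is None:
        raise RuntimeError(
            "create pin_memory allocator failed; most likely CUDA functionality "
            "was requested but the CUDA library could not be loaded."
        )
    return pinned_allocator
```

The reviewer's point is that this decision tree, and the pin_memory flag that feeds it, have to be threaded through every layer — which motivates the stream-role alternative discussed below.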

These special-case branches all look very out of place.

@Flowingsun007 (Contributor, Author) replied:

(Quoting @lixinqi) EagerBlobObject::pin_memory_ is extremely invasive: it shows up all the way from the Functor through the op_interpreter to EagerBlobObject, which leaves a lot of abrupt, out-of-place code.

But pin_memory support really does impose two requirements:

  • 1. As a parameter, it has to be carried from the Python layer into the functor (it affects the choice of allocator);
  • 2. Whether the memory was actually pinned also seems to need to be recorded at the EagerBlobObject/Tensor level.

I'm not sure whether StreamRole::kComputeOnPinMemory can cover both of these 😂

lixinqi commented Jun 24, 2022


For requirement 1: override XXXOp::InferDeviceAndStream, similar to CopyOp::InferDeviceAndStream.
For requirement 2: EagerBlobObject::producer_stream() can tell which stream the memory was allocated on, so it can also tell whether the EagerBlobObject is pin_memory.

Both of these mechanisms already exist.
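The proposal above can be sketched as: let the stream role, rather than a flag threaded through every layer, determine the allocator, and recover "is this pinned?" from the producer stream. A toy Python sketch (apart from StreamRole::kComputeOnPinMemory and producer_stream(), which the comment names, every class and method here is a hypothetical simplification, not OneFlow's code):

```python
from enum import Enum, auto


class StreamRole(Enum):
    kCompute = auto()
    kComputeOnPinMemory = auto()  # the proposed special role


class Stream:
    def __init__(self, role):
        self.role = role

    def allocator(self):
        # The stream's role alone decides which allocator is used, so no
        # pin_memory flag has to travel from the functor through the interpreter.
        if self.role is StreamRole.kComputeOnPinMemory:
            return "pinned_host_allocator"
        return "default_allocator"


class EagerBlobObject:
    def __init__(self, producer_stream):
        self._producer_stream = producer_stream

    def producer_stream(self):
        return self._producer_stream

    def is_pinned(self):
        # Requirement 2: derive "was this pinned?" from the producer stream
        # instead of storing a separate pin_memory_ field on the object.
        return self.producer_stream().role is StreamRole.kComputeOnPinMemory
```

Under this design, requirement 1 is handled at stream/device inference time (the InferDeviceAndStream override picks the pinned stream), and requirement 2 needs no extra state at all.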
