
Dev support tensor pin memory #8073

Merged: 70 commits into master on May 9, 2022
Conversation


@Flowingsun007 Flowingsun007 commented Apr 21, 2022

Related issue: https://github.com/Oneflow-Inc/OneTeam/issues/1180

  • Support Tensor.pin_memory() in eager mode
  • Support the pin_memory argument for flow.empty in eager mode
  • Add test cases and API docs
  • Make TensorSetItem support scalar tensor + ellipsis indexing
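The semantics these bullets describe can be sketched as a toy model (plain Python, for illustration only; `ToyTensor`, its `pinned` field, and this `empty` helper are hypothetical stand-ins, not OneFlow's implementation — only the "dense CPU tensors only" pinning rule is taken from the PR's code):

```python
class ToyTensor:
    """Toy stand-in for a tensor that records whether its buffer is pinned."""

    def __init__(self, shape, device="cpu", pinned=False):
        self.shape = shape
        self.device = device
        self.pinned = pinned

    def pin_memory(self):
        """Return a tensor backed by page-locked (pinned) host memory.

        Mirrors the PR's constraint: only dense CPU tensors can be pinned.
        """
        if self.device != "cpu":
            raise RuntimeError(
                f"cannot pin tensor with device: {self.device}, "
                "only dense CPU tensors can be pinned."
            )
        if self.pinned:  # already pinned: nothing to do
            return self
        return ToyTensor(self.shape, self.device, pinned=True)


def empty(*shape, device="cpu", pin_memory=False):
    """Toy flow.empty: the pin_memory flag selects the allocator up front."""
    if pin_memory and device != "cpu":
        raise RuntimeError("pin_memory is only supported for CPU tensors")
    return ToyTensor(shape, device, pinned=pin_memory)
```

With this model, `empty(2, 3, pin_memory=True)` yields an already-pinned tensor, while `ToyTensor((4,)).pin_memory()` pins after the fact — the two entry points the bullets add.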


@zhongshsh zhongshsh self-requested a review April 21, 2022 08:38
Review threads marked resolved (outdated code) in:

  • oneflow/api/python/functional/tensor_api.cpp (5 threads)
  • oneflow/core/functional/impl/array_functor.cpp
  • oneflow/core/framework/tensor_impl.cpp
  • oneflow/core/eager/eager_blob_object.h
  • oneflow/core/device/device_context.h
@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 13:44
github-actions bot commented May 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.9ms (= 12890.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 148.6ms (= 14857.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.15 (= 148.6ms / 128.9ms)

OneFlow resnet50 time: 78.7ms (= 7869.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.0ms (= 8498.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 85.0ms / 78.7ms)

OneFlow resnet50 time: 55.5ms (= 11090.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.3ms (= 12065.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 60.3ms / 55.5ms)

OneFlow resnet50 time: 42.5ms (= 8491.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 41.9ms (= 8372.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 0.99 (= 41.9ms / 42.5ms)

OneFlow resnet50 time: 35.1ms (= 7025.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 41.1ms (= 8212.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.17 (= 41.1ms / 35.1ms)

OneFlow swin dataloader time: 0.259s (= 51.753s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.230s / 200, num_workers=1)
Relative speed: 0.584 (= 0.151s / 0.259s)

OneFlow swin dataloader time: 0.065s (= 13.049s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.285s / 200, num_workers=4)
Relative speed: 0.635 (= 0.041s / 0.065s)

OneFlow swin dataloader time: 0.037s (= 7.460s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.580s / 200, num_workers=8)
Relative speed: 0.614 (= 0.023s / 0.037s)

❌ OneFlow resnet50 time: 146.8ms (= 14678.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.3ms (= 16529.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 165.3ms / 146.8ms)

OneFlow resnet50 time: 96.7ms (= 9674.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 122.7ms (= 12271.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 122.7ms / 96.7ms)

OneFlow resnet50 time: 74.3ms (= 14851.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 85.4ms (= 17087.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.15 (= 85.4ms / 74.3ms)

OneFlow resnet50 time: 63.0ms (= 12599.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15029.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 75.1ms / 63.0ms)

OneFlow resnet50 time: 55.7ms (= 11141.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.5ms (= 14909.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 74.5ms / 55.7ms)

@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 15:03
github-actions bot commented May 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8073/

@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 8, 2022 22:19
github-actions bot commented May 9, 2022

Speed stats:

github-actions bot commented May 9, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8073/

github-actions bot commented May 9, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.2ms (= 12918.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.5ms (= 14054.3ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 140.5ms / 129.2ms)

OneFlow resnet50 time: 80.3ms (= 8029.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.3ms (= 8434.7ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 84.3ms / 80.3ms)

OneFlow resnet50 time: 52.4ms (= 10477.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.9ms (= 11579.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.11 (= 57.9ms / 52.4ms)

OneFlow resnet50 time: 41.5ms (= 8297.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.2ms (= 9642.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.16 (= 48.2ms / 41.5ms)

OneFlow resnet50 time: 36.3ms (= 7252.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 45.5ms (= 9095.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.25 (= 45.5ms / 36.3ms)

OneFlow swin dataloader time: 0.421s (= 84.248s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.165s / 200, num_workers=1)
Relative speed: 0.358 (= 0.151s / 0.421s)

OneFlow swin dataloader time: 0.066s (= 13.182s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.571s / 200, num_workers=4)
Relative speed: 0.650 (= 0.043s / 0.066s)

OneFlow swin dataloader time: 0.036s (= 7.197s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.426s / 200, num_workers=8)
Relative speed: 0.615 (= 0.022s / 0.036s)

❌ OneFlow resnet50 time: 145.6ms (= 14562.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.3ms (= 16731.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 167.3ms / 145.6ms)

OneFlow resnet50 time: 96.4ms (= 9642.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 110.4ms (= 11036.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 110.4ms / 96.4ms)

OneFlow resnet50 time: 76.1ms (= 15221.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 85.0ms (= 17008.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.12 (= 85.0ms / 76.1ms)

OneFlow resnet50 time: 64.4ms (= 12870.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.1ms (= 14810.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.15 (= 74.1ms / 64.4ms)

OneFlow resnet50 time: 56.8ms (= 11358.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.1ms (= 15018.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 75.1ms / 56.8ms)

github-actions bot commented May 9, 2022

CI failed when running job: cuda-benchmark. The automerge label has been removed from the PR.

@github-actions github-actions bot removed the automerge label May 9, 2022
@Flowingsun007 Flowingsun007 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 9, 2022 03:22
@Flowingsun007 Flowingsun007 merged commit b826253 into master May 9, 2022
@Flowingsun007 Flowingsun007 deleted the dev_support_tensor_pin_memory branch May 9, 2022 05:10
@Flowingsun007 Flowingsun007 mentioned this pull request Jun 20, 2022
3 tasks
lixinqi commented Jun 24, 2022

EagerBlobObject::pin_memory_ is extremely invasive: it shows up all the way from the Functor through the op_interpreter to EagerBlobObject, which leaves a lot of abrupt, out-of-place code.

Could we do this more simply? For example, introduce a special StreamRole::kComputeOnPinMemory, so that the allocator of such a stream is the pinned-memory allocator.

Comment on lines +54 to +71
vm::Allocator* allocator = nullptr;
if (pin_memory) {
  CHECK_EQ_OR_RETURN(device_ctx->device_type(), DeviceType::kCPU)
      << Error::RuntimeError() << "cannot pin tensor with device: " << device_ctx->device_type()
      << ", only dense CPU tensors can be pinned.";
  allocator = dynamic_cast<CpuDeviceCtx*>(device_ctx)->mut_pin_memory_allocator();
  if (allocator == nullptr) {
    // The pin-memory allocator may fail to be created, e.g. when there is no CUDA
    // library support and oneflow can only run in CPU-only mode.
    return Error::RuntimeError()
           << "create pin_memory allocator failed for some reason. mostly, this error has "
              "occurred because you are trying to use some CUDA functionality, but the CUDA "
              "library has not been loaded by the dynamic linker for some reason.";
  }
} else {
  allocator = device_ctx->mut_allocator();
}
CHECK_NOTNULL_OR_RETURN(allocator) << Error::RuntimeError() << "allocator created failed!";
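The branch above boils down to allocator selection. A minimal Python paraphrase of that logic (function and parameter names here are hypothetical, chosen for illustration):

```python
def select_allocator(device_type, pin_memory, pinned_allocator, default_allocator):
    """Pick the allocator the way the C++ snippet above does.

    pinned_allocator may be None, e.g. when the build has no CUDA support,
    in which case asking for pinned memory is an error.
    """
    if not pin_memory:
        return default_allocator
    if device_type != "cpu":
        raise RuntimeError(
            f"cannot pin tensor with device: {device_type}, "
            "only dense CPU tensors can be pinned."
        )
    if pinned_allocator is None:
        raise RuntimeError(
            "create pin_memory allocator failed; most likely CUDA functionality "
            "was requested but the CUDA library could not be loaded."
        )
    return pinned_allocator
```

The reviewer's point is that this decision tree, and the pin_memory flag that feeds it, have to be threaded through every layer — which motivates the stream-role alternative discussed below.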

These special-case branches all look very out of place.

@Flowingsun007 (Contributor, Author) replied:

(Quoting @lixinqi) EagerBlobObject::pin_memory_ is extremely invasive: it shows up all the way from the Functor through the op_interpreter to EagerBlobObject, which leaves a lot of abrupt, out-of-place code.

But pin_memory support really does impose two requirements:

  • 1. As a parameter, it has to be carried from the Python layer into the functor (it affects the choice of allocator);
  • 2. Whether the memory was actually pinned also seems to need to be recorded at the EagerBlobObject/Tensor level.

I'm not sure whether StreamRole::kComputeOnPinMemory can cover both of these 😂

lixinqi commented Jun 24, 2022


For requirement 1: override XXXOp::InferDeviceAndStream, similar to CopyOp::InferDeviceAndStream.
For requirement 2: EagerBlobObject::producer_stream() can tell which stream the memory was allocated on, so it can also tell whether the EagerBlobObject is pin_memory.

Both of these mechanisms already exist.
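The proposal above can be sketched as: let the stream role, rather than a flag threaded through every layer, determine the allocator, and recover "is this pinned?" from the producer stream. A toy Python sketch (apart from StreamRole::kComputeOnPinMemory and producer_stream(), which the comment names, every class and method here is a hypothetical simplification, not OneFlow's code):

```python
from enum import Enum, auto


class StreamRole(Enum):
    kCompute = auto()
    kComputeOnPinMemory = auto()  # the proposed special role


class Stream:
    def __init__(self, role):
        self.role = role

    def allocator(self):
        # The stream's role alone decides which allocator is used, so no
        # pin_memory flag has to travel from the functor through the interpreter.
        if self.role is StreamRole.kComputeOnPinMemory:
            return "pinned_host_allocator"
        return "default_allocator"


class EagerBlobObject:
    def __init__(self, producer_stream):
        self._producer_stream = producer_stream

    def producer_stream(self):
        return self._producer_stream

    def is_pinned(self):
        # Requirement 2: derive "was this pinned?" from the producer stream
        # instead of storing a separate pin_memory_ field on the object.
        return self.producer_stream().role is StreamRole.kComputeOnPinMemory
```

Under this design, requirement 1 is handled at stream/device inference time (the InferDeviceAndStream override picks the pinned stream), and requirement 2 needs no extra state at all.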
