Support Free EagerTensor caught in nn.Graph build #5777

Merged (24 commits) on Aug 12, 2021


@chengtbf chengtbf commented Aug 6, 2021

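Dynamic-to-static conversion: nn.Graph now supports converting free variables (free EagerTensors) captured from an nn.Module during graph build into non-trainable Variable ops.

A typical case is a tensor created temporarily inside an nn.Module's forward function.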

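  • MultiClientSessionContext supports saving, loading, and removing the free EagerTensors captured during graph build.
  • Before compilation, NNGraph reads all of its free EagerTensors from MultiClientSessionContext and records them.
  • When LazyInterpret encounters a free tensor for the first time, it creates a non-trainable VariableOp, which is used to bind the EagerTensor's memory to the Variable once the runtime has started.
  • In the tensor API, a Tensor created with flow.Tensor or flow.tensor should be an EagerTensor, so functional::Empty must be protected by LazyMode::Guard(false) so that the Empty functor runs in eager mode.
  • Added a test script: the module's forward calls an expression such as x > 0.5, and the test checks the correctness of the output and of the Plan/Job graph construction.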
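The "first encounter" rule in the list above can be pictured with a small, framework-free sketch. This is a toy model and not OneFlow code: `ToyLazyInterpreter`, `lbn_for`, and the `FreeEagerTensor-N` naming are hypothetical names chosen for illustration. It only shows the idea that a free tensor gets one non-trainable variable op the first time it is seen, and that later uses reuse the same op.

```python
class ToyLazyInterpreter:
    """Toy stand-in for a build-time interpreter (hypothetical, not OneFlow)."""

    def __init__(self):
        self.ops = []                # ops appended to the graph, in order
        self._var_by_tensor_id = {}  # id(free tensor) -> variable op name
        self.captured = []           # keeps tensors alive so their ids stay unique

    def lbn_for(self, tensor):
        """Return the variable op name bound to this free tensor."""
        key = id(tensor)
        if key not in self._var_by_tensor_id:
            # First encounter: create exactly one non-trainable variable op.
            name = f"FreeEagerTensor-{len(self._var_by_tensor_id)}"
            self.ops.append({"name": name, "type": "variable", "trainable": False})
            self._var_by_tensor_id[key] = name
            self.captured.append(tensor)
        return self._var_by_tensor_id[key]


interp = ToyLazyInterpreter()
threshold = [0.5]                      # plain list standing in for an eager tensor
first = interp.lbn_for(threshold)      # creates the variable op
second = interp.lbn_for(threshold)     # reuses it
other = interp.lbn_for([0.5])          # a different object gets its own op
```

Keying on object identity rather than on value is an assumption of this sketch: two tensors with equal contents are still treated as two separate free tensors.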
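  1. Why save free tensors in MultiClientSessionContext?

The arguments passed to RunLazyJobInstruction include the NNGraph (as a shared_ptr), and the VM's corresponding LazyJobComputeStream keeps the last NNGraph so that it can handle pipelining of the same graph and mutual exclusion between different graphs at run time. This imposes a constraint: an NNGraph object must not hold any Tensor object (shared_ptr), because doing so would extend that tensor's lifetime until very late. So for free EagerTensors encountered during graph build, I chose to save them in the SessionContext and release the pointers to these tensors when the NNGraph is destructed, which keeps each tensor's lifetime in sync with the graph.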
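The ownership arrangement described above can be sketched in plain Python, using weak references to observe when a tensor is actually released. Everything here is hypothetical and for illustration only: `ToySessionContext`, `ToyGraph`, `FakeTensor`, and the explicit `close()` method (which stands in for the graph's destructor) are not part of OneFlow.

```python
import gc
import weakref


class FakeTensor:
    """Placeholder for an eager tensor; plain classes support weak references."""


class ToySessionContext:
    """Owns the free tensors, keyed by graph name (hypothetical model)."""

    def __init__(self):
        self._free_tensors = {}  # graph name -> list of tensors

    def store(self, graph_name, tensor):
        self._free_tensors.setdefault(graph_name, []).append(tensor)

    def load(self, graph_name):
        return list(self._free_tensors.get(graph_name, []))

    def remove(self, graph_name):
        self._free_tensors.pop(graph_name, None)


class ToyGraph:
    """Holds only its name and the session, never the tensors themselves."""

    def __init__(self, name, session):
        self.name = name
        self._session = session

    def close(self):
        # Stand-in for the destructor: drop the session's references.
        self._session.remove(self.name)


session = ToySessionContext()
graph = ToyGraph("g0", session)

tensor = FakeTensor()
probe = weakref.ref(tensor)          # lets us observe the tensor's lifetime
session.store(graph.name, tensor)
del tensor                           # the session is now the only owner

alive_while_graph_exists = probe() is not None
graph.close()
gc.collect()
released_after_close = probe() is None
```

Because the graph object never stores a tensor, anything that keeps the graph alive for longer (as the compute stream does with the last graph in the description above) would not keep the tensors alive with it in this model.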

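  2. Why add LazyMode::Guard(false) to flow.Tensor?

Code such as x > 0 written in a module's forward automatically generates flow.Tensor([0]). The tensor built at that point should be an EagerTensor (that is, a free tensor created during forward), so LazyMode needs to be temporarily disabled inside the flow.Tensor function.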
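A scoped guard of this kind can be sketched with a Python context manager. This is a toy model under stated assumptions, not the OneFlow implementation: `lazy_mode_guard`, `is_lazy_mode_enabled`, and `make_tensor` are hypothetical names, and the "tensor" is just a dict that records which mode it was created in.

```python
from contextlib import contextmanager

# Hypothetical process-wide flag; a dict avoids rebinding a module global.
_state = {"lazy": False}


def is_lazy_mode_enabled():
    return _state["lazy"]


@contextmanager
def lazy_mode_guard(enabled):
    """Set the mode for the duration of the block, then restore the old value."""
    previous = _state["lazy"]
    _state["lazy"] = enabled
    try:
        yield
    finally:
        _state["lazy"] = previous


def make_tensor(values):
    """Toy tensor constructor that always runs in eager mode."""
    with lazy_mode_guard(False):
        kind = "lazy" if is_lazy_mode_enabled() else "eager"
        return {"kind": kind, "values": list(values)}


with lazy_mode_guard(True):            # pretend we are inside graph build
    zero = make_tensor([0])            # still created eagerly
    restored = is_lazy_mode_enabled()  # lazy mode is back on after the call
```

Restoring the previous value in `finally`, rather than unconditionally switching lazy mode back on, keeps the guard correct when it is nested or when the constructor raises.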

@chengtbf chengtbf added automerge and removed WIP work in progress labels Aug 8, 2021
@chengtbf chengtbf marked this pull request as ready for review August 8, 2021 08:10
@chengtbf chengtbf requested a review from oneflow-ci-bot August 8, 2021 08:24
@github-actions

github-actions bot commented Aug 8, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 139.5ms (= 6974.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 125.9ms (= 6292.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.11 (= 139.5ms / 125.9ms)

PyTorch resnet50 time: 81.9ms (= 4093.1ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 72.8ms (= 3637.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 81.9ms / 72.8ms)

PyTorch resnet50 time: 57.5ms (= 2875.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.4ms (= 2371.9ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.21 (= 57.5ms / 47.4ms)

PyTorch resnet50 time: 46.2ms (= 2312.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 43.1ms (= 2156.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.07 (= 46.2ms / 43.1ms)

PyTorch resnet50 time: 40.8ms (= 2040.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 40.3ms (= 2015.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.01 (= 40.8ms / 40.3ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 8, 2021 10:03
oneflow/api/python/framework/tensor.cpp Outdated
oneflow/core/job/job_build_and_infer_ctx.cpp Outdated
oneflow/core/framework/nn_graph.cpp
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 139.3ms (= 6966.6ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.3ms (= 6415.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.09 (= 139.3ms / 128.3ms)

PyTorch resnet50 time: 84.7ms (= 4235.9ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.3ms (= 3716.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.14 (= 84.7ms / 74.3ms)

PyTorch resnet50 time: 58.9ms (= 2944.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 49.4ms (= 2469.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.19 (= 58.9ms / 49.4ms)

PyTorch resnet50 time: 49.1ms (= 2454.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 40.1ms (= 2003.8ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 49.1ms / 40.1ms)

PyTorch resnet50 time: 45.0ms (= 2249.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 45.0ms (= 2249.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.00 (= 45.0ms / 45.0ms)

@oneflow-ci-bot oneflow-ci-bot merged commit c4c3675 into master Aug 12, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_free_eager_tensor branch August 12, 2021 00:25
@strint strint mentioned this pull request Apr 10, 2022