one_embedding amp default fp16 #8174

Merged 10 commits on May 13, 2022

@guo-ran commented May 9, 2022

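Add embedding_lookup_placeholder to the white list, so that under AMP a cast_f2h is inserted on the variable input shadow, and this op infers its output data_type from the data_type of shadow. Under AMP, the forward and backward inputs and outputs are all of type half.
The input of shadow may be either a variable or a cast.
In the replace_embedding_ops pass:
For the backward operation:
If ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16 is set to false, i.e. gradient shuffle does not compute in fp16, a cast h2f op is inserted first.
If ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16 is true but ONEFLOW_ONE_EMBEDDING_NOT_FUSE_CAST_TO_UPDATE is set, a cast h2f op is inserted before the update op.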
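
The backward-pass rules above can be read as a small decision table. The following Python sketch only illustrates that reading and is not code from this PR: `CastPlacement` and `backward_cast_placement` are hypothetical names, the two booleans stand in for the two environment variables, and the third outcome (both conditions unmet) is not covered by the description, so it is labelled as such rather than guessed.

```python
from enum import Enum


class CastPlacement(Enum):
    """Hypothetical labels for where the cast h2f op ends up in the backward graph."""
    BEFORE_GRADIENT_SHUFFLE = "cast h2f inserted first, before gradient shuffle"
    BEFORE_UPDATE = "cast h2f inserted before the update op"
    NOT_DESCRIBED = "case not covered by the PR description"


def backward_cast_placement(gradient_shuffle_use_fp16: bool,
                            not_fuse_cast_to_update: bool) -> CastPlacement:
    """Map the two flags to the cast placement described in the PR text.

    gradient_shuffle_use_fp16 mirrors ONEFLOW_ONE_EMBEDDING_GRADIENT_SHUFFLE_USE_FP16;
    not_fuse_cast_to_update mirrors ONEFLOW_ONE_EMBEDDING_NOT_FUSE_CAST_TO_UPDATE.
    """
    if not gradient_shuffle_use_fp16:
        # Gradient shuffle runs in fp32, so the half gradient is cast up first.
        return CastPlacement.BEFORE_GRADIENT_SHUFFLE
    if not_fuse_cast_to_update:
        # Gradient shuffle runs in fp16; the cast happens just before the update op.
        return CastPlacement.BEFORE_UPDATE
    # Assumption: the description does not say what happens here.
    return CastPlacement.NOT_DESCRIBED


# Enumerate all four flag combinations.
for use_fp16 in (False, True):
    for not_fuse in (False, True):
        placement = backward_cast_placement(use_fp16, not_fuse)
        print(f"use_fp16={use_fp16}, not_fuse={not_fuse}: {placement.value}")
```

Note that when gradient shuffle does not use fp16, the second flag does not change the outcome in this reading, because the cast has already happened before the shuffle.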

@guo-ran requested a review from oneflow-ci-bot May 10, 2022 10:04
@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8174/

@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.3ms (= 12932.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.0ms (= 14297.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 143.0ms / 129.3ms)

OneFlow resnet50 time: 81.1ms (= 8108.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.8ms (= 8579.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 85.8ms / 81.1ms)

OneFlow resnet50 time: 53.2ms (= 10647.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.8ms (= 11963.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.12 (= 59.8ms / 53.2ms)

OneFlow resnet50 time: 41.4ms (= 8277.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.5ms (= 8892.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.07 (= 44.5ms / 41.4ms)

OneFlow resnet50 time: 39.1ms (= 7828.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.2ms (= 8036.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.03 (= 40.2ms / 39.1ms)

OneFlow swin dataloader time: 0.257s (= 51.363s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.345s / 200, num_workers=1)
Relative speed: 0.591 (= 0.152s / 0.257s)

OneFlow swin dataloader time: 0.067s (= 13.365s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.125s / 200, num_workers=4)
Relative speed: 0.608 (= 0.041s / 0.067s)

OneFlow swin dataloader time: 0.037s (= 7.384s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.341s / 200, num_workers=8)
Relative speed: 0.588 (= 0.022s / 0.037s)

❌ OneFlow resnet50 time: 145.4ms (= 14535.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.9ms (= 16886.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.9ms / 145.4ms)

OneFlow resnet50 time: 96.6ms (= 9660.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.5ms (= 11152.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 111.5ms / 96.6ms)

OneFlow resnet50 time: 71.5ms (= 14302.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.7ms (= 17739.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 88.7ms / 71.5ms)

OneFlow resnet50 time: 63.7ms (= 12738.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.0ms (= 14803.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.16 (= 74.0ms / 63.7ms)

OneFlow resnet50 time: 57.1ms (= 11416.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.8ms (= 13758.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 68.8ms / 57.1ms)
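
For readers checking the figures, the arithmetic in each entry can be reproduced as below. This is a minimal sketch, not the CI bot's code: the function names are hypothetical, and it assumes the ratio is taken on unrounded per-iteration times (the report itself prints the rounded values). The inputs are the first resnet50 entry above.

```python
def per_iteration_ms(total_ms: float, iterations: int) -> float:
    """Average time per iteration, as in '129.3ms (= 12932.7ms / 100)'."""
    return total_ms / iterations


def relative_speed(pytorch_ms: float, oneflow_ms: float) -> float:
    """PyTorch time divided by OneFlow time; values above 1 mean OneFlow is faster."""
    return pytorch_ms / oneflow_ms


# First entry of the report: input_shape=[16, 3, 224, 224], 100 iterations.
oneflow_ms = per_iteration_ms(12932.7, 100)
pytorch_ms = per_iteration_ms(14297.9, 100)
speed = relative_speed(pytorch_ms, oneflow_ms)

print(f"OneFlow: {oneflow_ms:.1f}ms, PyTorch: {pytorch_ms:.1f}ms, relative speed: {speed:.2f}")
```

The same ratio applies to the dataloader entries, where values below 1 (for example 0.591) indicate that the OneFlow dataloader took longer per iteration than the PyTorch one.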

@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.5ms (= 12950.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.0ms (= 14095.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 141.0ms / 129.5ms)

OneFlow resnet50 time: 80.2ms (= 8023.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.3ms (= 8529.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 85.3ms / 80.2ms)

OneFlow resnet50 time: 52.1ms (= 10413.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.6ms (= 11315.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 56.6ms / 52.1ms)

OneFlow resnet50 time: 42.4ms (= 8478.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 46.2ms (= 9242.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.09 (= 46.2ms / 42.4ms)

OneFlow resnet50 time: 38.7ms (= 7743.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 34.5ms (= 6902.8ms / 200, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.89 (= 34.5ms / 38.7ms)

OneFlow swin dataloader time: 0.256s (= 51.253s / 200, num_workers=1)
PyTorch swin dataloader time: 0.153s (= 30.636s / 200, num_workers=1)
Relative speed: 0.598 (= 0.153s / 0.256s)

OneFlow swin dataloader time: 0.069s (= 13.882s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.462s / 200, num_workers=4)
Relative speed: 0.610 (= 0.042s / 0.069s)

OneFlow swin dataloader time: 0.038s (= 7.574s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.534s / 200, num_workers=8)
Relative speed: 0.599 (= 0.023s / 0.038s)

❌ OneFlow resnet50 time: 145.0ms (= 14500.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 176.8ms (= 17678.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 176.8ms / 145.0ms)

OneFlow resnet50 time: 97.6ms (= 9756.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.2ms (= 11221.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 112.2ms / 97.6ms)

OneFlow resnet50 time: 75.6ms (= 15124.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.9ms (= 17579.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.16 (= 87.9ms / 75.6ms)

OneFlow resnet50 time: 64.3ms (= 12863.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.6ms (= 14929.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.16 (= 74.6ms / 64.3ms)

OneFlow resnet50 time: 56.5ms (= 11302.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.4ms (= 13682.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 68.4ms / 56.5ms)

@guo-ran removed the request for review from oneflow-ci-bot May 11, 2022 09:27
@guo-ran changed the title from "fix one_embedding amp different data_type bug" to "one_embedding amp default fp16" May 11, 2022
@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.2ms (= 12924.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.6ms (= 14362.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 143.6ms / 129.2ms)

OneFlow resnet50 time: 77.3ms (= 7726.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.2ms (= 8420.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.09 (= 84.2ms / 77.3ms)

OneFlow resnet50 time: 54.6ms (= 10914.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.4ms (= 11677.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.07 (= 58.4ms / 54.6ms)

OneFlow resnet50 time: 41.9ms (= 8382.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 50.0ms (= 10007.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.19 (= 50.0ms / 41.9ms)

OneFlow resnet50 time: 34.6ms (= 6920.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 44.1ms (= 8817.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.27 (= 44.1ms / 34.6ms)

OneFlow swin dataloader time: 0.237s (= 47.364s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.209s / 200, num_workers=1)
Relative speed: 0.638 (= 0.151s / 0.237s)

OneFlow swin dataloader time: 0.073s (= 14.632s / 200, num_workers=4)
PyTorch swin dataloader time: 0.044s (= 8.796s / 200, num_workers=4)
Relative speed: 0.601 (= 0.044s / 0.073s)

OneFlow swin dataloader time: 0.038s (= 7.589s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.487s / 200, num_workers=8)
Relative speed: 0.591 (= 0.022s / 0.038s)

❌ OneFlow resnet50 time: 145.7ms (= 14566.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.6ms (= 16856.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.6ms / 145.7ms)

OneFlow resnet50 time: 95.6ms (= 9556.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 110.5ms (= 11054.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 110.5ms / 95.6ms)

OneFlow resnet50 time: 74.8ms (= 14967.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 90.1ms (= 18022.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 90.1ms / 74.8ms)

OneFlow resnet50 time: 65.5ms (= 13103.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.8ms (= 14764.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.13 (= 73.8ms / 65.5ms)

OneFlow resnet50 time: 54.2ms (= 10842.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.7ms (= 14544.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 72.7ms / 54.2ms)

@github-actions

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions

CI failed when running job: cuda-benchmark. PR label automerge has been removed

@ShawnXuan enabled auto-merge (squash) May 13, 2022 08:57
@ShawnXuan merged commit 944ad62 into master May 13, 2022
@ShawnXuan deleted the dev_fix_one_embedding_data_type branch May 13, 2022 13:31