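This test is based on the MXNet implementation of ResNet50 v1.5 in the NVIDIA/DeepLearningExamples repository. Using NVIDIA's official MXNet 20.03 NGC image and containers derived from it, we reproduce results and measure training speed on a single node with one GPU and with multiple GPUs, and use Horovod for multi-node training (2 and 4 nodes). The resulting throughput and speedup are used to assess how well the framework scales out in distributed multi-node training.
The test currently covers FP32 and FP16 mixed precision. It will continue to be maintained and extended with more benchmark configurations.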
- GPU: 8x Tesla V100-SXM2-16GB
- Driver: NVIDIA 440.33.01
- OS: Ubuntu 16.04
- CUDA: 10.2
- cuDNN: 7.6.5

- OS: Ubuntu 18.04
- CUDA: 10.2.89
- cuDNN: 7.6.5
- NCCL: 2.5.6
- MXNet: 1.6.0
- OpenMPI: 3.1.4
- DALI: 0.19.0
- Horovod: 0.19.0
- Python: 3.6.9

For more details on the container, see the NVIDIA Container Support Matrix.
Feature | ResNet50 v1.5 MXNet |
---|---|
Horovod/MPI Multi-GPU | Yes |
Horovod/MPI Multi-Node | Yes |
NVIDIA DALI | Yes |
Automatic mixed precision (AMP) | Yes |
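- Prepare the dataset following NVIDIA's official MXNet dataset preparation instructions.
- Following NVIDIA's official Quick Start Guide, download the source code, pull the image (this test uses NGC 20.03), build the container, and enter the container environment.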
git clone https://github.com/NVIDIA/DeepLearningExamples.git
# git checkout must run inside the cloned repository
cd DeepLearningExamples
git checkout e470c2150abf4179f873cabad23945bbc920cc5f
cd MxNet/Classification/RN50v1.5/
# Build the project image (run in DeepLearningExamples/MxNet/Classification/RN50v1.5/)
docker build . -t nvidia_rn50_mx:20.03
# Start the container
docker run -it \
  --shm-size=16g --ulimit memlock=-1 --privileged --net=host \
  --name mxnet_dlperf \
  -v /home/leinao/DLPerf/dataset/mxnet:/data/imagenet/train-val-recordio-passthrough \
  -v /home/leinao/DLPerf/:/DLPerf/ \
  nvidia_rn50_mx:20.03
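- No extra configuration is needed for single-node tests. For multi-node tests (2 nodes, 4 nodes), passwordless SSH login must be set up between the docker containers so that MXNet's MPI distributed script, launched on one node, can reach the other nodes.
Install the SSH server
# Run inside the container
apt-get update
apt-get install openssh-server
Set up passwordless login
- Have the nodes authorize each other's /root/.ssh/id_rsa.pub by adding it to /root/.ssh/authorized_keys;
- Change the port that sshd uses for communication between the docker containers: open the config file with vim /etc/ssh/sshd_config and edit Port;
- Restart the SSH service with service ssh restart.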
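Before launching a multi-node run, it can help to confirm that every node accepts a key-based login on the port configured above. The following is a minimal sketch, not part of the DLPerf scripts: the function names are hypothetical, the port value is whatever was set in sshd_config, and the root user is assumed from the /root/.ssh paths used above.

```python
import subprocess

def ssh_check_command(host, port):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh", "-p", str(port),
        "-o", "BatchMode=yes",      # never fall back to interactive prompts
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable nodes
        f"root@{host}", "true",     # run a no-op command on the remote side
    ]

def check_nodes(hosts, port, runner=subprocess.run):
    """Return {host: True/False} depending on whether the login succeeded."""
    results = {}
    for host in hosts:
        proc = runner(
            ssh_check_command(host, port),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[host] = proc.returncode == 0
    return results
```

Calling check_nodes with the list of node addresses and the chosen port from inside one container should report True for every other node once the keys and the Port setting are in place.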
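Note: the latest MXNet script in the NVIDIA DeepLearningExamples repository still uses the 19.07 image:
FROM nvcr.io/nvidia/mxnet:19.07-py3
To keep the environment, driver, and third-party dependency versions consistent, the tests in the DLPerf repository all use the 20.03 image. The mxnet-20.03 image ships DALI 0.19.0, which matches CUDA 10.2, while the scripts in the container are still the ones written for the 19.07 image, so running dali-cpu or dali-gpu directly raises an error. The following changes are needed:
- At Line 34, for the --dali-fuse-decoder option, change default=1 to default=0.
- In dali.py, replace nvJPEGDecoder with ImageDecoder. For details, see NVIDIA/DALI#906.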
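The rename in dali.py can also be applied programmatically. The sketch below only covers the nvJPEGDecoder to ImageDecoder replacement; the function name patch_dali_source and the sample line are illustrative assumptions, not taken from the actual dali.py.

```python
import re

def patch_dali_source(source: str) -> str:
    """Replace every whole-word occurrence of nvJPEGDecoder with ImageDecoder."""
    return re.sub(r"\bnvJPEGDecoder\b", "ImageDecoder", source)

# Illustrative input line (assumed shape, not copied from dali.py)
sample = "self.decode = ops.nvJPEGDecoder(device='mixed', output_type=types.RGB)"
print(patch_dali_source(sample))
```

To patch the real file, read it with pathlib.Path("dali.py").read_text(), pass the text through patch_dali_source, and write it back with write_text. The default value of --dali-fuse-decoder still has to be edited by hand.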
The test cluster has 4 nodes:
- NODE1=10.11.0.2
- NODE2=10.11.0.3
- NODE3=10.11.0.4
- NODE4=10.11.0.5
Each node has 8 V100 GPUs, each with 16 GB of memory.
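For multi-node launches, Horovod's horovodrun accepts hosts in the form host:slots through its -H option. The sketch below builds that string and the total process count from the node list above. Whether run_test.sh assembles its launch command this way is an assumption; the helper name host_spec is hypothetical.

```python
NODES = ["10.11.0.2", "10.11.0.3", "10.11.0.4", "10.11.0.5"]
GPUS_PER_NODE = 8

def host_spec(nodes, gpus_per_node, num_nodes):
    """Return (total_processes, 'host:slots,...') for the first num_nodes nodes."""
    if not 1 <= num_nodes <= len(nodes):
        raise ValueError("num_nodes must be between 1 and the number of nodes")
    selected = nodes[:num_nodes]
    hosts = ",".join(f"{host}:{gpus_per_node}" for host in selected)
    return len(selected) * gpus_per_node, hosts

for n in (1, 2, 4):
    total, hosts = host_spec(NODES, GPUS_PER_NODE, n)
    print(f"{n} node(s): -np {total} -H {hosts}")
```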
Download the source code of this repository inside the container:
git clone https://github.com/Oneflow-Inc/DLPerf.git
Place the source code under DLPerf/NVIDIADeepLearningExamples/MxNet/Classification/RN50v1.5/ in /workspace/rn50, then run the script
bash run_test.sh
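This runs the tests for 1 node with 1 GPU, 1 node with 8 GPUs, 2 nodes with 16 GPUs, and 4 nodes with 32 GPUs, with batch_size_per_device = 128.
By default the test uses FP32 with batch size 128. Another batch size, such as 64, can be specified: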
bash run_test.sh 64
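To test FP16 mixed precision, change the DTYPE parameter in run_test.sh to "fp16", or pass it as an argument when running the script, for example: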
bash run_test.sh 256 fp16
This tests FP16 mixed precision with batch size 256.
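Each configuration is trained several times (7 runs in this test). Each run covers only the first 120 iterations of the first epoch. When computing training speed, the first 20 iterations are discarded and only the last 100 are used, to reduce jitter. The median of the 7 runs' speeds is taken as the final speed, and the speedup is computed from that value.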
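The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the extraction scripts: the function names are hypothetical, and the seven values are the 1n1g FP32 run speeds that appear in this README's sample output.

```python
from statistics import mean, median

WARMUP_ITERS = 20     # iterations discarded at the start of each run
MEASURED_ITERS = 100  # iterations that count toward the run's speed

def measured_window(per_iter_values):
    """Return the 100 values that follow the 20 warm-up iterations."""
    window = per_iter_values[WARMUP_ITERS:WARMUP_ITERS + MEASURED_ITERS]
    if len(window) != MEASURED_ITERS:
        raise ValueError("expected at least 120 iterations per run")
    return window

def summarize_runs(run_speeds):
    """Aggregate one speed per run into the mean and the median."""
    speeds = list(run_speeds)
    return {
        "average_speed": round(mean(speeds), 2),
        "median_speed": round(median(speeds), 2),
    }

# 1n1g FP32 run speeds (samples/s), runs 1 to 7
runs_1n1g = [392.24, 390.09, 392.63, 393.53, 391.58, 392.85, 391.77]
print(summarize_runs(runs_1n1g))
```

With an odd number of runs the median is always one of the measured values, so a single slow or fast run does not shift the reported speed.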
Run the following command to process the test logs for the different configurations:
python extract_mxnet_logs.py --log_dir=logs/ngc/mxnet/resnet50/bz128 --batch_size_per_device=128
The output looks like this:
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_1.log {1: 11434.12}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_4.log {1: 11434.12, 4: 11305.35}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_7.log {1: 11434.12, 4: 11305.35, 7: 11461.68}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_2.log {1: 11434.12, 4: 11305.35, 7: 11461.68, 2: 11331.93}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_3.log {1: 11434.12, 4: 11305.35, 7: 11461.68, 2: 11331.93, 3: 11429.36}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_6.log {1: 11434.12, 4: 11305.35, 7: 11461.68, 2: 11331.93, 3: 11429.36, 6: 11313.33}
logs/ngc/mxnet/resnet50/bz128/4n8g/r50_b128_fp32_5.log {1: 11434.12, 4: 11305.35, 7: 11461.68, 2: 11331.93, 3: 11429.36, 6: 11313.33, 5: 11283.49}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_1.log {1: 3008.52}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_4.log {1: 3008.52, 4: 3009.46}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_7.log {1: 3008.52, 4: 3009.46, 7: 2999.97}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_2.log {1: 3008.52, 4: 3009.46, 7: 2999.97, 2: 3001.01}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_3.log {1: 3008.52, 4: 3009.46, 7: 2999.97, 2: 3001.01, 3: 2993.87}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_6.log {1: 3008.52, 4: 3009.46, 7: 2999.97, 2: 3001.01, 3: 2993.87, 6: 3008.01}
logs/ngc/mxnet/resnet50/bz128/1n8g/r50_b128_fp32_5.log {1: 3008.52, 4: 3009.46, 7: 2999.97, 2: 3001.01, 3: 2993.87, 6: 3008.01, 5: 3006.98}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_1.log {1: 1520.55}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_4.log {1: 1520.55, 4: 1518.04}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_7.log {1: 1520.55, 4: 1518.04, 7: 1517.28}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_2.log {1: 1520.55, 4: 1518.04, 7: 1517.28, 2: 1521.26}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_3.log {1: 1520.55, 4: 1518.04, 7: 1517.28, 2: 1521.26, 3: 1522.3}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_6.log {1: 1520.55, 4: 1518.04, 7: 1517.28, 2: 1521.26, 3: 1522.3, 6: 1517.98}
logs/ngc/mxnet/resnet50/bz128/1n4g/r50_b128_fp32_5.log {1: 1520.55, 4: 1518.04, 7: 1517.28, 2: 1521.26, 3: 1522.3, 6: 1517.98, 5: 1516.09}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_1.log {1: 392.24}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_4.log {1: 392.24, 4: 393.53}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_7.log {1: 392.24, 4: 393.53, 7: 391.77}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_2.log {1: 392.24, 4: 393.53, 7: 391.77, 2: 390.09}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_3.log {1: 392.24, 4: 393.53, 7: 391.77, 2: 390.09, 3: 392.63}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_6.log {1: 392.24, 4: 393.53, 7: 391.77, 2: 390.09, 3: 392.63, 6: 392.85}
logs/ngc/mxnet/resnet50/bz128/1n1g/r50_b128_fp32_5.log {1: 392.24, 4: 393.53, 7: 391.77, 2: 390.09, 3: 392.63, 6: 392.85, 5: 391.58}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_1.log {1: 767.39}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_4.log {1: 767.39, 4: 764.6}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_7.log {1: 767.39, 4: 764.6, 7: 761.98}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_2.log {1: 767.39, 4: 764.6, 7: 761.98, 2: 765.98}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_3.log {1: 767.39, 4: 764.6, 7: 761.98, 2: 765.98, 3: 763.76}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_6.log {1: 767.39, 4: 764.6, 7: 761.98, 2: 765.98, 3: 763.76, 6: 767.85}
logs/ngc/mxnet/resnet50/bz128/1n2g/r50_b128_fp32_5.log {1: 767.39, 4: 764.6, 7: 761.98, 2: 765.98, 3: 763.76, 6: 767.85, 5: 761.6}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_1.log {1: 5758.49}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_4.log {1: 5758.49, 4: 5755.92}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_7.log {1: 5758.49, 4: 5755.92, 7: 5803.52}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_2.log {1: 5758.49, 4: 5755.92, 7: 5803.52, 2: 5685.5}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_3.log {1: 5758.49, 4: 5755.92, 7: 5803.52, 2: 5685.5, 3: 5717.04}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_6.log {1: 5758.49, 4: 5755.92, 7: 5803.52, 2: 5685.5, 3: 5717.04, 6: 5765.72}
logs/ngc/mxnet/resnet50/bz128/2n8g/r50_b128_fp32_5.log {1: 5758.49, 4: 5755.92, 7: 5803.52, 2: 5685.5, 3: 5717.04, 6: 5765.72, 5: 5767.21}
{'r50': {'1n1g': {'average_speed': 392.1,
'batch_size_per_device': 128,
'median_speed': 392.24,
'speedup': 1.0},
'1n2g': {'average_speed': 764.74,
'batch_size_per_device': 128,
'median_speed': 764.6,
'speedup': 1.95},
'1n4g': {'average_speed': 1519.07,
'batch_size_per_device': 128,
'median_speed': 1518.04,
'speedup': 3.87},
'1n8g': {'average_speed': 3003.97,
'batch_size_per_device': 128,
'median_speed': 3006.98,
'speedup': 7.67},
'2n8g': {'average_speed': 5750.49,
'batch_size_per_device': 128,
'median_speed': 5758.49,
'speedup': 14.68},
'4n8g': {'average_speed': 11365.61,
'batch_size_per_device': 128,
'median_speed': 11331.93,
'speedup': 28.89}}}
Saving result to ./result/bz128_result.json
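The printed dictionary can be turned into table rows directly. The sketch below assumes that the saved JSON has the same structure as the dictionary printed above, and that a key such as 4n8g means 4 nodes with 8 GPUs each. The function name table_rows is hypothetical.

```python
import re

def table_rows(result, model="r50"):
    """Format 'node_num | gpu_num | samples/s | speedup |' rows from a result dict."""
    rows = []
    for key, stats in result[model].items():
        match = re.fullmatch(r"(\d+)n(\d+)g", key)
        if match is None:
            raise ValueError(f"unexpected configuration key: {key}")
        nodes, gpus_per_node = int(match.group(1)), int(match.group(2))
        rows.append((nodes, nodes * gpus_per_node,
                     stats["median_speed"], stats["speedup"]))
    rows.sort(key=lambda row: row[1])  # order by total GPU count
    return [f"{n} | {g} | {speed} | {speedup:.2f} |" for n, g, speed, speedup in rows]

# Two entries copied from the output above
result = {"r50": {"2n8g": {"median_speed": 5758.49, "speedup": 14.68},
                  "1n1g": {"median_speed": 392.24, "speedup": 1.0}}}
for line in table_rows(result):
    print(line)
```

For the full file, load ./result/bz128_result.json with json.load and pass the resulting dictionary to table_rows.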
- extract_mxnet_logs.py
- extract_mxnet_logs_time.py
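The two scripts differ slightly, so their results differ slightly as well:
extract_mxnet_logs.py uses the speed values that the official script prints in the log: of the 120 iterations, it drops the first 20 and averages the speed over the last 100;
extract_mxnet_logs_time.py computes the speed from the batch size and the actual elapsed time of the last 100 of the 120 iterations, again dropping the first 20.
This README shows the results computed by extract_mxnet_logs_time.py.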
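The difference between the two approaches can be illustrated with synthetic data. This is a sketch of the two formulas only: how the real scripts parse the logs is not shown here, and the function names, the timestamp list, and the use of the total samples per iteration as batch_size are assumptions.

```python
def speed_from_logged_values(logged_speeds, warmup=20, measured=100):
    """Arithmetic mean of the per-iteration speeds printed in the log."""
    window = logged_speeds[warmup:warmup + measured]
    return sum(window) / len(window)

def speed_from_elapsed_time(finish_times, batch_size, warmup=20, measured=100):
    """Samples processed in the measured window divided by its wall-clock time.

    finish_times[i] is the time in seconds at which iteration i finished.
    """
    start = finish_times[warmup - 1]            # end of the last warm-up iteration
    end = finish_times[warmup + measured - 1]   # end of the last measured iteration
    return batch_size * measured / (end - start)

# Synthetic run: iterations alternate between 0.25 s and 0.5 s, 128 samples each
durations = [0.25 if i % 2 == 0 else 0.5 for i in range(120)]
finish_times, elapsed = [], 0.0
for d in durations:
    elapsed += d
    finish_times.append(elapsed)
logged_speeds = [128 / d for d in durations]

print(speed_from_logged_values(logged_speeds))
print(speed_from_elapsed_time(finish_times, batch_size=128))
```

When iteration times vary, the mean of per-iteration speeds is at least as large as the time-based speed, which is one reason the two scripts can report slightly different numbers for the same log.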
- average_speed: mean speed
- median_speed: median speed
Each batch size is trained 7 times, which forms one group. For each group, average_speed is the mean speed and median_speed is the median speed.
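The speedup in the scripts and tables is computed relative to the median speed of the single-node, single-GPU run. For example:
if the speed is 200 samples/s with 1 GPU on one node, 400 with 2 GPUs, and 700 with 4 GPUs, the speedups are 1.0, 2.0, and 3.5 respectively.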
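The same calculation as a short sketch, with a hypothetical function name and the configuration keys used in the log directories. The first call uses the example numbers above and the second uses median speeds from the FP32 output in this README.

```python
def speedups(median_speeds, baseline_key="1n1g"):
    """Divide every median speed by the single-node, single-GPU median speed."""
    base = median_speeds[baseline_key]
    return {key: round(speed / base, 2) for key, speed in median_speeds.items()}

# Example from the text
print(speedups({"1n1g": 200, "1n2g": 400, "1n4g": 700}))

# Median speeds from the FP32, batch size 128 output
print(speedups({"1n1g": 392.24, "1n8g": 3006.98, "4n8g": 11331.93}))
```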
This section provides the performance results and complete logs of the tests of the ResNet50 v1.5 model on NVIDIA's MXNet framework.
node_num | gpu_num | samples/s | speedup |
---|---|---|---|
1 | 1 | 392.24 | 1.00 |
1 | 2 | 764.6 | 1.95 |
1 | 4 | 1518.04 | 3.87 |
1 | 8 | 3006.98 | 7.67 |
2 | 16 | 5758.49 | 14.68 |
4 | 32 | 11331.93 | 28.89 |
node_num | gpu_num | samples/s | speedup |
---|---|---|---|
1 | 1 | 1281.83 | 1 |
1 | 4 | 4811.72 | 3.75 |
1 | 8 | 9241.68 | 7.21 |
2 | 16 | 13348.68 | 10.41 |
4 | 32 | 27558.28 | 21.50 |
"Without amp" means removing the --amp argument on line 42 of runner.sh. Once it is removed, dynamic loss scaling is no longer enabled.
node_num | gpu_num | samples/s | speedup |
---|---|---|---|
1 | 1 | 1393.87 | 1 |
1 | 4 | 5158.71 | 3.7 |
1 | 8 | 9621.31 | 6.9 |
2 | 16 | 16219.03 | 11.64 |
4 | 32 | 30713.68 | 22.03 |
For NVIDIA's official MXNet benchmark results, see ResNet50 v1.5 For MXNet results.
Detailed logs are available for download: