Training MobileNet with multiple GPU #61

Closed
CrazyAlan opened this issue Feb 23, 2018 · 19 comments

@CrazyAlan

When I use 4 GPUs to train the network, there seems to be no speed increase compared to a single GPU.

When training runs in 4-GPU mode, each GPU's "GPU-Util" drops to 20%~30%, while in single-GPU mode it stays above 60%. Do you have any idea what the problem is?

@nttstar
Collaborator

nttstar commented Feb 24, 2018

export MXNET_CPU_WORKER_NTHREADS=24
export MXNET_ENGINE_TYPE=ThreadedEnginePerDevice

Have you done this before training?
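For reference, roughly what these settings do (behavior can vary across MXNet versions):

# Use one scheduling thread pool per device so multiple GPUs can be driven independently.
export MXNET_ENGINE_TYPE=ThreadedEnginePerDevice
# Raise the number of CPU worker threads available for CPU-side operators.
export MXNET_CPU_WORKER_NTHREADS=24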

@CrazyAlan
Author

@nttstar Yes, attached is the log. When I use a single GPU, the speed is also around 800 samples/s.

gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 8631
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=7.531499892038736, ctx_num=4, data_dir='/tmp/datasets/faces_vgg_112x112', easy_margin=0, emb_size=128, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=0, lr=0.1, lr_steps='', margin=4, margin_m=0.35, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='m1', num_classes=8631, num_layers=1, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='/tmp/m1_I0-L0-OE-E128', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=0, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init mobilenet 1
(0, 'E', 3)
INFO:root:loading recordio /tmp/datasets/faces_vgg_112x112/train.rec...
header0 label [ 3137808. 3146439.]
id2range 8631
0 0
3137807
rand_mirror 1
(512,)
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [40000, 60000, 80000]
[21:45:27] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
optimizer_params=optimizer_params)
call reset()
[21:45:48] src/kvstore/././comm.h:653: only 4 out of 12 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[21:45:48] src/kvstore/././comm.h:662: .v..
[21:45:48] src/kvstore/././comm.h:662: v...
[21:45:48] src/kvstore/././comm.h:662: ...v
[21:45:48] src/kvstore/././comm.h:662: ..v.
INFO:root:Epoch[0] Batch [20] Speed: 101.17 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 432.72 samples/sec acc=0.000195
INFO:root:Epoch[0] Batch [60] Speed: 786.72 samples/sec acc=0.000391
INFO:root:Epoch[0] Batch [80] Speed: 848.32 samples/sec acc=0.000098
INFO:root:Epoch[0] Batch [100] Speed: 827.08 samples/sec acc=0.000586
INFO:root:Epoch[0] Batch [120] Speed: 834.74 samples/sec acc=0.000293
INFO:root:Epoch[0] Batch [140] Speed: 862.26 samples/sec acc=0.000293
INFO:root:Epoch[0] Batch [160] Speed: 885.90 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [180] Speed: 875.62 samples/sec acc=0.000488
INFO:root:Epoch[0] Batch [200] Speed: 879.54 samples/sec acc=0.000488
INFO:root:Epoch[0] Batch [220] Speed: 856.85 samples/sec acc=0.001172
INFO:root:Epoch[0] Batch [240] Speed: 880.83 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [260] Speed: 871.60 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [280] Speed: 872.43 samples/sec acc=0.001074
INFO:root:Epoch[0] Batch [300] Speed: 865.48 samples/sec acc=0.001270
INFO:root:Epoch[0] Batch [320] Speed: 874.28 samples/sec acc=0.001465
INFO:root:Epoch[0] Batch [340] Speed: 863.19 samples/sec acc=0.001465
INFO:root:Epoch[0] Batch [360] Speed: 858.20 samples/sec acc=0.001270

@CrazyAlan
Author

The system memory is 125 GB, and each GPU has 12 GB of memory.

@nttstar
Collaborator

nttstar commented Feb 24, 2018

Any problems if you use other DL frameworks?
Try changing kvstore = 'device' in train_softmax.py to kvstore = 'local', then tell me the result.
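
For anyone following along, a minimal self-contained sketch of where the kvstore argument is passed to an MXNet Module's fit call (a toy model on random data; the real script builds the face-recognition network and a RecordIO iterator, and the exact line in train_softmax.py may differ):

import mxnet as mx
import numpy as np

# Toy symbol and data, only to show where kvstore goes.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10)
sym = mx.sym.SoftmaxOutput(data=fc, name='softmax')

x = np.random.rand(64, 20).astype('float32')
y = np.random.randint(0, 10, size=(64,)).astype('float32')
train_iter = mx.io.NDArrayIter(x, y, batch_size=16)

# In practice the context would be [mx.gpu(i) for i in range(num_gpus)].
model = mx.mod.Module(symbol=sym, context=[mx.cpu()])
model.fit(train_iter,
          optimizer='sgd',
          optimizer_params={'learning_rate': 0.1},
          kvstore='local',   # 'device' aggregates gradients on the GPUs; 'local' uses the CPU
          num_epoch=1)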

@nttstar
Collaborator

nttstar commented Feb 24, 2018

@CrazyAlan I see gpu num: 4 in your log.

@CrazyAlan
Author

CrazyAlan commented Feb 24, 2018

Hi @nttstar, thanks a lot for the help! I tried changing to kvstore = 'local', but the performance is still the same.

What I found is that I am using MobileNet, which is not well implemented under MXNet (compared to TensorFlow), so most of the time the GPUs sit idle :(.
When I switch to ResNet, the throughput with 4 GPUs is 4x that of a single GPU, and GPU utilization is full.

Do you know of any high-efficiency depthwise convolution operator for MXNet? I think that might help with the speed problem. I am new to MXNet, but in Caffe I was using this (https://github.com/yonghenglh6/DepthwiseConvolution), and it seems to be much faster.
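
For reference, a depthwise convolution in MXNet is normally written as a grouped Convolution with num_group equal to the number of input channels, as in the sketch below (channel counts are illustrative; whether this maps to a fast kernel depends on the MXNet/cuDNN version, which is exactly the throughput issue discussed here):

import mxnet as mx

data = mx.sym.Variable('data')                        # expected input shape: (N, 32, H, W)
# Depthwise 3x3: groups == input channels == output channels.
dw = mx.sym.Convolution(data=data, num_filter=32, num_group=32,
                        kernel=(3, 3), stride=(1, 1), pad=(1, 1),
                        no_bias=True, name='conv_dw')
# Pointwise 1x1, as in the MobileNet depthwise-separable block.
pw = mx.sym.Convolution(data=dw, num_filter=64, kernel=(1, 1),
                        no_bias=True, name='conv_pw')

print(pw.infer_shape(data=(1, 32, 112, 112))[1])      # [(1, 64, 112, 112)]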

@nttstar
Collaborator

nttstar commented Feb 25, 2018

Ah, I hadn't realized this before. I'm not sure if there's a better implementation. I will update the symbol if there is. Thank you very much!

nttstar changed the title from "Training with multiple GPU" to "Training MobileNet with multiple GPU" on Feb 25, 2018
@nttstar
Collaborator

nttstar commented Feb 25, 2018

@CrazyAlan I just did a simple experiment on my test server (with Tesla M40 GPUs): about 700 samples/s using 4 GPUs versus only about 220 samples/s with a single GPU. So there's no problem in my test. Can you check your training parameters again?

@CrazyAlan
Author

@nttstar My training command is: python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0

I am using Tesla P100s; the speed with 4 GPUs or with 1 GPU is almost the same (~562.99 samples/sec). Is it possible that the file reading speed causes the problem? When I change per-batch-size to 256, the speed increases to ~700 samples/sec (not sure whether it would harm accuracy). I also tried vgg_face; the speed is roughly 800 samples/sec.

Can you share the speed you get with the command I used? I haven't changed the script.
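
A note on the numbers, inferred from the logs in this thread: the global batch size appears to be per_batch_size times the number of GPUs, e.g. 128 x 4 = 512 in the 4-GPU log above and 128 x 1 = 128 in the single-GPU log below, so per-batch-size 256 on 4 GPUs would mean a global batch of 1024.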

@nttstar
Collaborator

nttstar commented Feb 26, 2018

How did you choose the number of GPUs? You need to set CUDA_VISIBLE_DEVICES.
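
For example (a sketch; the GPU ids are placeholders and the rest of the command is the one quoted above):

# four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0
# single GPU, for comparison
CUDA_VISIBLE_DEVICES=0 python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0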

@nttstar
Collaborator

nttstar commented Feb 26, 2018

You can post your log of single-GPU training.

@CrazyAlan
Author

Here is the log with AM-Softmax; I lost the log for plain softmax, but the speed is very similar. It's a server, so I can choose how many GPUs to use. The speed of ResNet scales linearly with the number of GPUs.

Due to MODULEPATH changes, the following have been reloaded:
  1) openmpi/2.1.1

gpu num: 1
num_layers 1
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=128, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=1, data_dir='/tmp/datasets/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, fresh=False, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=2, lr=0.1, lr_steps='', margin=4, margin_m=0.2, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='m1', num_classes=85164, num_layers=1, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='/tmp/models/m1-L2-I0-m2', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=0, version_output='E', version_se=0, version_unit=3, wd=0.0005)
Do not find any pretrained model, start from scratch
init mobilenet 1
(0, 'E', 3)
INFO:root:loading recordio /tmp/datasets/faces_ms1m_112x112/train.rec...
header0 label [ 3804847.  3890011.]
id2range 85164
0 0
3804846
rand_mirror 1
(128,)
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [400000, 560000, 640000]
[02:44:40] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  optimizer_params=optimizer_params)
call reset()
INFO:root:Epoch[0] Batch [20]	Speed: 484.97 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [40]	Speed: 516.08 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [60]	Speed: 519.08 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [80]	Speed: 520.04 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [100]	Speed: 486.07 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [120]	Speed: 514.40 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [140]	Speed: 519.28 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [160]	Speed: 517.01 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [180]	Speed: 517.82 samples/sec	acc=0.000000

@nttstar
Collaborator

nttstar commented Feb 26, 2018

I guess it was affected by I/O. Did you use an SSD?

@CrazyAlan
Author

Hmmm, it might be. It's more of a distributed system, so the network storage should be much slower than an SSD. When I use a larger batch size, the number of samples processed per second increases. Did you run any experiments with per-batch-size set to 256 or even 512? And do you know how it would affect accuracy?

@nttstar
Collaborator

nttstar commented Feb 26, 2018

It is very common to obtain better performance when using a larger batch size.

@nttstar
Collaborator

nttstar commented Feb 26, 2018

You can try a larger learning rate when using a larger batch size. But I haven't done any experiments other than batch size 512 (128*4).
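
(A common heuristic, not something tested in this thread, is to scale the learning rate linearly with the batch size: lr_new = lr_base * batch_new / batch_base, e.g. 0.1 * 1024 / 512 = 0.2 when going from the default batch of 512 to 1024.)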

@CrazyAlan
Author

Ok, I will try that. Anyway, thanks for the help, really appreciate it!!

@AleximusOrloff

@CrazyAlan
Can you please share your training results?
I'm trying to reproduce the results with the Torch7 framework, but my network learns nothing.
As far as I know, training a plain network (plain meaning without residuals) from scratch is quite a tricky task. So have you succeeded?

@faderani

faderani commented Jan 2, 2021

@CrazyAlan Have you resolved this? I have the same problem.
