Training MobileNet with multiple GPU #61

Closed
CrazyAlan opened this issue Feb 23, 2018 · 19 comments

@CrazyAlan

When I use 4 GPUs to train the network, there seems to be no speed increase compared to a single GPU.

When training runs in 4-GPU mode, each GPU's "GPU-Util" drops to 20%~30%, while in single-GPU mode it stays above 60%. Do you have any idea what the problem is?

@nttstar
Collaborator

nttstar commented Feb 24, 2018

export MXNET_CPU_WORKER_NTHREADS=24
export MXNET_ENGINE_TYPE=ThreadedEnginePerDevice

Have you done this before training?
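For reference, roughly what these settings do (behavior can vary across MXNet versions):

# Use one scheduling thread pool per device so multiple GPUs can be driven independently.
export MXNET_ENGINE_TYPE=ThreadedEnginePerDevice
# Raise the number of CPU worker threads available for CPU-side operators.
export MXNET_CPU_WORKER_NTHREADS=24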

@CrazyAlan
Author

@nttstar Yes, attached is the log. When I use a single GPU, the speed is also around 800 samples/s.

gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 8631
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=7.531499892038736, ctx_num=4, data_dir='/tmp/datasets/faces_vgg_112x112', easy_margin=0, emb_size=128, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=0, lr=0.1, lr_steps='', margin=4, margin_m=0.35, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='m1', num_classes=8631, num_layers=1, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='/tmp/m1_I0-L0-OE-E128', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=0, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init mobilenet 1
(0, 'E', 3)
INFO:root:loading recordio /tmp/datasets/faces_vgg_112x112/train.rec...
header0 label [ 3137808. 3146439.]
id2range 8631
0 0
3137807
rand_mirror 1
(512,)
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [40000, 60000, 80000]
[21:45:27] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
optimizer_params=optimizer_params)
call reset()
[21:45:48] src/kvstore/././comm.h:653: only 4 out of 12 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[21:45:48] src/kvstore/././comm.h:662: .v..
[21:45:48] src/kvstore/././comm.h:662: v...
[21:45:48] src/kvstore/././comm.h:662: ...v
[21:45:48] src/kvstore/././comm.h:662: ..v.
INFO:root:Epoch[0] Batch [20] Speed: 101.17 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 432.72 samples/sec acc=0.000195
INFO:root:Epoch[0] Batch [60] Speed: 786.72 samples/sec acc=0.000391
INFO:root:Epoch[0] Batch [80] Speed: 848.32 samples/sec acc=0.000098
INFO:root:Epoch[0] Batch [100] Speed: 827.08 samples/sec acc=0.000586
INFO:root:Epoch[0] Batch [120] Speed: 834.74 samples/sec acc=0.000293
INFO:root:Epoch[0] Batch [140] Speed: 862.26 samples/sec acc=0.000293
INFO:root:Epoch[0] Batch [160] Speed: 885.90 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [180] Speed: 875.62 samples/sec acc=0.000488
INFO:root:Epoch[0] Batch [200] Speed: 879.54 samples/sec acc=0.000488
INFO:root:Epoch[0] Batch [220] Speed: 856.85 samples/sec acc=0.001172
INFO:root:Epoch[0] Batch [240] Speed: 880.83 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [260] Speed: 871.60 samples/sec acc=0.000879
INFO:root:Epoch[0] Batch [280] Speed: 872.43 samples/sec acc=0.001074
INFO:root:Epoch[0] Batch [300] Speed: 865.48 samples/sec acc=0.001270
INFO:root:Epoch[0] Batch [320] Speed: 874.28 samples/sec acc=0.001465
INFO:root:Epoch[0] Batch [340] Speed: 863.19 samples/sec acc=0.001465
INFO:root:Epoch[0] Batch [360] Speed: 858.20 samples/sec acc=0.001270

@CrazyAlan
Author

The system memory is 125 GB, and each GPU has 12 GB of memory.

@nttstar
Collaborator

nttstar commented Feb 24, 2018

Any problems if you use other DL frameworks?
Try changing kvstore = 'device' in train_softmax.py to kvstore = 'local', then tell me the result.
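
For anyone following along, a minimal self-contained sketch of where the kvstore argument is passed to an MXNet Module's fit call (a toy model on random data; the real script builds the face-recognition network and a RecordIO iterator, and the exact line in train_softmax.py may differ):

import mxnet as mx
import numpy as np

# Toy symbol and data, only to show where kvstore goes.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10)
sym = mx.sym.SoftmaxOutput(data=fc, name='softmax')

x = np.random.rand(64, 20).astype('float32')
y = np.random.randint(0, 10, size=(64,)).astype('float32')
train_iter = mx.io.NDArrayIter(x, y, batch_size=16)

# In practice the context would be [mx.gpu(i) for i in range(num_gpus)].
model = mx.mod.Module(symbol=sym, context=[mx.cpu()])
model.fit(train_iter,
          optimizer='sgd',
          optimizer_params={'learning_rate': 0.1},
          kvstore='local',   # 'device' aggregates gradients on the GPUs; 'local' uses the CPU
          num_epoch=1)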

@nttstar
Collaborator

nttstar commented Feb 24, 2018

@CrazyAlan I see gpu num: 4 in your log.

@CrazyAlan
Author

CrazyAlan commented Feb 24, 2018

Hi @nttstar, thanks a lot for the help! I tried changing to kvstore = 'local', but the performance is still the same.

What I found is that I am using MobileNet, which is not well implemented under MXNet (compared to TensorFlow), so most of the time the GPUs sit idle :(.
When I switch to ResNet, the throughput with 4 GPUs is 4x that of a single GPU, and GPU utilization is full.

Do you know of any high-efficiency depthwise convolution operator for MXNet? I think that might help with the speed problem. I am new to MXNet, but in Caffe I was using this (https://github.com/yonghenglh6/DepthwiseConvolution), and it seems to be much faster.
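
For reference, a depthwise convolution in MXNet is normally written as a grouped Convolution with num_group equal to the number of input channels, as in the sketch below (channel counts are illustrative; whether this maps to a fast kernel depends on the MXNet/cuDNN version, which is exactly the throughput issue discussed here):

import mxnet as mx

data = mx.sym.Variable('data')                        # expected input shape: (N, 32, H, W)
# Depthwise 3x3: groups == input channels == output channels.
dw = mx.sym.Convolution(data=data, num_filter=32, num_group=32,
                        kernel=(3, 3), stride=(1, 1), pad=(1, 1),
                        no_bias=True, name='conv_dw')
# Pointwise 1x1, as in the MobileNet depthwise-separable block.
pw = mx.sym.Convolution(data=dw, num_filter=64, kernel=(1, 1),
                        no_bias=True, name='conv_pw')

print(pw.infer_shape(data=(1, 32, 112, 112))[1])      # [(1, 64, 112, 112)]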

@nttstar
Collaborator

nttstar commented Feb 25, 2018

Ah, I hadn't realized this before. I'm not sure if there's a better implementation. I will update the symbol if there is. Thank you very much!

nttstar changed the title from "Training with multiple GPU" to "Training MobileNet with multiple GPU" on Feb 25, 2018
@nttstar
Collaborator

nttstar commented Feb 25, 2018

@CrazyAlan I just did a simple experiment on my test server (with Tesla M40 GPUs): about 700 samples/s using 4 GPUs versus only about 220 samples/s with a single GPU. So there's no problem in my test. Can you check your training parameters again?

@CrazyAlan
Author

@nttstar My training command is: python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0

I am using Tesla P100s; the speed with 4 GPUs or with 1 GPU is almost the same (~562.99 samples/sec). Is it possible that the file reading speed causes the problem? When I change per-batch-size to 256, the speed increases to ~700 samples/sec (not sure whether it would harm accuracy). I also tried vgg_face; the speed is roughly 800 samples/sec.

Can you share the speed you get with the command I used? I haven't changed the script.
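
A note on the numbers, inferred from the logs in this thread: the global batch size appears to be per_batch_size times the number of GPUs, e.g. 128 x 4 = 512 in the 4-GPU log above and 128 x 1 = 128 in the single-GPU log below, so per-batch-size 256 on 4 GPUs would mean a global batch of 1024.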

@nttstar
Collaborator

nttstar commented Feb 26, 2018

How did you choose the number of GPUs? You need to set CUDA_VISIBLE_DEVICES.
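
For example (a sketch; the GPU ids are placeholders and the rest of the command is the one quoted above):

# four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0
# single GPU, for comparison
CUDA_VISIBLE_DEVICES=0 python train_softmax.py --network m1 --loss-type 0 --data-dir /tmp/datasets/faces_ms1m_112x112 --prefix /tmp/models/model-m1-softmax --version-input 0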

@nttstar
Collaborator

nttstar commented Feb 26, 2018

You can post your log of single-GPU training.

@CrazyAlan
Author

Here is the log with AM-Softmax; I lost the log for plain softmax, but the speed is very similar. It's a server, so I can choose how many GPUs to use. The speed of ResNet scales linearly with the number of GPUs.

Due to MODULEPATH changes, the following have been reloaded:
  1) openmpi/2.1.1

gpu num: 1
num_layers 1
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=128, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=1, data_dir='/tmp/datasets/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, fresh=False, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=2, lr=0.1, lr_steps='', margin=4, margin_m=0.2, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='m1', num_classes=85164, num_layers=1, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='/tmp/models/m1-L2-I0-m2', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=0, version_output='E', version_se=0, version_unit=3, wd=0.0005)
Do not find any pretrained model, start from scratch
init mobilenet 1
(0, 'E', 3)
INFO:root:loading recordio /tmp/datasets/faces_ms1m_112x112/train.rec...
header0 label [ 3804847.  3890011.]
id2range 85164
0 0
3804846
rand_mirror 1
(128,)
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [400000, 560000, 640000]
[02:44:40] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  optimizer_params=optimizer_params)
call reset()
INFO:root:Epoch[0] Batch [20]	Speed: 484.97 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [40]	Speed: 516.08 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [60]	Speed: 519.08 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [80]	Speed: 520.04 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [100]	Speed: 486.07 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [120]	Speed: 514.40 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [140]	Speed: 519.28 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [160]	Speed: 517.01 samples/sec	acc=0.000000
INFO:root:Epoch[0] Batch [180]	Speed: 517.82 samples/sec	acc=0.000000

@nttstar
Collaborator

nttstar commented Feb 26, 2018

I guess it was affected by I/O. Did you use an SSD?

@CrazyAlan
Author

Hmmm, it might be. It's more of a distributed system, so the network storage should be much slower than an SSD. When I use a larger batch size, the number of samples processed per second increases. Did you run any experiments with per-batch-size set to 256 or even 512? And do you know how it would affect accuracy?

@nttstar
Collaborator

nttstar commented Feb 26, 2018

It is very common to obtain better performance when using a larger batch size.

@nttstar
Collaborator

nttstar commented Feb 26, 2018

You can try a larger learning rate when using a larger batch size. But I haven't done any experiments other than batch size 512 (128*4).
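
(A common heuristic, not something tested in this thread, is to scale the learning rate linearly with the batch size: lr_new = lr_base * batch_new / batch_base, e.g. 0.1 * 1024 / 512 = 0.2 when going from the default batch of 512 to 1024.)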

@CrazyAlan
Author

Ok, I will try that. Anyway, thanks for the help, really appreciate it!!

@AleximusOrloff

@CrazyAlan
Can you please share your training results?
I'm trying to reproduce the results with the Torch7 framework, but my network learns nothing.
As far as I know, training a plain network (plain meaning without residuals) from scratch is quite a tricky task. So have you succeeded?

@faderani

faderani commented Jan 2, 2021

@CrazyAlan Have you resolved this? I have the same problem.
