This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

multi gpu slower than single setup #189

Closed
hakkiyagiz opened this issue Sep 3, 2019 · 9 comments

@hakkiyagiz

Hi,

I am not sure, but it looks like training slows down when I use 2 GPU cards.

With a single GPU, training runs at 220 sent/s.

But with multi GPU (2 cards) and the same params, training slows down to 85 sent/s.

It also generates 2 different folders under the dumped folder.

Is this normal behaviour, or am I missing something?

Thanks

@glample
Contributor

glample commented Sep 3, 2019

Hi,

The number of words and sentences per second is reported per GPU. So if you run a 2-GPU job and each process prints 100 sent/s, the actual training speed is 200 sent/s. In the ideal scenario the speedup is linear in the number of GPUs, so each training log would show the same sent/s as the single-GPU run. But in your case it is slower either way, so there must be something wrong.
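To make the per-GPU convention concrete, here is a quick back-of-the-envelope check with the numbers from the opening post (plain illustrative Python, not part of XLM):

# Rough arithmetic with the throughput figures reported above.
single_gpu = 220              # sent/s with 1 GPU
per_gpu_multi = 85            # sent/s printed by each of the 2 processes
aggregate = 2 * per_gpu_multi           # 170 sent/s effective throughput
speedup = aggregate / single_gpu        # ~0.77x, i.e. the 2-GPU run is actually slower
print(aggregate, round(speedup, 2))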

Does it generate 2 different folders? Or two different training log files? You should observe 2 training log files, but in the same folder, along with the params.pkl file.

Can you share the command you used to start your multi-GPU experiment?

@hakkiyagiz
Author

Thank you for the explanation; as you said, there was definitely a problem with my setup.

Single GPU

python train.py --max_epoch 5 --exp_name exp05_2080S_ASUS --dump_path ./dumped --data_path data/processed/XLM_tr/30k --lgs 'tr' --clm_steps '' --mlm_steps 'tr'

And these are the logs from the single-GPU run:

0:00:14 - 50 - 208.09 sent/s - 7914.07 words/s - MLM-tr: 8.0562 - - model LR: 1.0000e-04
0:00:15 - 55 - 207.43 sent/s - 8006.92 words/s - MLM-tr: 7.9926 - - model LR: 1.0000e-04
0:00:16 - 60 - 208.96 sent/s - 8033.30 words/s - MLM-tr: 7.9753 - - model LR: 1.0000e-04
0:00:16 - 65 - 208.50 sent/s - 7923.04 words/s - MLM-tr: 7.9684 - - model LR: 1.0000e-04
0:00:17 - 70 - 209.12 sent/s - 7937.30 words/s - MLM-tr: 7.9442 - - model LR: 1.0000e-04
0:00:18 - 75 - 208.48 sent/s - 8040.85 words/s - MLM-tr: 7.8956 - - model LR: 1.0000e-04
0:00:19 - 80 - 208.77 sent/s - 7989.36 words/s - MLM-tr: 7.8368 - - model LR: 1.0000e-04
0:00:19 - 85 - 207.26 sent/s - 7856.36 words/s - MLM-tr: 7.8290 - - model LR: 1.0000e-04
0:00:20 - 90 - 209.05 sent/s - 8047.31 words/s - MLM-tr: 7.8398 - - model LR: 1.0000e-04
0:00:21 - 95 - 207.69 sent/s - 7968.80 words/s - MLM-tr: 7.6982 - - model LR: 1.0000e-04

Multi GPU

export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --max_epoch 5 --exp_name exp06_2X_2080S --dump_path ./dumped --data_path data/processed/XLM_tr/30k --lgs 'tr' --clm_steps '' --mlm_steps 'tr'

It generates 2 different folders inside the dumped folder.

foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls
n5t2wqqpfr nnk25v6kjt
foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls n5t2wqqpfr
params.pkl train.log-1
foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls nnk25v6kjt
hypotheses params.pkl train.log

And here are the log outputs:

nnk25v6kjt/train.log

0:06:33 - 1040 - 86.05 sent/s - 3318.46 words/s - MLM-tr: 7.4536 - - model LR: 1.0000e-04
0:06:35 - 1045 - 86.07 sent/s - 3297.49 words/s - MLM-tr: 7.5419 - - model LR: 1.0000e-04
0:06:37 - 1050 - 86.24 sent/s - 3290.09 words/s - MLM-tr: 7.5144 - - model LR: 1.0000e-04
0:06:39 - 1055 - 85.65 sent/s - 3240.72 words/s - MLM-tr: 7.4447 - - model LR: 1.0000e-04
0:06:41 - 1060 - 86.20 sent/s - 3247.76 words/s - MLM-tr: 7.5086 - - model LR: 1.0000e-04
0:06:43 - 1065 - 86.04 sent/s - 3292.57 words/s - MLM-tr: 7.4742 - - model LR: 1.0000e-04
0:06:44 - 1070 - 86.17 sent/s - 3305.78 words/s - MLM-tr: 7.5722 - - model LR: 1.0000e-04
0:06:46 - 1075 - 86.23 sent/s - 3284.86 words/s - MLM-tr: 7.4754 - - model LR: 1.0000e-04
0:06:48 - 1080 - 86.13 sent/s - 3298.87 words/s - MLM-tr: 7.4685 - - model LR: 1.0000e-04
0:06:50 - 1085 - 85.70 sent/s - 3263.07 words/s - MLM-tr: 7.4339 - - model LR: 1.0000e-04

n5t2wqqpfr/train.log-1

0:06:33 - 1040 - 86.04 sent/s - 3231.26 words/s - MLM-tr: 7.4471 - - model LR: 1.0000e-04
0:06:35 - 1045 - 86.08 sent/s - 3314.58 words/s - MLM-tr: 7.5433 - - model LR: 1.0000e-04
0:06:37 - 1050 - 86.24 sent/s - 3263.61 words/s - MLM-tr: 7.3916 - - model LR: 1.0000e-04
0:06:39 - 1055 - 85.63 sent/s - 3178.64 words/s - MLM-tr: 7.4647 - - model LR: 1.0000e-04
0:06:41 - 1060 - 86.22 sent/s - 3308.16 words/s - MLM-tr: 7.5091 - - model LR: 1.0000e-04
0:06:43 - 1065 - 86.02 sent/s - 3297.96 words/s - MLM-tr: 7.4426 - - model LR: 1.0000e-04
0:06:44 - 1070 - 86.18 sent/s - 3291.05 words/s - MLM-tr: 7.5312 - - model LR: 1.0000e-04
0:06:46 - 1075 - 86.23 sent/s - 3303.14 words/s - MLM-tr: 7.3746 - - model LR: 1.0000e-04
0:06:48 - 1080 - 86.14 sent/s - 3296.01 words/s - MLM-tr: 7.5538 - - model LR: 1.0000e-04

nvidia-smi shows full GPU utilization on each card and 2 different processes running on them:

| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 60% 75C P2 234W / 250W | 5509MiB / 7981MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 36% 59C P2 206W / 250W | 6215MiB / 7982MiB | 97% Default |

| 0 3607 C .../dh/environments/pytorch-env/bin/python 5499MiB |
| 1 3608 C .../dh/environments/pytorch-env/bin/python 6205MiB |

@glample
Contributor

glample commented Sep 4, 2019

Hmm, this is weird; I am not sure I understand what is going on here. The first thing to check would be whether the parameters are actually shared across the GPUs.

Could you try, after each iteration step, doing something like:
logger.info(sum(p.sum().item() for p in model.parameters()))
where model is your encoder and then your decoder?
If the printed parameter sums are identical across the two processes, the experiment is indeed running multi-GPU. If not, the two processes are evolving independently and there is an issue.
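For context, here is a minimal sketch of where such a check could sit in a generic PyTorch DDP training loop (a placeholder loop, not XLM's trainer; model, optimizer, data_loader and logger are assumed to already exist):

# Hypothetical placement of the parameter-sum check, not XLM's actual code.
for step, batch in enumerate(data_loader):
    loss = model(batch)            # placeholder forward pass returning a scalar loss
    loss.backward()                # under DDP, gradients are all-reduced here
    optimizer.step()
    optimizer.zero_grad()
    # If the processes are properly synchronized, every rank logs the same value.
    logger.info(sum(p.sum().item() for p in model.parameters()))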

@hakkiyagiz
Author

@glample here it is

dumped/exp06_2X_2080S/4t1fvd7v88/train.log

INFO - 09/04/19 14:03:47 - 0:19:30 - ============ End of epoch 0 ============
INFO - 09/04/19 14:03:47 - 0:19:30 - 5131.00750564388
INFO - 09/04/19 14:04:01 - 0:19:44 - epoch -> 0.000000
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - log:{"epoch": 0, "valid_tr_mlm_ppl": 3412.188208856406, "valid_tr_mlm_acc": 4.640797063450446, "valid_mlm_ppl": 3412.188208856406, "valid_mlm_acc": 4.640797063450446, "test_tr_mlm_ppl": 3337.6335580628033, "test_tr_mlm_acc": 4.915392941052242, "test_mlm_ppl": 3337.6335580628033, "test_mlm_acc": 4.915392941052242}
INFO - 09/04/19 14:04:01 - 0:19:44 - Saving checkpoint to ./dumped/exp06_2X_2080S/4t1fvd7v88/checkpoint.pth ...
WARNING - 09/04/19 14:04:01 - 0:19:44 - Saving model parameters ...
WARNING - 09/04/19 14:04:01 - 0:19:44 - Saving model optimizer ...
INFO - 09/04/19 14:04:02 - 0:19:44 - ============ Starting epoch 1 ... ============

dumped/exp06_2X_2080S/rgy0ay2xj9/train.log-1

INFO - 09/04/19 14:03:47 - 0:19:30 - ============ End of epoch 0 ============
INFO - 09/04/19 14:03:47 - 0:19:30 - 5131.00750564388
INFO - 09/04/19 14:04:01 - 0:19:44 - epoch -> 0.000000
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - ============ Starting epoch 1 ... ============

@hakkiyagiz
Author

btw, these cards are from different brands: one is a Gigabyte and the other is an ASUS. But I don't think that is the cause.

@glample
Contributor

glample commented Sep 4, 2019

This means that the parameters are properly synchronized between the two processes. That is usually a good thing, but here it means that everything is working as expected, so unfortunately I don't understand why the code is slow. I would guess the communication between your two GPUs is not fast enough. If the cards are from different brands, that is maybe not surprising.

What you can try is --accumulate_gradients 4 to synchronize the gradients only every 4 batches. For the same number of tokens processed, it divides the amount of memory transfer between the two GPUs by 4, which should help. If even that is not satisfactory, I don't really have a solution, sorry. The hardware is probably the issue here.
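For reference, the idea behind gradient accumulation can be sketched roughly as follows with plain PyTorch DDP (a simplified illustration assuming model is wrapped in DistributedDataParallel, not XLM's actual implementation):

import contextlib

accumulate_gradients = 4  # mirrors the --accumulate_gradients flag

for step, batch in enumerate(data_loader):
    sync = (step + 1) % accumulate_gradients == 0
    # model.no_sync() skips the DDP gradient all-reduce on intermediate steps,
    # so inter-GPU traffic only happens once every accumulate_gradients batches.
    ctx = contextlib.nullcontext() if sync else model.no_sync()
    with ctx:
        loss = model(batch) / accumulate_gradients   # placeholder forward pass
        loss.backward()
    if sync:
        optimizer.step()
        optimizer.zero_grad()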

@hakkiyagiz
Author

Hmm,
maybe my chipset (GA-B75M-D3H) and CPU (i5-3340) are too ancient for this workload. I will check this on more recent hardware asap.

@hakkiyagiz
Author

oh crap!

1 x PCI Express x16 slot, running at x4

sorry for taking up your time

@Jamiroquai88

Hi,

I have a very similar problem with speed.

CPU

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

MB

SuperMicro X10DRG-Q

1x RTX 2080

export CUDA_VISIBLE_DEVICES=0
export NGPU=1

python -W ignore::UserWarning -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name xlm_cs-en \
    --dump_path ./dumped \
    --data_path $OUTPATH \
    --lgs 'cs-en' \
    --clm_steps '' \
    --mlm_steps 'cs,en' \
    --emb_dim 1024 \
    --n_layers 12 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 4 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,25 \
    --max_vocab 150000 \
    --amp 01 \
    --fp16 true
INFO - 06/22/20 10:51:39 - 0:01:18 -      20 -   14.63 sent/s -   511.97 words/s - MLM-cs:  6.2584 || MLM-en:  7.7495 -  - model LR: 1.0000e-04

10x RTX 2080

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9
export NGPU=10

python -W ignore::UserWarning -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name xlm_cs-en \
    --dump_path ./dumped \
    --data_path $OUTPATH \
    --lgs 'cs-en' \
    --clm_steps '' \
    --mlm_steps 'cs,en' \
    --emb_dim 1024 \
    --n_layers 12 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 4 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,25 \
    --max_vocab 150000 \
    --amp 01 \
    --fp16 true \
    --accumulate_gradients 4
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    96.99 words/s - MLM-cs:  7.2759 || MLM-en:  8.0421 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.45 words/s - MLM-cs:  7.0937 || MLM-en:  7.9573 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    93.41 words/s - MLM-cs:  7.6418 || MLM-en:  8.1558 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.94 words/s - MLM-cs:  7.5107 || MLM-en:  8.1327 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    98.00 words/s - MLM-cs:  7.4999 || MLM-en:  8.1016 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.94 words/s - MLM-cs:  7.7264 || MLM-en:  7.9569 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.96 words/s - MLM-cs:  7.1065 || MLM-en:  8.0713 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.95 words/s - MLM-cs:  7.4804 || MLM-en:  8.2623 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    96.47 words/s - MLM-cs:  7.4324 || MLM-en:  8.1150 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.42 words/s - MLM-cs:  7.4249 || MLM-en:  7.7899 -  - model LR: 1.0000e-04

I tried printing this after every iteration, and it returns the same value for all GPUs:

logger.info(sum(p.sum().item() for p in model.parameters()))

Also, I am getting this warning:

/tmp/pip-req-build-ocx5vxk7/aten/src/ATen/native/IndexingUtils.h:20: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead.

Overall, it looks like training with 10x RTX 2080 is only a little faster than with 1x RTX 2080. I am getting very similar results on another server with 5x 1080. I have used Horovod for parallel training and it was always fine (not linear, but close enough). Is there something that I can try?

Thanks.
