This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

multi gpu slower than single setup #189

Closed
hakkiyagiz opened this issue Sep 3, 2019 · 9 comments

@hakkiyagiz

Hi,

I am not sure, but it looks like training slows down when I use 2 GPU cards.

With a single GPU, training runs at 220 sent/s.

But with multi GPU (2 cards) and the same params, training slows down to 85 sent/s.

It also generates 2 different folders under the dumped folder.

Is this normal behaviour, or am I missing something?

Thanks

@glample
Contributor

glample commented Sep 3, 2019

Hi,

The number of words and sentences per second is reported per GPU. So if you run a 2-GPU job and each process prints 100 sent/s, the actual training speed is 200 sent/s. In the ideal scenario the speedup is linear in the number of GPUs, so each training log would show the same sent/s as the single-GPU run. But in your case it is slower either way, so there must be something wrong.
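To make the per-GPU convention concrete, here is a quick back-of-the-envelope check with the numbers from the opening post (plain illustrative Python, not part of XLM):

# Rough arithmetic with the throughput figures reported above.
single_gpu = 220              # sent/s with 1 GPU
per_gpu_multi = 85            # sent/s printed by each of the 2 processes
aggregate = 2 * per_gpu_multi           # 170 sent/s effective throughput
speedup = aggregate / single_gpu        # ~0.77x, i.e. the 2-GPU run is actually slower
print(aggregate, round(speedup, 2))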

Does it generate 2 different folders? Or two different training log files? You should observe 2 training log files, but in the same folder, along with the params.pkl file.

Can you share the command you used to start your multi-GPU experiment?

@hakkiyagiz
Author

Thank you for the explanation; as you said, there was definitely a problem with my setup.

Single GPU

python train.py --max_epoch 5 --exp_name exp05_2080S_ASUS --dump_path ./dumped --data_path data/processed/XLM_tr/30k --lgs 'tr' --clm_steps '' --mlm_steps 'tr'

And these are the logs from the single-GPU run:

0:00:14 - 50 - 208.09 sent/s - 7914.07 words/s - MLM-tr: 8.0562 - - model LR: 1.0000e-04
0:00:15 - 55 - 207.43 sent/s - 8006.92 words/s - MLM-tr: 7.9926 - - model LR: 1.0000e-04
0:00:16 - 60 - 208.96 sent/s - 8033.30 words/s - MLM-tr: 7.9753 - - model LR: 1.0000e-04
0:00:16 - 65 - 208.50 sent/s - 7923.04 words/s - MLM-tr: 7.9684 - - model LR: 1.0000e-04
0:00:17 - 70 - 209.12 sent/s - 7937.30 words/s - MLM-tr: 7.9442 - - model LR: 1.0000e-04
0:00:18 - 75 - 208.48 sent/s - 8040.85 words/s - MLM-tr: 7.8956 - - model LR: 1.0000e-04
0:00:19 - 80 - 208.77 sent/s - 7989.36 words/s - MLM-tr: 7.8368 - - model LR: 1.0000e-04
0:00:19 - 85 - 207.26 sent/s - 7856.36 words/s - MLM-tr: 7.8290 - - model LR: 1.0000e-04
0:00:20 - 90 - 209.05 sent/s - 8047.31 words/s - MLM-tr: 7.8398 - - model LR: 1.0000e-04
0:00:21 - 95 - 207.69 sent/s - 7968.80 words/s - MLM-tr: 7.6982 - - model LR: 1.0000e-04

Multi GPU

export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --max_epoch 5 --exp_name exp06_2X_2080S --dump_path ./dumped --data_path data/processed/XLM_tr/30k --lgs 'tr' --clm_steps '' --mlm_steps 'tr'

It generates 2 different folders inside the dumped folder.

foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls
n5t2wqqpfr nnk25v6kjt
foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls n5t2wqqpfr
params.pkl train.log-1
foo@mlgpu:~/XLM/dumped/exp06_2X_2080S$ ls nnk25v6kjt
hypotheses params.pkl train.log

And here are the log outputs:

nnk25v6kjt/train.log

0:06:33 - 1040 - 86.05 sent/s - 3318.46 words/s - MLM-tr: 7.4536 - - model LR: 1.0000e-04
0:06:35 - 1045 - 86.07 sent/s - 3297.49 words/s - MLM-tr: 7.5419 - - model LR: 1.0000e-04
0:06:37 - 1050 - 86.24 sent/s - 3290.09 words/s - MLM-tr: 7.5144 - - model LR: 1.0000e-04
0:06:39 - 1055 - 85.65 sent/s - 3240.72 words/s - MLM-tr: 7.4447 - - model LR: 1.0000e-04
0:06:41 - 1060 - 86.20 sent/s - 3247.76 words/s - MLM-tr: 7.5086 - - model LR: 1.0000e-04
0:06:43 - 1065 - 86.04 sent/s - 3292.57 words/s - MLM-tr: 7.4742 - - model LR: 1.0000e-04
0:06:44 - 1070 - 86.17 sent/s - 3305.78 words/s - MLM-tr: 7.5722 - - model LR: 1.0000e-04
0:06:46 - 1075 - 86.23 sent/s - 3284.86 words/s - MLM-tr: 7.4754 - - model LR: 1.0000e-04
0:06:48 - 1080 - 86.13 sent/s - 3298.87 words/s - MLM-tr: 7.4685 - - model LR: 1.0000e-04
0:06:50 - 1085 - 85.70 sent/s - 3263.07 words/s - MLM-tr: 7.4339 - - model LR: 1.0000e-04

n5t2wqqpfr/train.log-1

0:06:33 - 1040 - 86.04 sent/s - 3231.26 words/s - MLM-tr: 7.4471 - - model LR: 1.0000e-04
0:06:35 - 1045 - 86.08 sent/s - 3314.58 words/s - MLM-tr: 7.5433 - - model LR: 1.0000e-04
0:06:37 - 1050 - 86.24 sent/s - 3263.61 words/s - MLM-tr: 7.3916 - - model LR: 1.0000e-04
0:06:39 - 1055 - 85.63 sent/s - 3178.64 words/s - MLM-tr: 7.4647 - - model LR: 1.0000e-04
0:06:41 - 1060 - 86.22 sent/s - 3308.16 words/s - MLM-tr: 7.5091 - - model LR: 1.0000e-04
0:06:43 - 1065 - 86.02 sent/s - 3297.96 words/s - MLM-tr: 7.4426 - - model LR: 1.0000e-04
0:06:44 - 1070 - 86.18 sent/s - 3291.05 words/s - MLM-tr: 7.5312 - - model LR: 1.0000e-04
0:06:46 - 1075 - 86.23 sent/s - 3303.14 words/s - MLM-tr: 7.3746 - - model LR: 1.0000e-04
0:06:48 - 1080 - 86.14 sent/s - 3296.01 words/s - MLM-tr: 7.5538 - - model LR: 1.0000e-04

nvidia-smi shows full GPU utilization on each card and 2 different processes running on them:

| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 60% 75C P2 234W / 250W | 5509MiB / 7981MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 36% 59C P2 206W / 250W | 6215MiB / 7982MiB | 97% Default |

| 0 3607 C .../dh/environments/pytorch-env/bin/python 5499MiB |
| 1 3608 C .../dh/environments/pytorch-env/bin/python 6205MiB |

@glample
Contributor

glample commented Sep 4, 2019

Hmm, this is weird; I am not sure I understand what is going on here. The first thing to check would be whether the parameters are actually shared across the GPUs.

Could you try, after each iteration step, doing something like:
logger.info(sum(p.sum().item() for p in model.parameters()))
where model is your encoder and then your decoder?
If the printed parameter sums are identical across the two processes, the experiment is indeed running multi-GPU. If not, the two processes are evolving independently and there is an issue.
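For context, here is a minimal sketch of where such a check could sit in a generic PyTorch DDP training loop (a placeholder loop, not XLM's trainer; model, optimizer, data_loader and logger are assumed to already exist):

# Hypothetical placement of the parameter-sum check, not XLM's actual code.
for step, batch in enumerate(data_loader):
    loss = model(batch)            # placeholder forward pass returning a scalar loss
    loss.backward()                # under DDP, gradients are all-reduced here
    optimizer.step()
    optimizer.zero_grad()
    # If the processes are properly synchronized, every rank logs the same value.
    logger.info(sum(p.sum().item() for p in model.parameters()))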

@hakkiyagiz
Author

@glample here it is

dumped/exp06_2X_2080S/4t1fvd7v88/train.log

INFO - 09/04/19 14:03:47 - 0:19:30 - ============ End of epoch 0 ============
INFO - 09/04/19 14:03:47 - 0:19:30 - 5131.00750564388
INFO - 09/04/19 14:04:01 - 0:19:44 - epoch -> 0.000000
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - log:{"epoch": 0, "valid_tr_mlm_ppl": 3412.188208856406, "valid_tr_mlm_acc": 4.640797063450446, "valid_mlm_ppl": 3412.188208856406, "valid_mlm_acc": 4.640797063450446, "test_tr_mlm_ppl": 3337.6335580628033, "test_tr_mlm_acc": 4.915392941052242, "test_mlm_ppl": 3337.6335580628033, "test_mlm_acc": 4.915392941052242}
INFO - 09/04/19 14:04:01 - 0:19:44 - Saving checkpoint to ./dumped/exp06_2X_2080S/4t1fvd7v88/checkpoint.pth ...
WARNING - 09/04/19 14:04:01 - 0:19:44 - Saving model parameters ...
WARNING - 09/04/19 14:04:01 - 0:19:44 - Saving model optimizer ...
INFO - 09/04/19 14:04:02 - 0:19:44 - ============ Starting epoch 1 ... ============

dumped/exp06_2X_2080S/rgy0ay2xj9/train.log-1

INFO - 09/04/19 14:03:47 - 0:19:30 - ============ End of epoch 0 ============
INFO - 09/04/19 14:03:47 - 0:19:30 - 5131.00750564388
INFO - 09/04/19 14:04:01 - 0:19:44 - epoch -> 0.000000
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_tr_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_ppl -> 3412.188209
INFO - 09/04/19 14:04:01 - 0:19:44 - valid_mlm_acc -> 4.640797
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_tr_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_ppl -> 3337.633558
INFO - 09/04/19 14:04:01 - 0:19:44 - test_mlm_acc -> 4.915393
INFO - 09/04/19 14:04:01 - 0:19:44 - ============ Starting epoch 1 ... ============

@hakkiyagiz
Author

btw, these cards are from different brands: one is a Gigabyte and the other is an ASUS. But I don't think that is the cause.

@glample
Contributor

glample commented Sep 4, 2019

This means that the parameters are properly synchronized between the two processes. That is usually a good thing, but here it means that everything is working as expected, so unfortunately I don't understand why the code is slow. I would guess the communication between your two GPUs is not fast enough. If the cards are from different brands, that is maybe not surprising.

What you can try is --accumulate_gradients 4 to synchronize the gradients only every 4 batches. For the same number of tokens processed, it divides the amount of memory transfer between the two GPUs by 4, which should help. If even that is not satisfactory, I don't really have a solution, sorry. The hardware is probably the issue here.
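For reference, the idea behind gradient accumulation can be sketched roughly as follows with plain PyTorch DDP (a simplified illustration assuming model is wrapped in DistributedDataParallel, not XLM's actual implementation):

import contextlib

accumulate_gradients = 4  # mirrors the --accumulate_gradients flag

for step, batch in enumerate(data_loader):
    sync = (step + 1) % accumulate_gradients == 0
    # model.no_sync() skips the DDP gradient all-reduce on intermediate steps,
    # so inter-GPU traffic only happens once every accumulate_gradients batches.
    ctx = contextlib.nullcontext() if sync else model.no_sync()
    with ctx:
        loss = model(batch) / accumulate_gradients   # placeholder forward pass
        loss.backward()
    if sync:
        optimizer.step()
        optimizer.zero_grad()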

@hakkiyagiz
Author

Hmm,
maybe my chipset (GA-B75M-D3H) and CPU (i5-3340) are too ancient for this workload. I will check this on more recent hardware asap.

@hakkiyagiz
Author

oh crap!

1 x PCI Express x16 slot, running at x4

sorry for taking up your time

@Jamiroquai88

Hi,

I have a very similar problem with speed.

CPU

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

MB

SuperMicro X10DRG-Q

1x RTX 2080

export CUDA_VISIBLE_DEVICES=0
export NGPU=1

python -W ignore::UserWarning -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name xlm_cs-en \
    --dump_path ./dumped \
    --data_path $OUTPATH \
    --lgs 'cs-en' \
    --clm_steps '' \
    --mlm_steps 'cs,en' \
    --emb_dim 1024 \
    --n_layers 12 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 4 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,25 \
    --max_vocab 150000 \
    --amp 01 \
    --fp16 true
INFO - 06/22/20 10:51:39 - 0:01:18 -      20 -   14.63 sent/s -   511.97 words/s - MLM-cs:  6.2584 || MLM-en:  7.7495 -  - model LR: 1.0000e-04

10x RTX 2080

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9
export NGPU=10

python -W ignore::UserWarning -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name xlm_cs-en \
    --dump_path ./dumped \
    --data_path $OUTPATH \
    --lgs 'cs-en' \
    --clm_steps '' \
    --mlm_steps 'cs,en' \
    --emb_dim 1024 \
    --n_layers 12 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 4 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,25 \
    --max_vocab 150000 \
    --amp 01 \
    --fp16 true \
    --accumulate_gradients 4
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    96.99 words/s - MLM-cs:  7.2759 || MLM-en:  8.0421 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.45 words/s - MLM-cs:  7.0937 || MLM-en:  7.9573 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    93.41 words/s - MLM-cs:  7.6418 || MLM-en:  8.1558 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.94 words/s - MLM-cs:  7.5107 || MLM-en:  8.1327 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    98.00 words/s - MLM-cs:  7.4999 || MLM-en:  8.1016 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.94 words/s - MLM-cs:  7.7264 || MLM-en:  7.9569 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.96 words/s - MLM-cs:  7.1065 || MLM-en:  8.0713 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    95.95 words/s - MLM-cs:  7.4804 || MLM-en:  8.2623 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:48 -      25 -    2.55 sent/s -    96.47 words/s - MLM-cs:  7.4324 || MLM-en:  8.1150 -  - model LR: 1.0000e-04
INFO - 06/22/20 10:47:45 - 0:02:49 -      25 -    2.55 sent/s -    94.42 words/s - MLM-cs:  7.4249 || MLM-en:  7.7899 -  - model LR: 1.0000e-04

I tried printing this after every iteration, and it returns the same value for all GPUs:

logger.info(sum(p.sum().item() for p in model.parameters()))

Also, I am getting this warning:

/tmp/pip-req-build-ocx5vxk7/aten/src/ATen/native/IndexingUtils.h:20: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead.

Overall, it looks like training with 10x RTX 2080 is only a little faster than with 1x RTX 2080. I am getting very similar results on another server with 5x 1080. I have used Horovod for parallel training and it was always fine (not linear, but close enough). Is there something that I can try?

Thanks.
