Slow training in machines with multiple GPUs #3265
Comments
@pvcastro I would suggest upgrading. It's pretty hard for us to investigate issues on PyTorch 0.4.1. Possibly @amarasovic or @brendan-ai2 might be able to relate their experiences running a different model on multiple GPUs. I know some people in the group do that regularly.
Hi @schmmd!
@pvcastro apologies--I misread--thank you for clarifying. I don't think it's normal behavior, but let's wait until someone who actually runs multi-GPU workloads is able to give some perspective.
Great, thanks!
FYI, we're running a SQuAD training using pytorch-transformers on 2 GPUs while an AllenNLP NER training runs on a third GPU, and the SQuAD training has had no impact on the AllenNLP NER training.
Hi @pvcastro, this is an area that we're actively working on. Our multi-GPU setup is definitely not ideal and we're attempting to migrate to […]
Looking forward to hearing more. Regards.
Hi @brendan-ai2, thanks for the reply!

For 1, what do you propose I do to instrument the code? Any thoughts? Have you done this before, and could you share the procedure?

For 2 (this is the one causing me more pain, because I'm unable to run even 2 trainings in parallel), here is the setup I have, based on running `nvidia-smi topo -m` on the DIGITS DevBox with 3 RTX 2080 Ti cards: I don't have NVLink in this machine, but the DGX1 and AC922 I used before did have it, and the same issue happened there too.

Anyway, I'll try measuring all these resources and will post the results here. If you have any suggestions on the best way to approach this, please let me know. Thanks!
For 1, you could try using the script at https://github.com/allenai/allennlp/blob/master/scripts/benchmark_iter.py. This will tell you how long it takes to read, tensorize and batch your dataset. Then you'll need to compare this against how long it takes to train a batch. For 2, I'm less certain.
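To make the suggested comparison concrete, here is a rough, hypothetical sketch in plain PyTorch (it is not the benchmark_iter.py script itself): `batches`, `model`, `optimizer`, and the assumption that the model returns a dict with a "loss" entry are all placeholders, not AllenNLP APIs.

```python
# Hypothetical sketch: time the data pipeline (read + tensorize + batch)
# against the time of a full training step on the GPU.
import time
import torch

def avg_batch_time(batches, n=100):
    """Average time just to *produce* n batches, with no GPU work at all."""
    start = time.perf_counter()
    for i, _ in enumerate(batches):
        if i + 1 >= n:
            break
    return (time.perf_counter() - start) / n

def avg_train_step_time(batches, model, optimizer, device, n=100):
    """Average time for forward + backward + optimizer step over n batches."""
    model.to(device).train()
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for i, batch in enumerate(batches):
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch)["loss"]  # assumes the model returns a dict with a "loss" entry
        loss.backward()
        optimizer.step()
        if i + 1 >= n:
            break
    torch.cuda.synchronize(device)
    return (time.perf_counter() - start) / n

# If the first number is comparable to the second, the CPU-side data pipeline
# is the bottleneck, and adding GPUs (or a second training) won't speed things up.
```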
Hi @brendan-ai2, just to give you some feedback: I haven't had a chance to follow your suggestions yet. Once I have some information, I'll post it here.
This should now be resolved by #3529.
Question
I have run AllenNLP NER training on multiple machines with multiple GPUs (DGX1, IBM AC922 and a custom DIGITS DevBox), and I have observed the following behavior on all of them:
There is no difference in training time whether you use one or more GPUs for the same training. If you use 2 GPUs instead of 1, the number of batches is halved, but each batch takes twice as long to process, so overall it makes no difference.
When you run multiple trainings, each on a separate GPU (isolated with CUDA_VISIBLE_DEVICES), the first training starts fine, but the following ones run very slowly, even though they are on separate GPUs hidden from each other via CUDA_VISIBLE_DEVICES. The same happens with inference. (A minimal sketch of this launch pattern follows this list.)
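For concreteness, here is a minimal sketch of the one-training-per-GPU launch pattern described above. The config and output names are made up; the key point is that CUDA_VISIBLE_DEVICES must be set before the process initializes CUDA, so each job only ever sees its own GPU (which is then re-indexed as cuda:0 inside that process).

```python
# Hypothetical illustration of per-process GPU isolation. The jobs are usually
# launched from the shell (config/output names below are made up):
#
#   CUDA_VISIBLE_DEVICES=0 allennlp train ner_a.jsonnet -s output_a &
#   CUDA_VISIBLE_DEVICES=1 allennlp train ner_b.jsonnet -s output_b &
#
# Inside each process, only the masked-in GPU is visible:
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # e.g. "1" for the second job
print(torch.cuda.device_count())               # 1 -- the other GPUs are hidden
print(torch.cuda.current_device())             # 0 -- the visible GPU appears as cuda:0
```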
Is this normal behavior? Is there anything I can do to improve it? I'm having a hard time taking advantage of robust machines with multiple GPUs. This has been happening since AllenNLP 0.7.2 running on PyTorch 0.4.1, on all the machines I've used.
One note: on the AC922 and DGX1 the GPUs are connected with NVLink, but the DIGITS DevBox has no NVLink, and the behavior is the same either way.
Thanks!