Training process for multi-GPUs #38

Open
Jaehoon-zx opened this issue Feb 21, 2023 · 5 comments

Comments

@Jaehoon-zx

Hi, I am trying to run training/evaluation with 4 A100s.
However, after some experiments I noticed that the training speed was the same as when training with a single GPU.
Am I missing something?

@mo666666

Hello, Jaehoon. I encountered the same problem. I conjecture this is because your TensorFlow package is not installed correctly. I recommend following the tips at https://www.tensorflow.org/install/pip step by step. This may help you solve the problem.
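A quick sanity check, based on the verification step from that install page, is to confirm that TensorFlow actually sees all four GPUs; if the list comes back empty, the GPU build is not set up correctly:

```python
import tensorflow as tf

# On a correctly configured 4xA100 machine this should print four
# PhysicalDevice entries; an empty list means TensorFlow is running
# CPU-only, so multi-GPU training cannot be any faster.
print(tf.config.list_physical_devices('GPU'))
```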

@mo666666

However, after solving the above issue, as a 4×A100 user I also run into a CUDA out-of-memory error. Do you encounter this issue with the code in this repository?

@Jaehoon-zx

Jaehoon-zx commented Feb 25, 2023

Take a look at this: #14 (comment)
I solved the CUDA out-of-memory issue by adding it to main.py.
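For context, a common TensorFlow-side fix for this kind of OOM is to enable memory growth so the framework stops pre-allocating the whole GPU. The exact snippet to add to main.py is in the linked comment; it typically looks something like this (a sketch of the general approach, not necessarily verbatim what #14 uses):

```python
import tensorflow as tf

# Make TensorFlow allocate GPU memory on demand instead of reserving
# (almost) all of it at startup. This must run before any op touches
# the GPUs, e.g. near the top of main.py.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```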

@mo666666

Ok, thank you very much!

@mo666666

Hi, Jaehoon! Did your training speed on 4×A100 improve? After re-checking my experiment, I found it is still quite slow: each GPU's utilization is around 50%. Have you found another trick to accelerate training, or can the author @yang-song provide some advice?

henryaddison added a commit to henryaddison/mlde that referenced this issue Mar 17, 2023
henryaddison added a commit to henryaddison/mlde that referenced this issue Mar 21, 2023