Training process for multi-GPUs #38

Open
Jaehoon-zx opened this issue Feb 21, 2023 · 5 comments

Comments

@Jaehoon-zx

Hi, I am trying to run training/evaluation with 4 A100s.
However, after some experiments I noticed that the training speed was the same as when training with a single GPU.
Am I missing something?

@mo666666

Hello, Jaehoon. I encountered the same problem. I conjecture this is because your TensorFlow package is not installed correctly. I recommend following the tips at https://www.tensorflow.org/install/pip step by step. This may help you solve the problem.
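A quick sanity check, based on the verification step from that install page, is to confirm that TensorFlow actually sees all four GPUs; if the list comes back empty, the GPU build is not set up correctly:

```python
import tensorflow as tf

# On a correctly configured 4xA100 machine this should print four
# PhysicalDevice entries; an empty list means TensorFlow is running
# CPU-only, so multi-GPU training cannot be any faster.
print(tf.config.list_physical_devices('GPU'))
```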

@mo666666

However, after solving the above issue, as a 4×A100 user I also run into a CUDA out-of-memory error. Do you encounter this issue with the code in this repository?

@Jaehoon-zx

Jaehoon-zx commented Feb 25, 2023

Take a look at this: #14 (comment)
I solved the CUDA out-of-memory issue by adding it to main.py.
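For context, a common TensorFlow-side fix for this kind of OOM is to enable memory growth so the framework stops pre-allocating the whole GPU. The exact snippet to add to main.py is in the linked comment; it typically looks something like this (a sketch of the general approach, not necessarily verbatim what #14 uses):

```python
import tensorflow as tf

# Make TensorFlow allocate GPU memory on demand instead of reserving
# (almost) all of it at startup. This must run before any op touches
# the GPUs, e.g. near the top of main.py.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```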

@mo666666

Ok, thank you very much!

@mo666666

Hi, Jaehoon! Did your training speed on 4×A100 improve? After re-checking my experiment, I found it is still quite slow: each GPU's utilization is around 50%. Have you found another trick to accelerate training, or can the author @yang-song provide some advice?

henryaddison added a commit to henryaddison/mlde that referenced this issue Mar 17, 2023
henryaddison added a commit to henryaddison/mlde that referenced this issue Mar 21, 2023