Training extremely slow #11
I had the same issue when I used
This work assumes you have enough RAM on your server to store the entire dataset. Otherwise, I do not think this work is helpful for distributed training, since
@afzalxo Thanks for the information. My server has 384GB of RAM while the size of
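For reference, a quick sanity check along these lines can be sketched in a few lines of Python. The function name, the 0.8 headroom factor, and the example sizes (taken from the numbers mentioned in this thread) are illustrative assumptions, not values from the codebase:

```python
# Sketch: decide whether caching the whole dataset in RAM is safe.
# The 0.8 headroom factor is an assumption, leaving room for the OS
# page cache, dataloader workers, and the model itself.

def fits_in_ram(dataset_bytes: int, total_ram_bytes: int, headroom: float = 0.8) -> bool:
    """Return True if the dataset fits within `headroom` of total RAM."""
    return dataset_bytes <= headroom * total_ram_bytes

GiB = 1024 ** 3
# Numbers from this thread: a ~339GB dataset on a 384GB server.
print(fits_in_ram(339 * GiB, 384 * GiB))  # 339/384 ~ 0.88 of RAM -> False
print(fits_in_ram(100 * GiB, 384 * GiB))  # comfortably under -> True
```

With an 0.8 headroom, 339GB on a 384GB server is flagged as too tight, which matches the behavior reported here.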
This is a bit strange though. Could you share how much lower is your accuracy? The maximum resolution used by the training script is only 224x224 (more precisely, 192x192 for training and 224x224 for validation), so I wouldn't expect too much difference between re-scaling from 400x400 and re-scaling from 500x500.
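To make the resolution argument concrete, here is a small sketch of the downscale factors involved; the function is purely illustrative arithmetic, using the resolutions mentioned above (192px train, 224px val):

```python
# Sketch: both 400px and 500px stored images are downscaled well past the
# training/validation resolutions, so the extra stored detail is discarded
# either way.

def downscale_factor(stored_px: int, target_px: int) -> float:
    """Linear downscale factor from stored resolution to target resolution."""
    return stored_px / target_px

for stored in (400, 500):
    for target, split in ((192, "train"), (224, "val")):
        print(f"{stored}px -> {target}px ({split}): {downscale_factor(stored, target):.2f}x")
```

Since every factor is above 1, re-scaling from 400px or 500px should yield similar effective inputs, which is why a large accuracy gap would be surprising.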
So what you are saying is that if we don't use
Interesting. I tried this experiment on two different servers, although both have less memory than 339GB, and I faced the same issue as you on both. When I utilized
I don't have exact numbers right now since the logs are in a messy state, but the difference was around 2% IIRC. However, I ran my experiments with 160px input throughout rather than with progressive resizing to 192px. So the accuracy gap is likely due to 1) different input resolution and 2) different input dataset configuration of
I'm not sure about this point. The authors do mention that
No, not at all. I am saying that
Why are you using
@afzalxo Thanks for your reply.
I think you're probably right. 339GB is not that far from the limit of my server. I'll try again with
Makes sense!
I see. I didn't know that
That server only has 3 CPUs per GPU. There's another server I can use that has 10 CPUs per GPU but only half the RAM of the former (~180GB). I guess I'll have to play a bit with the parameters of
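A rough heuristic for that tuning can be sketched as follows. The cap of 8 workers and the one CPU reserved for the main training process are assumptions on my part, not recommendations from this repo:

```python
# Sketch: pick a per-GPU dataloader worker count from available CPUs.
# Assumptions: reserve 1 CPU for the training process, cap at 8 workers.

def workers_per_gpu(cpus_per_gpu: int, cap: int = 8) -> int:
    """Leave one CPU for the main process; never go below 1 worker."""
    return max(1, min(cap, cpus_per_gpu - 1))

print(workers_per_gpu(3))   # 3 CPUs per GPU (first server)  -> 2
print(workers_per_gpu(10))  # 10 CPUs per GPU (second server) -> 8
```

With only 3 CPUs per GPU, the input pipeline has very little parallelism to work with, which is consistent with the slowdown described in this thread.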
Hello,
I followed the README closely and launched a training run using the following command on a server with 8 V100 GPUs:
Training took almost an hour per epoch, and the second epoch is almost as slow as the first one. The output of the log file is as follows:
Is there anything I should check?
Thank you in advance for your response.
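One thing worth checking in a situation like this is whether the time is going into data loading or into compute. A minimal, framework-agnostic sketch (the `loader` and `step` below are stand-ins for a real DataLoader and training step, and the sleeping toy loader is purely illustrative):

```python
# Sketch: split one epoch's wall time into data-loading vs compute time,
# to see whether the input pipeline is the bottleneck.

import time

def profile_epoch(loader, step):
    """Return (data_seconds, compute_seconds) for one pass over `loader`."""
    data_s = compute_s = 0.0
    t0 = time.perf_counter()
    for batch in loader:
        t1 = time.perf_counter()
        data_s += t1 - t0          # time spent waiting on the next batch
        step(batch)                # training step (forward/backward/optim)
        t0 = time.perf_counter()
        compute_s += t0 - t1       # time spent computing
    return data_s, compute_s

# Toy usage: a "loader" that sleeps to mimic slow I/O, and a no-op step.
def slow_loader():
    for i in range(3):
        time.sleep(0.01)
        yield i

data_s, compute_s = profile_epoch(slow_loader(), step=lambda b: None)
print(f"data: {data_s:.3f}s, compute: {compute_s:.3f}s")
```

If `data_s` dominates, the GPUs are starved by the input pipeline (RAM caching or worker count, as discussed above) rather than by the model itself.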