Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, I found that the training process got stuck after completing one epoch, but it worked fine when using a single A800 card. May I ask what could be the cause of this situation? #13
Comments
There is no solution yet, so for now I am training on a single A800 card.
The problem is not limited to the aesthetic model; other tasks hit the same issue.
This is caused by
Thanks so much @desaixie for investigating this! Strangely enough, it works on my machine. If you have a working version of the code, would you mind opening a pull request? To be honest, I never really wrapped my head around the accelerate save/load API.
After testing, I can confirm that multi-card training works once accelerate is downgraded to version 0.17. @kvablack @mihirp1998 @desaixie
pip install accelerate==0.17
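For anyone who wants to confirm which accelerate release is active before launching a multi-card run, here is a minimal sketch. It only reads the installed package metadata and compares the major.minor part with the 0.17 pin reported above. The helper names (`major_minor`, `matches_pin`, `installed_accelerate_version`) are made up for illustration and are not part of accelerate or of this repository.

```python
import re
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '0.17.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def matches_pin(version: str, pin: str = "0.17") -> bool:
    """True if `version` has the same major.minor as `pin`."""
    return major_minor(version) == major_minor(pin)


def installed_accelerate_version():
    """Return the installed accelerate version, or None if it is not installed."""
    try:
        return metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_accelerate_version()
    if found is None:
        print("accelerate is not installed")
    elif matches_pin(found):
        print(f"accelerate {found} matches the 0.17 pin from this thread")
    else:
        print(f"accelerate {found} differs from the 0.17 pin from this thread")
```

Run it in the same environment you use for training. A mismatch does not prove the hang will occur; it only shows that the environment differs from the one reported to work here.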
Thanks everyone, I added