
Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, I found that the training process got stuck after completing one epoch, but it worked fine when using a single A800 card. May I ask what could be the cause of this situation? #13

Closed
cjt222 opened this issue Sep 7, 2023 · 8 comments

Comments

cjt222 commented Sep 7, 2023

Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, I found that the training process got stuck after completing one epoch, but it worked fine when using a single A800 card. May I ask what could be the cause of this situation?

@mihirp1998

Same issue after 2 epochs; the model seems to get stuck at "waiting for rewards".

Please let me know if you have a solution, @cjt222 or @kvablack?

cjt222 commented Sep 12, 2023

There is no solution yet; for now I am training on a single A800 card.

cjt222 commented Sep 12, 2023

It is not only the aesthetic model; other tasks run into the same issue.

@desaixie
Contributor

This is caused by accelerator.save_state() at this line. The bug appeared after a recent commit, which was meant to fix a bug in diffusers, as discussed in #10. A temporary workaround is to simply comment out that line. A complete fix is to follow this or this training script and use their LoRA save/load code to replace the corresponding code in the current DDPO training script.
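
For context, here is a minimal sketch of the hook-based approach those diffusers training scripts use. The names `accelerator` and `unet` are assumptions about the surrounding training script, and the hook signatures follow accelerate's register_save_state_pre_hook / register_load_state_pre_hook API:

```python
# Sketch only: assumes `accelerator` is an accelerate.Accelerator and
# `unet` is the diffusers UNet whose attention processors carry the
# LoRA weights.
def save_model_hook(models, weights, output_dir):
    # Save just the LoRA attention processors...
    unet.save_attn_procs(output_dir)
    # ...and pop the queued weights so accelerate does not also try to
    # serialize the full model state, which is where the reported
    # multi-GPU hang occurs.
    weights.pop()

def load_model_hook(models, input_dir):
    # Mirror image: skip accelerate's default load and restore the
    # attention processors ourselves.
    models.pop()
    unet.load_attn_procs(input_dir)

accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)
```

With the hooks registered, accelerator.save_state() checkpoints only the LoRA weights instead of the whole wrapped model.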

@kvablack
Owner

Thanks so much @desaixie for investigating this! Strangely enough, it works on my machine.

If you have a working version of the code, would you mind opening a pull request? To be honest, I never really wrapped my head around the accelerate save/load API.

cjt222 commented Sep 13, 2023

After testing, I can confirm that multi-card training works once accelerate is downgraded to 0.17. @kvablack @mihirp1998 @desaixie

cjt222 commented Sep 13, 2023

pip install accelerate==0.17

@kvablack
Owner

Thanks everyone, I added accelerate==0.17 to setup.py.
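
For anyone pinning a local checkout by hand, a sketch of what that pin might look like in setup.py (the package name and the rest of the file are assumptions, not the repo's exact contents):

```python
# setup.py (illustrative sketch, not the repo's actual file)
from setuptools import setup, find_packages

setup(
    name="ddpo-pytorch",  # assumed package name
    packages=find_packages(),
    install_requires=[
        # Pinned: newer accelerate releases were reported above to hang
        # in accelerator.save_state() during multi-GPU training.
        "accelerate==0.17",
    ],
)
```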
