
Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, I found that the training process got stuck after completing one epoch, but it worked fine when using a single A800 card. May I ask what could be the cause of this situation? #13

Closed
cjt222 opened this issue Sep 7, 2023 · 8 comments

Comments

cjt222 commented Sep 7, 2023

Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, I found that the training process got stuck after completing one epoch, but it worked fine when using a single A800 card. May I ask what could be the cause of this situation?

@mihirp1998

Same issue after 2 epochs; the model seems to get stuck at "waiting for rewards".

Please let me know if you have a solution, @cjt222 or @kvablack?

cjt222 commented Sep 12, 2023

There is no solution yet; for now I am training on a single A800 card.

cjt222 commented Sep 12, 2023

It is not only the aesthetic model; other tasks run into the same issue.

@desaixie
Contributor

This is caused by accelerator.save_state() at this line. The bug appeared after a recent commit, which was meant to fix a bug in diffusers, as discussed in #10. A temporary workaround is to simply comment out that line. A complete fix is to follow this or this training script and use their LoRA save/load code to replace the corresponding code in the current DDPO training script.
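
For context, here is a minimal sketch of the hook-based approach those diffusers training scripts use. The names `accelerator` and `unet` are assumptions about the surrounding training script, and the hook signatures follow accelerate's register_save_state_pre_hook / register_load_state_pre_hook API:

```python
# Sketch only: assumes `accelerator` is an accelerate.Accelerator and
# `unet` is the diffusers UNet whose attention processors carry the
# LoRA weights.
def save_model_hook(models, weights, output_dir):
    # Save just the LoRA attention processors...
    unet.save_attn_procs(output_dir)
    # ...and pop the queued weights so accelerate does not also try to
    # serialize the full model state, which is where the reported
    # multi-GPU hang occurs.
    weights.pop()

def load_model_hook(models, input_dir):
    # Mirror image: skip accelerate's default load and restore the
    # attention processors ourselves.
    models.pop()
    unet.load_attn_procs(input_dir)

accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)
```

With the hooks registered, accelerator.save_state() checkpoints only the LoRA weights instead of the whole wrapped model.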

@kvablack
Owner

Thanks so much @desaixie for investigating this! Strangely enough, it works on my machine.

If you have a working version of the code, would you mind opening a pull request? To be honest, I never really wrapped my head around the accelerate save/load API.

cjt222 commented Sep 13, 2023

After testing, I can confirm that multi-card training works once accelerate is downgraded to 0.17. @kvablack @mihirp1998 @desaixie

cjt222 commented Sep 13, 2023

pip install accelerate==0.17

@kvablack
Owner

Thanks everyone, I added accelerate==0.17 to setup.py.
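
For anyone pinning a local checkout by hand, a sketch of what that pin might look like in setup.py (the package name and the rest of the file are assumptions, not the repo's exact contents):

```python
# setup.py (illustrative sketch, not the repo's actual file)
from setuptools import setup, find_packages

setup(
    name="ddpo-pytorch",  # assumed package name
    packages=find_packages(),
    install_requires=[
        # Pinned: newer accelerate releases were reported above to hang
        # in accelerator.save_state() during multi-GPU training.
        "accelerate==0.17",
    ],
)
```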
