
Question regarding the gradient accumulation #7

Closed
lehduong opened this issue Oct 18, 2023 · 5 comments

lehduong commented Oct 18, 2023

Hi, thanks for your implementation.

I noticed that you accumulate the gradients of two models with two separate context managers (here). Could you let me know whether you have verified your implementation with a gradient accumulation step different from 1? Apparently, this approach can be erroneous according to this and the follow-up comments. I believe newer versions of HF's accelerate already allow the context manager to receive multiple models, as in here.
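
For concreteness, here is a minimal, self-contained sketch of the nested two-context-manager pattern in question. The modules, loss, and data are dummies (only the names `unet` and `cc_projection` are taken from this repo), so this is an illustration of the pattern rather than the repo's actual training code:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=8)

# Stand-in modules for the two trained parts (in the repo: the UNet and the camera-pose projection).
unet = torch.nn.Linear(4, 4)
cc_projection = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(
    list(unet.parameters()) + list(cc_projection.parameters()), lr=1e-4
)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 4), batch_size=8)

unet, cc_projection, optimizer, dataloader = accelerator.prepare(
    unet, cc_projection, optimizer, dataloader
)

for batch in dataloader:
    # Each model gets its own accumulate() context, nested one inside the other.
    # This is the pattern whose correctness for accumulation steps > 1 is questioned above.
    with accelerator.accumulate(unet):
        with accelerator.accumulate(cc_projection):
            loss = (cc_projection(unet(batch)) ** 2).mean()  # dummy loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
```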

kxhit (Owner) commented Oct 19, 2023

Hi @lehduong, thanks for pointing this out! Yes, in Zero123 the gradient accumulation step is 1, so this doesn't matter there. I know people are working on fixing it for more than one model. According to your link, the correct way to handle multiple models is with accelerator.accumulate(model1, model2)? I would be very happy if you want to open a PR! Thank you!

lehduong (Author) commented Oct 20, 2023

Yes, you only need to change this line to with accelerator.accumulate(unet, cc_projection) and use accelerate >= 0.23 (I'm not sure that is the earliest version supporting this feature, but it is the one I'm using). I ran a quick experiment and attached the image below comparing the training loss of the two accumulation approaches. I used a resolution of 512, a (per-device) batch size of 24, and gradient_accumulation_steps set to 8 (an effective batch size of 1536, as in the original implementation).

[Screenshot (2023-10-20): training-loss curves comparing the two gradient accumulation approaches]
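
Concretely, the fix described above amounts to a one-line change in the training loop (continuing the dummy setup from the sketch in the first comment; per the discussion, it needs an accelerate version with multi-model support, reportedly >= 0.23):

```python
# Same setup as the earlier sketch; only the context manager changes.
for batch in dataloader:
    # One accumulate() call wrapping both models, as suggested above.
    with accelerator.accumulate(unet, cc_projection):
        loss = (cc_projection(unet(batch)) ** 2).mean()  # dummy loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

With the per-device batch size of 24, 8 GPUs, and gradient_accumulation_steps = 8, the effective batch size is 24 × 8 × 8 = 1536, matching the number quoted above.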

kxhit added a commit that referenced this issue Oct 21, 2023

kxhit (Owner) commented Oct 21, 2023

Thanks a lot for your contribution! I've just pushed a fix; please reopen the issue if anything needs further changes!

kxhit closed this as completed Oct 21, 2023

cfeng16 commented Oct 29, 2023

Hi @lehduong, may I ask whether you used gradient accumulation for multiple models in a distributed training setting (say, multiple GPUs)?

lehduong (Author) commented Oct 29, 2023

I trained the model on a single machine (8 GPUs).
