[Model] Support Multi-GPU for Transformer model #356
Conversation
@yzh119 Let's work together on this PR. Currently, there seems to be a bug in this PR: if you use the grad_accum argument to accumulate gradients over multiple batches, the loss drops very slowly. Closing for now; will reopen when we finish.
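For context, the comment above concerns gradient accumulation. Below is a minimal sketch of how a `grad_accum` option is typically wired into a PyTorch training loop; apart from `grad_accum`, the function and argument names are hypothetical and the PR's actual loop may be organized differently. The key detail is dividing the loss by `grad_accum` so that the accumulated gradient matches what a single large batch would produce.

```python
import torch

def train_epoch(model, optimizer, criterion, data_loader, grad_accum=1):
    # Hypothetical loop: accumulate gradients over `grad_accum` mini-batches
    # before taking one optimizer step, so the effective batch size grows
    # without extra memory.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = criterion(model(inputs), targets)
        # Scale the loss so the accumulated gradients average over the
        # mini-batches instead of summing.
        (loss / grad_accum).backward()
        if (step + 1) % grad_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```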
I have no write access to your repo, so I put the updated version here.
@yzh119 just added you as a collaborator. Can you move your changes here?
@yzh119 I think I have finished this PR and fixed the synchronization bug. Training on 4 GPUs now behaves like single-GPU training. Can you review again so we can merge?
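For reference, one common way to make N-GPU training match single-GPU behavior is to average gradients across workers after `backward()` and before the optimizer step. The sketch below uses `torch.distributed.all_reduce` for that averaging; it illustrates the general technique and is not necessarily the exact mechanism used in this PR.

```python
import torch.distributed as dist

def synchronize_gradients(model, world_size):
    # Sum each parameter's gradient across all workers, then divide by the
    # number of workers so every process holds the averaged gradient.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```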
Description
This PR uses multi-process training to parallelize the Transformer model. With batch size 128, 4 GPUs give a 3.3x speedup; with batch size 4096, 4 GPUs give a 3.8x speedup.
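A minimal sketch of the multi-process setup described above, using `torch.multiprocessing` to spawn one worker per GPU and `torch.distributed` with the NCCL backend for communication. The entry-point name, address, and port are placeholders, and the real example script may be structured differently.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # Hypothetical per-GPU worker: one process per device.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address/port
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # Build the model and this rank's data shard here, then run the training
    # loop with gradient synchronization after each backward pass.

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(run_worker, args=(n_gpus,), nprocs=n_gpus)
```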
Checklist
Please feel free to remove inapplicable items for your PR.
Examples are either not affected by this change, or have been fixed to be compatible with this change