Speed optimization training from scratch #305
We finally solved the issue of different ranks running out of sync due to unequal numbers of batches per rank.
Speed impact of the all_reduce seems negligible if we are on a single machine (compared with and without all_reduce, measured on a g3.8xlarge with 2x M60). If this becomes more significant, we could do this sync only in the last phase (e.g. the final 10%) of an epoch, as ranks won't run out of batches before then.
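For context, here is a minimal sketch of the all_reduce-based sync described above, assuming `torch.distributed` is already initialized (e.g. with the NCCL backend). `local_loader` and `device` are illustrative names, not the actual FARM code:

```python
import torch
import torch.distributed as dist

def synchronized_batches(local_loader, device):
    """Yield batches only while *every* rank still has one.

    Each rank reports whether it produced a batch; an all_reduce with MIN
    lets all ranks agree to stop as soon as the first rank runs dry, so
    collective ops inside the training step never deadlock.
    """
    loader_iter = iter(local_loader)
    while True:
        try:
            batch = next(loader_iter)
            have_batch = torch.tensor(1, device=device)
        except StopIteration:
            batch = None
            have_batch = torch.tensor(0, device=device)

        # 1 only if all ranks still have a batch, 0 otherwise
        dist.all_reduce(have_batch, op=dist.ReduceOp.MIN)
        if have_batch.item() == 0:
            break
        yield batch
```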
Looking good! 🚀
The code seems reasonably separated from normal functions.
I let the QA training test run through to verify the gradient clipping; model performance is very similar (about 0.5% lower, which could be due to variance, though variance was usually lower).
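Since gradient clipping interacts with mixed precision, here is a hedged sketch of how clipping is typically done with the native `torch.cuda.amp` API (gradients must be unscaled before clipping). `model`, `batch`, `optimizer`, and `max_grad_norm` are placeholder names, not necessarily what this PR uses:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def training_step(model, batch, optimizer, max_grad_norm=1.0):
    optimizer.zero_grad()
    with autocast():
        loss = model(**batch)          # placeholder forward pass returning the loss
    scaler.scale(loss).backward()      # backward on the scaled loss

    scaler.unscale_(optimizer)         # bring gradients back to fp32 scale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    scaler.step(optimizer)             # skips the optimizer step if gradients overflowed
    scaler.update()
    return loss.item()
```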
Optimizing speed for training from scratch using DDP, AMP, and more DataLoader workers.
Thanks to @abhinavs95 we got some interesting insights into the effect of the above changes.
Let's merge some of the changes in his fork into master.
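To make the configurations below concrete, here is a hedged sketch of that kind of setup: DDP across 4 GPUs, native `torch.cuda.amp`, gradient accumulation, and a DataLoader with `num_workers=16`. Model, dataset, and hyperparameters are illustrative placeholders, not the actual FARM training code:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.cuda.amp import GradScaler, autocast

def train(rank, world_size, dataset, model, epochs=1,
          batch_size=80, accumulation_steps=3, num_workers=16):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                        num_workers=num_workers, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = GradScaler()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)            # reshuffle shards each epoch
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            batch = {k: v.cuda(rank, non_blocking=True) for k, v in batch.items()}
            with autocast():
                # placeholder: assumes the model returns its loss
                loss = model(**batch) / accumulation_steps
            scaler.scale(loss).backward()

            if (step + 1) % accumulation_steps == 0:
                scaler.step(optimizer)      # optimizer step every accumulation_steps batches
                scaler.update()
                optimizer.zero_grad()
```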
| | Baseline | DDP + AMP | DDP + AMP, num_workers=16 | DDP only, num_workers=16 |
|---|---|---|---|---|
| Batch size | 105 | 80 × 4 | 80 × 4 | 60 × 4 |
| Accumulation steps | 9 | 3 | 3 | 4 |
| Effective batch size | 945 | 960 | 960 | 960 |
| Max sequence length | 128 | 128 | 128 | 128 |
| Iterations measured | 5k | 4.2k | 4k | 5k |
| Time taken | 41 min | 38 min | 26 min | 45 min |
| Throughput (steps/hour) | 7300 | 6631 | 9231 | 6666 |
| Effective batches/hour | 811 | 2210 | 3077 | 1667 |
| Total batches | 500k | 500k · 945/960 ≈ 492187 | ≈ 492187 | ≈ 492187 |
| Estimated total training time | 616 h | 223 h | 160 h | 295 h |
(everything measured on a p3.8xlarge with 4x V100)
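For reference, the derived numbers above follow from the raw measurements like this (taking the DDP + AMP + num_workers=16 configuration as an example):

```python
iterations = 4000            # measured training steps
minutes = 26
accumulation_steps = 3

steps_per_hour = iterations / minutes * 60                        # ~9231
effective_batches_per_hour = steps_per_hour / accumulation_steps  # ~3077

# total effective batches, rescaled from effective batch size 945 to 960
total_batches = 500_000 * 945 / 960                               # ~492187

estimated_hours = total_batches / effective_batches_per_hour      # ~160
```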