Speed optimization training from scratch #305

Merged 41 commits into master on Jun 19, 2020

Conversation

@tholor (Member) commented Mar 31, 2020

Optimizing speed for training from scratch using:

  • DistributedDataParallel instead of DataParallel
  • AMP
  • more workers for StreamingDataSilo

Thanks to @abhinavs95, we gained some interesting insights into the effect of the changes above.
Let's merge some of the changes from his fork into master.
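For illustration, here is a minimal sketch of the three changes combined. It uses native `torch.cuda.amp` for mixed precision (FARM's actual AMP integration may differ), and `build_model` and `streaming_dataset` are placeholder names, not FARM APIs:

```python
# Minimal sketch (not FARM's actual code): DistributedDataParallel instead of
# DataParallel, automatic mixed precision, and more DataLoader workers.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# One process per GPU, launched e.g. via torch.distributed.launch
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)        # placeholder model factory
model = DDP(model, device_ids=[local_rank])   # DDP instead of DataParallel

loader = DataLoader(streaming_dataset, batch_size=80, num_workers=16, pin_memory=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # native AMP shown here for simplicity

accumulation_steps = 3
for step, batch in enumerate(loader):
    with torch.cuda.amp.autocast():
        # assumes the model returns a dict containing "loss"
        loss = model(**batch)["loss"] / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```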

Baseline
Batch_size = 105
Accumulation steps = 9
Effective batch size = 945
Max sequence length = 128
Iterations = 5k
Time taken = 41 mins
Throughput = 7300 steps / hour => 811 effective batches / hour (batch size 945)
Total batches = 500k
Total estimated training time = 616 hours

DDP + AMP
Batch_size = 80 * 4
Accumulation steps = 3
Effective batch size = 960
Max sequence length = 128
Iterations = 4.2k
Time taken = 38 mins
Throughput = 6631 steps / hour => 2210 effective batches / hour (batch size 960)
Total batches = 500k*945/960 = 492187
Total estimated training time = 223 hours

DDP + AMP + num workers=16
Batch_size = 80 * 4
Accumulation steps = 3
Effective batch size = 960
Max sequence length = 128
Iterations = 4k
Time taken = 26 mins
Throughput = 9231 steps / hour => 3077 effective batches / hour (batch size 960)
Total batches = 500k*945/960 = 492187
Total estimated training time = 160 hours

DDP only, num workers=16
Batch_size = 60 * 4
Accumulation steps = 4
Effective batch size = 960
Max sequence length = 128
Iterations = 5k
Time taken = 45 mins
Throughput = 6666 steps / hour => 1667 effective batches / hour (batch size 960)
Total batches = 500k*945/960 = 492187
Total estimated training time = 295 hours

(everything measured on a p3.8xlarge with 4x V100)
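To make the estimates above easy to reproduce, here is a small sketch of the arithmetic for the "DDP + AMP + num workers=16" run; the other configurations follow the same pattern:

```python
# Reproduce the estimate for the "DDP + AMP + num workers=16" run above.
iterations = 4000                   # measured steps
minutes = 26                        # measured wall-clock time
accumulation_steps = 3
effective_batch_size = 960
total_batches_baseline = 500_000    # planned effective batches at batch size 945

steps_per_hour = iterations / minutes * 60                            # ~9231
effective_batches_per_hour = steps_per_hour / accumulation_steps      # ~3077
total_batches = total_batches_baseline * 945 / effective_batch_size   # ~492187
print(f"estimated training time: {total_batches / effective_batches_per_hour:.0f} h")  # ~160 h
```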

@tholor self-assigned this on Apr 3, 2020
@tholor added the `enhancement`, `part: model`, and `part: trainer` labels on Apr 3, 2020

@tholor (Member, Author) commented May 13, 2020

We finally solved the issue of different ranks running out of sync.
The remaining steps before merging are:

  • shuffle the data every epoch
  • measure the speed impact of all_reduce()
  • estimate n_train_steps for original BERT-style sequence pairs to steer the LR schedule (see the sketch after this list)
  • fix saving/loading of checkpoints with existing directories
  • fix the ZeroDivisionError in eval
  • clean up logging
  • allow switching between the MLflow logger and stdout
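
A hedged sketch of how the n_train_steps estimate could steer a linear LR schedule; the corpus size, warmup fraction, and the use of transformers' get_linear_schedule_with_warmup are illustrative assumptions, not FARM's actual wiring:

```python
# Hypothetical sketch: estimate n_train_steps for the LR schedule when the
# number of BERT-style sequence pairs is only known approximately.
import torch
from transformers import get_linear_schedule_with_warmup

estimated_sequence_pairs = 470_000_000   # placeholder corpus estimate, not a real number
effective_batch_size = 960               # per-GPU batch * n_gpus * accumulation steps
epochs = 1

n_train_steps = estimated_sequence_pairs * epochs // effective_batch_size

model = torch.nn.Linear(8, 8)            # stand-in for the language model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * n_train_steps),   # warmup fraction is illustrative
    num_training_steps=n_train_steps,
)
```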

@tholor (Member, Author) commented May 14, 2020

Speed impact of all_reduce seems negligible if we are on a single machine:

With all_reduce:
500 steps => 5:15

Without:
500 steps => 5:14

(measured on a g3.8xlarge with 2x M60)

If this becomes more significant, we could do this sync only in the last phase (e.g. the last 10%) of an epoch, since ranks won't run out of batches before that...
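
For illustration, here is a sketch of the kind of all_reduce sync discussed here: each rank reports whether it still has batches, and all ranks stop together as soon as any rank runs dry. Names are illustrative, not FARM's implementation:

```python
# Illustrative sketch (not FARM's actual code): keep DDP ranks in sync when a
# streaming dataset can run out of batches on one rank before the others.
import torch
import torch.distributed as dist

def next_batch_or_stop(batch_iter, device):
    """Fetch the next batch; all ranks agree to stop once any rank is exhausted."""
    try:
        batch = next(batch_iter)
        exhausted = torch.tensor(0, device=device)
    except StopIteration:
        batch = None
        exhausted = torch.tensor(1, device=device)
    # If any rank is out of data, the MAX reduction makes every rank see 1.
    dist.all_reduce(exhausted, op=dist.ReduceOp.MAX)
    if exhausted.item() == 1:
        return None
    return batch
```

Every rank has to call this on every step; otherwise the collective would deadlock.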

@ghost changed the title from "WIP Speed optimization training from scratch" to "Speed optimization training from scratch" on Jun 18, 2020
@ghost requested reviews from @tanaysoni and @Timoeller on Jun 18, 2020

@tanaysoni (Contributor) left a comment

Looking good! 🚀

Review threads on Dockerfile-GPU and farm/modeling/optimization.py (outdated, resolved)

@Timoeller (Contributor) left a comment

The code seems reasonably well separated from the normal functions.

I let the QA training test run through to verify the gradient clipping; model performance is very similar (about 0.5% lower, which could be due to variance, though the run-to-run variance was usually smaller than that).

Review thread on farm/train.py (outdated, resolved)
@tholor merged commit 5151b36 into master on Jun 19, 2020