Speed optimization training from scratch #305
We finally solved the issue of different ranks running out of sync due to unequal numbers of batches per rank.
Speed impact of the all_reduce seems negligible if we are on a single machine (compared with and without all_reduce, measured on a g3.8xlarge with 2x M60). If this becomes more significant, we could do this sync only in the last phase (e.g. the final 10%) of an epoch, as ranks won't run out of batches before then.
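For context, here is a minimal sketch of the all_reduce-based sync described above, assuming `torch.distributed` is already initialized (e.g. with the NCCL backend). `local_loader` and `device` are illustrative names, not the actual FARM code:

```python
import torch
import torch.distributed as dist

def synchronized_batches(local_loader, device):
    """Yield batches only while *every* rank still has one.

    Each rank reports whether it produced a batch; an all_reduce with MIN
    lets all ranks agree to stop as soon as the first rank runs dry, so
    collective ops inside the training step never deadlock.
    """
    loader_iter = iter(local_loader)
    while True:
        try:
            batch = next(loader_iter)
            have_batch = torch.tensor(1, device=device)
        except StopIteration:
            batch = None
            have_batch = torch.tensor(0, device=device)

        # 1 only if all ranks still have a batch, 0 otherwise
        dist.all_reduce(have_batch, op=dist.ReduceOp.MIN)
        if have_batch.item() == 0:
            break
        yield batch
```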
Looking good! 🚀
The code seems reasonably separated from normal functions.
I let the QA training test run through to verify the gradient clipping; model performance is very similar (about 0.5% lower, which could be due to variance, though variance was usually lower).
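Since gradient clipping interacts with mixed precision, here is a hedged sketch of how clipping is typically done with the native `torch.cuda.amp` API (gradients must be unscaled before clipping). `model`, `batch`, `optimizer`, and `max_grad_norm` are placeholder names, not necessarily what this PR uses:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def training_step(model, batch, optimizer, max_grad_norm=1.0):
    optimizer.zero_grad()
    with autocast():
        loss = model(**batch)          # placeholder forward pass returning the loss
    scaler.scale(loss).backward()      # backward on the scaled loss

    scaler.unscale_(optimizer)         # bring gradients back to fp32 scale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    scaler.step(optimizer)             # skips the optimizer step if gradients overflowed
    scaler.update()
    return loss.item()
```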
Optimizing speed for training from scratch using DDP, AMP, and more DataLoader workers.
Thanks to @abhinavs95 we got some interesting insights into the effect of the above changes.
Let's merge some of the changes in his fork into master.
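To make the configurations below concrete, here is a hedged sketch of that kind of setup: DDP across 4 GPUs, native `torch.cuda.amp`, gradient accumulation, and a DataLoader with `num_workers=16`. Model, dataset, and hyperparameters are illustrative placeholders, not the actual FARM training code:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.cuda.amp import GradScaler, autocast

def train(rank, world_size, dataset, model, epochs=1,
          batch_size=80, accumulation_steps=3, num_workers=16):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                        num_workers=num_workers, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = GradScaler()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)            # reshuffle shards each epoch
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            batch = {k: v.cuda(rank, non_blocking=True) for k, v in batch.items()}
            with autocast():
                # placeholder: assumes the model returns its loss
                loss = model(**batch) / accumulation_steps
            scaler.scale(loss).backward()

            if (step + 1) % accumulation_steps == 0:
                scaler.step(optimizer)      # optimizer step every accumulation_steps batches
                scaler.update()
                optimizer.zero_grad()
```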
| | Baseline | DDP + AMP | DDP + AMP, num_workers=16 | DDP only, num_workers=16 |
|---|---|---|---|---|
| Batch size | 105 | 80 × 4 | 80 × 4 | 60 × 4 |
| Accumulation steps | 9 | 3 | 3 | 4 |
| Effective batch size | 945 | 960 | 960 | 960 |
| Max sequence length | 128 | 128 | 128 | 128 |
| Iterations measured | 5k | 4.2k | 4k | 5k |
| Time taken | 41 min | 38 min | 26 min | 45 min |
| Throughput (steps/hour) | 7300 | 6631 | 9231 | 6666 |
| Effective batches/hour | 811 | 2210 | 3077 | 1667 |
| Total batches | 500k | 500k · 945/960 ≈ 492187 | ≈ 492187 | ≈ 492187 |
| Estimated total training time | 616 h | 223 h | 160 h | 295 h |
(everything measured on a p3.8xlarge with 4x V100)
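For reference, the derived numbers above follow from the raw measurements like this (taking the DDP + AMP + num_workers=16 configuration as an example):

```python
iterations = 4000            # measured training steps
minutes = 26
accumulation_steps = 3

steps_per_hour = iterations / minutes * 60                        # ~9231
effective_batches_per_hour = steps_per_hour / accumulation_steps  # ~3077

# total effective batches, rescaled from effective batch size 945 to 960
total_batches = 500_000 * 945 / 960                               # ~492187

estimated_hours = total_batches / effective_batches_per_hour      # ~160
```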