Speed optimization training from scratch (#305)
* Add changes for DDP + AMP
* Remove initialize_device_settings from example script
* Enable checkpointing again; change params
* WIP: fix checkpointing for DDP; fix StreamingDataSilo for non-DDP; switch to WrappedDDP
* WIP: reproducibility of runs; add seeds
* Clean up
* Update params in example
* WIP: adjust SageMaker script
* Improve logging for SageMaker
* Update example scripts
* Update trainer_state_dict and eval only in main process
* Fix epoch in tqdm bar
* Catch failing reports
* Fix desynchronization issue in distributed training with unequal number of batches per rank
* Fix numbering of steps for saving/resuming
* Move all_reduce sync into a separate function
* Minor cleaning
* Add heuristic estimate of samples
* Simplify grouper in StreamingDataSet
* Update example scripts
* Don't allow more docs for the estimate than for actual training
* Fix estimate for max_docs=None
* Add shuffling of data for StreamingDataSilo
* Add randomization of file
* Fix filepath conversion
* Write remainder docs to file in randomize_and_split_file()
* Remove #TODO
* Smaller fixes: gradient clipping, AMP support
* Change args; fix file splitting in distributed mode; log learning rate
* Fix filepath for splitting
* Add Dockerfile for SageMaker training from scratch
* Simplify calculation of optimization steps
* Add option to disable gradient clipping
* Simplify Dockerfile
* Update docs and default for gradient clipping

Co-authored-by: Abhinav Sharma <abhinav0301@gmail.com>
Co-authored-by: Tanay Soni <tanaysoni12@gmail.com>
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
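The "randomization of file" / randomize_and_split_file() items above can be sketched roughly as follows. This is a hypothetical, stdlib-only illustration of shuffling a file of documents and writing it out in fixed-size splits (with the remainder in its own file); the function name, parameters, and details here are assumptions, not FARM's actual implementation:

```python
import random
from pathlib import Path

def randomize_and_split_file(path, output_dir, docs_per_split=1000, seed=42):
    """Shuffle the lines (docs) of a text file and write them out as
    fixed-size split files; remainder docs go into a final, smaller file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines(keepends=True)
    random.Random(seed).shuffle(lines)  # seeded, so runs stay reproducible
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    split_paths = []
    for i in range(0, len(lines), docs_per_split):
        split_path = out / f"split_{i // docs_per_split}.txt"
        split_path.write_text("".join(lines[i:i + docs_per_split]), encoding="utf-8")
        split_paths.append(split_path)
    return split_paths
```

Pre-splitting the shuffled corpus into files like this lets each distributed rank stream its own subset without loading the whole corpus into memory.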
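The "simplify calculation of optimization steps" item reduces to a small formula: one optimizer update happens per (batch_size * gradient_accumulation_steps) samples on each rank. A minimal sketch, with illustrative names that are assumptions rather than FARM's API:

```python
import math

def calc_optimization_steps(n_samples, batch_size, grad_acc_steps, n_epochs, world_size=1):
    """Estimate the total number of optimizer updates for a training run.
    Samples are sharded across ranks; each rank steps the optimizer once
    per grad_acc_steps batches."""
    samples_per_rank = math.ceil(n_samples / world_size)
    batches_per_rank = math.ceil(samples_per_rank / batch_size)
    steps_per_epoch = math.ceil(batches_per_rank / grad_acc_steps)
    return steps_per_epoch * n_epochs
```

An estimate like this is what a linear-warmup learning-rate schedule needs up front, which is presumably why the commit also adds a heuristic estimate of samples for streaming data.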
1 parent: 06e45f9
Commit: 5151b36
11 changed files with 527 additions and 163 deletions.
New file (a Dockerfile for the SageMaker training image, +6 lines):

```dockerfile
FROM deepset/farm-gpu:latest

COPY examples examples
#COPY data/test data/test

# ENV SAGEMAKER_PROGRAM train.py
ENTRYPOINT ["python3", "-m", "torch.distributed.launch", "--nproc_per_node=4", "examples/train_from_scratch_with_sagemaker.py"]
```
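The ENTRYPOINT above uses torch.distributed.launch, which starts nproc_per_node copies of the training script and (on the pre-torchrun PyTorch versions this commit targets) injects a --local_rank argument into each one. The script therefore has to parse that argument itself; a minimal sketch of the receiving side, with the launcher's invocation simulated:

```python
import argparse

# torch.distributed.launch invokes each worker roughly as:
#   python3 examples/train_from_scratch_with_sagemaker.py --local_rank=<rank>
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0,
                    help="rank of this process on the node, set by the launcher")
# parse_known_args so the script's own flags pass through untouched;
# here we simulate the launcher handing us rank 2 of 4
args, _ = parser.parse_known_args(["--local_rank=2"])
# each process would then pin its GPU, e.g. torch.cuda.set_device(args.local_rank)
```

In a real run the argument comes from sys.argv rather than a hardcoded list; the simulated value here is only for illustration.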