Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix checkpointing to remote file paths #2925

Merged

Conversation

f4hy
Copy link
Contributor

@f4hy f4hy commented Aug 12, 2020

What does this PR do?

Fixes #2916

Better handling of file paths when checkpointing.

  • Never try to makedirs on directories which exist
  • When writing checkpoints, write to a buffer then write the buffer to the desired out. This prevents the temporary .part file from being created. Which if remote could be slower.
  • Warn in a corner case with certain dependencies that remote directories are supported but delete operations are not.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team August 12, 2020 05:06
@codecov
Copy link

codecov bot commented Aug 12, 2020

Codecov Report

Merging #2925 into master will increase coverage by 4%.
The diff coverage is 79%.

@@           Coverage Diff           @@
##           master   #2925    +/-   ##
=======================================
+ Coverage      86%     90%    +4%     
=======================================
  Files          80      80            
  Lines        7449    7462    +13     
=======================================
+ Hits         6430    6730   +300     
+ Misses       1019     732   -287     

@Borda Borda added the bug Something isn't working label Aug 12, 2020
Comment on lines +273 to +282
bytesbuffer = io.BytesIO()
# Can't use the new zipfile serialization for 1.6.0 because there's a bug in
# torch.hub.load_state_dict_from_url() that prevents it from loading the new files.
# More details can be found here: https://github.com/pytorch/pytorch/issues/42239
if LooseVersion(torch.__version__).version[:3] == [1, 6, 0]:
torch.save(checkpoint, tmp_path, _use_new_zipfile_serialization=False)
torch.save(checkpoint, bytesbuffer, _use_new_zipfile_serialization=False)
else:
torch.save(checkpoint, tmp_path)
os.replace(tmp_path, filepath)
torch.save(checkpoint, bytesbuffer)
with cloud_open(filepath, 'wb') as f:
f.write(bytesbuffer.getvalue())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we make this as a function too? when we have one-liner is_remote_path

@mergify mergify bot requested a review from a team August 12, 2020 08:25
@williamFalcon williamFalcon merged commit 56396ab into Lightning-AI:master Aug 12, 2020
ameliatqy pushed a commit to ameliatqy/pytorch-lightning that referenced this pull request Aug 17, 2020
@Borda Borda added this to the 0.9.0 milestone Aug 20, 2020
tarepan added a commit to tarepan/pytorch-lightning that referenced this pull request Oct 29, 2020
Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. Lightning-AI#2925),
so this commit do not add new functionality.
tarepan added a commit to tarepan/pytorch-lightning that referenced this pull request Oct 30, 2020
Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. Lightning-AI#2925),
so this commit do not add new functionality.
Borda added a commit that referenced this pull request Jan 5, 2021
…4402)

* Add empty resume_from_checkpoint acceptance #4366

* Fix general error catch with focused file check

* Add fsspec HTTP extras

Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. #2925),
so this commit do not add new functionality.

* Fix potential too much logging in DDP

* Add PR changelog

* Add well-written argument explanation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix DDP-compatible restore logging

Notify from where the states are restored.
This feature temporally deleted as a result of PR review.
With succeeding review, added with DDP compatibility.

* Fix utility import pathes

* Refactor load step commentaries

* Refactor hpc ckpt suffix acquisition

* Refactor restore/hpc_load match

* Refactor hpc load trial

* Refactor checkpoint dir check

* Refactor unneeded function nest

* Refactor nested If

* Refactor duplicated cache clear

* Refactor attempt flow with if/elif

* Fix pip8

* Refactor hook commentary

Co-authored-by: chaton <thomas@grid.ai>

* Fix pep8

* Refactor hpc load checkpoint path acquisition

* Fix pip8

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix doc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Refactor None Union type with Optional

* Fix build-doc CI failure debuged in #5329

* Fix fsspec import during build-doc #5329

* Fix test epoch

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix test with latest test models

* .

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
Borda pushed a commit that referenced this pull request Jan 6, 2021
…4402)

* Add empty resume_from_checkpoint acceptance #4366

* Fix general error catch with focused file check

* Add fsspec HTTP extras

Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. #2925),
so this commit do not add new functionality.

* Fix potential too much logging in DDP

* Add PR changelog

* Add well-written argument explanation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix DDP-compatible restore logging

Notify from where the states are restored.
This feature temporally deleted as a result of PR review.
With succeeding review, added with DDP compatibility.

* Fix utility import pathes

* Refactor load step commentaries

* Refactor hpc ckpt suffix acquisition

* Refactor restore/hpc_load match

* Refactor hpc load trial

* Refactor checkpoint dir check

* Refactor unneeded function nest

* Refactor nested If

* Refactor duplicated cache clear

* Refactor attempt flow with if/elif

* Fix pip8

* Refactor hook commentary

Co-authored-by: chaton <thomas@grid.ai>

* Fix pep8

* Refactor hpc load checkpoint path acquisition

* Fix pip8

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix doc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Refactor None Union type with Optional

* Fix build-doc CI failure debuged in #5329

* Fix fsspec import during build-doc #5329

* Fix test epoch

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix test with latest test models

* .

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>

(cherry picked from commit b0051e8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ModelCheckpoint with custom filepath don't support training on multiple nodes
3 participants