Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make checkpoint saving fully atomic #19970

Closed
radomirgr opened this issue Jun 12, 2024 · 3 comments · Fixed by #20011
Closed

Make checkpoint saving fully atomic #19970

radomirgr opened this issue Jun 12, 2024 · 3 comments · Fixed by #20011
Labels
checkpointing Related to checkpointing feature Is an improvement or enhancement help wanted Open to be worked on ver: 2.2.x

Comments

@radomirgr
Copy link

radomirgr commented Jun 12, 2024

Bug description

Checkpoint is no longer atomic. I was implemented here to create checkpoints with ".part" and the rename them, but it is no longer implemented that way looking at code here.

Could we implemented it again in that way? If you kill a job that is training during checkpoint it will corrupted the file

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda @awaelchli

@radomirgr radomirgr added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jun 12, 2024
@awaelchli
Copy link
Contributor

@radomirgr Feel free to implement this if you'd like. The contribution is welcome.

@awaelchli awaelchli added feature Is an improvement or enhancement checkpointing Related to checkpointing and removed bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jun 17, 2024
@awaelchli awaelchli changed the title Checkpoint saving isn't atomic Make checkpoint saving fully atomic Jun 17, 2024
@awaelchli awaelchli added the help wanted Open to be worked on label Jun 22, 2024
@corwinjoy
Copy link
Contributor

This looks straightforward. I will go ahead and create a PR to restore this behavior.

@corwinjoy
Copy link
Contributor

I have created a PR to address this which uses a fsspec transaction context. This should make saves atomic again.
#20011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
checkpointing Related to checkpointing feature Is an improvement or enhancement help wanted Open to be worked on ver: 2.2.x
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants