Make checkpoint saving fully atomic #19970
Labels
checkpointing
Related to checkpointing
feature
Is an improvement or enhancement
help wanted
Open to be worked on
ver: 2.2.x
Bug description
Checkpoint saving is no longer atomic. It was implemented here to create checkpoints with a ".part" suffix and then rename them, but judging by the code here it is no longer implemented that way.
Could we implement it that way again? If you kill a training job while it is writing a checkpoint, the file will be corrupted.
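For illustration, the ".part"-then-rename approach described above could look roughly like the sketch below. This is not Lightning's actual code: `atomic_save` is a hypothetical helper, and `pickle` stands in for whatever serializer the checkpoint IO really uses. It assumes the temporary file and the final file live in the same directory, so the rename stays on one filesystem.

```python
import os
import pickle


def atomic_save(checkpoint, filepath):
    """Hypothetical helper: write to '<filepath>.part', then rename into place."""
    tmp_path = f"{filepath}.part"
    try:
        with open(tmp_path, "wb") as f:
            pickle.dump(checkpoint, f)
            f.flush()
            # Push the data to disk before the rename makes it visible.
            os.fsync(f.fileno())
        # os.replace overwrites the destination in a single rename step,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_path, filepath)
    except BaseException:
        # Clean up the partial file if serialization or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-write, only the ".part" file is left incomplete; any previous checkpoint at `filepath` remains untouched. A hard kill (e.g. SIGKILL) skips the cleanup branch, so a stale ".part" file may remain and would need to be ignored or removed on the next run.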
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response
cc @Borda @awaelchli