They don't have a success file but are in GCS, so orbax thinks they're temporary and cleans them up.
I would suggest always or never saving the COMMIT_SUCCESS file.
This is not blocking me (easy to just write extra commit success files once I found this) but it felt like I should report because it was very unexpected behavior and moving around checkpoints is super common.
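For anyone hitting the same thing, the workaround mentioned above (writing the missing success files by hand) could look roughly like the sketch below. It is only an illustration under assumptions: the marker name `commit_success.txt`, the empty file contents, and the layout of one checkpoint per immediate subdirectory are not confirmed in this thread, and `add_commit_markers` is a made-up helper, not part of orbax. Check the marker name in utils.py for your orbax version before relying on it.

```python
from pathlib import Path

# Assumption: the marker filename orbax looks for. Not confirmed in this
# thread; verify it against utils.py in your orbax version.
MARKER_NAME = "commit_success.txt"


def add_commit_markers(root, marker_name=MARKER_NAME):
    """Write an empty marker file into each immediate subdirectory of
    `root` that does not already contain one.

    Hypothetical helper: assumes each immediate subdirectory of `root`
    is one checkpoint. Returns the list of marker paths it created.
    """
    written = []
    for step_dir in sorted(Path(root).iterdir()):
        if not step_dir.is_dir():
            continue  # skip stray files next to the checkpoint directories
        marker = step_dir / marker_name
        if not marker.exists():
            marker.write_text("")  # assumes only the file's existence matters
            written.append(marker)
    return written
```

Running this on the local checkpoint directory before copying it to GCS means the markers travel with the checkpoints. Nested layouts (for example, per-item subdirectories inside each step) may need the marker at more than one level.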
Thanks for the report, we currently have different behavior for ensuring atomicity on GCS vs. other filesystems. This was sort of a practice that we inherited from earlier code. I will look into standardizing this.
+1 on this. I was having a lot of issues trying to load a checkpoint that was saved locally and copied to GCS, and orbax keeps telling me that the checkpoint is incomplete because of the missing _COMMIT_SUCCESS_FILE file.
Update: our previous intention was to switch to the same logic everywhere, i.e. relying on atomic rename. It is not possible to rely on this for all filesystems, though, so we're instead intending to make it configurable, while defaulting to atomic rename for GCS and internal. This has a higher priority now, to better support cloud users - hopefully will get to it within a month.
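To make the two approaches in this thread concrete, here is a rough sketch of finalizing a checkpoint by rename versus by marker file, and why a directory finalized one way looks incomplete to a check written for the other. This is not orbax's implementation: the function names, the `.tmp` suffix, and the marker name `commit_success.txt` are all hypothetical stand-ins.

```python
import os
from pathlib import Path

MARKER_NAME = "commit_success.txt"  # assumed marker name, not confirmed here
TMP_SUFFIX = ".tmp"                 # hypothetical suffix for in-progress dirs


def finalize_by_rename(tmp_dir, final_dir):
    """Commit by renaming the temporary directory to its final name."""
    os.rename(tmp_dir, final_dir)


def finalize_by_marker(final_dir, marker_name=MARKER_NAME):
    """Commit by writing a marker file inside the checkpoint directory."""
    (Path(final_dir) / marker_name).write_text("")


def is_finalized(path, strategy, marker_name=MARKER_NAME):
    """Return True if `path` looks committed under the given strategy."""
    path = Path(path)
    if strategy == "rename":
        # Committed means: the directory exists under a non-temporary name.
        return path.is_dir() and not path.name.endswith(TMP_SUFFIX)
    if strategy == "marker":
        # Committed means: the marker file is present.
        return (path / marker_name).exists()
    raise ValueError(f"unknown strategy: {strategy!r}")
```

A checkpoint committed with `finalize_by_rename` passes the `"rename"` check but fails the `"marker"` check until a marker is added, which matches the symptom reported above when checkpoints saved locally are copied to GCS.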
because of this line: orbax/checkpoint/orbax/checkpoint/utils.py, line 669 in c7a7fd4