[AIR][Tune] Make trial checkpoint + artifact upload happen atomically #32823
Labels
P2
Important issue, but not time-critical
ray-team-created
Ray Team created
tune
Tune-related issues
#32334 added artifact syncing to cloud.
This happens on every checkpoint so that artifact state is consistent with checkpoint state. However, it's possible for the experiment to crash in between checkpoint upload and artifact upload, which would lead to inconsistency.
In general, uploading to cloud should happen atomically. Right now, if
upload_to_uri(exclude)
is provided, we'll write individual files one at a time. If the upload operation fails somewhere, then it'll retain only a partial checkpoint which may fail on restore/usage later.Question: Should these two happen atomically?
Pros:
Cons:
Remove this TODO after this is resolved:
ray/release/tune_tests/cloud_tests/workloads/run_cloud_test.py
Line 849 in 2fb1bcc
The text was updated successfully, but these errors were encountered: