[AIR][Tune] Make trial checkpoint + artifact upload happen atomically #32823

Open
justinvyu opened this issue Feb 24, 2023 · 0 comments
Labels
P2 Important issue, but not time-critical ray-team-created Ray Team created tune Tune-related issues

justinvyu commented Feb 24, 2023

#32334 added artifact syncing to cloud.

Artifact syncing runs on every checkpoint so that artifact state stays consistent with checkpoint state. However, the experiment can crash between the checkpoint upload and the artifact upload, which would leave the two inconsistent in cloud storage.

In general, uploading to cloud should happen atomically. Right now, if `exclude` is passed to `upload_to_uri`, we write individual files one at a time. If the upload fails partway through, only a partial checkpoint is left behind, which may fail on restore or later use.
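One common way to make a file-by-file upload behave atomically from a reader's point of view is to write a commit marker as the very last step and treat any directory without the marker as incomplete. The sketch below illustrates that idea only. It uses a local directory as a stand-in for cloud storage, and `upload_with_marker`, `is_complete`, and the marker name `.upload_complete` are hypothetical names for illustration, not part of Ray's API.

```python
import shutil
from pathlib import Path

# Hypothetical marker file name; written only after every file has been copied.
COMMIT_MARKER = ".upload_complete"


def upload_with_marker(local_dir: Path, remote_dir: Path, exclude=()) -> None:
    """Copy files one at a time, then write a commit marker as the final step.

    `remote_dir` is a local directory standing in for a cloud URI (assumption).
    `exclude` is a sequence of glob patterns matched against relative paths.
    """
    remote_dir.mkdir(parents=True, exist_ok=True)
    marker = remote_dir / COMMIT_MARKER
    # Drop any stale marker so a re-upload is never seen as complete midway.
    marker.unlink(missing_ok=True)

    for src in sorted(local_dir.rglob("*")):
        rel = src.relative_to(local_dir)
        if src.is_dir() or any(rel.match(pattern) for pattern in exclude):
            continue
        dst = remote_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)

    # Only reached if every copy above succeeded.
    marker.write_text("ok")


def is_complete(remote_dir: Path) -> bool:
    """A reader should only restore from directories that carry the marker."""
    return (remote_dir / COMMIT_MARKER).exists()
```

If the process crashes during the loop, the marker is never written, so a restore path that checks `is_complete` first can skip the partial upload instead of failing on it. The same marker could cover both the checkpoint and its artifacts if it is written only after both uploads finish.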

Question: Should the checkpoint upload and the artifact upload happen atomically?

Pros:

  • Consistent state

Cons:

  • We care more about checkpoints for downstream tasks + training resume, so we may want to upload the checkpoint fully first, and perfect artifact consistency may not matter that much.
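The checkpoint-first alternative described in the Cons item can be sketched as follows: the checkpoint upload must succeed, while the artifact upload is best-effort and its failure is recorded rather than raised. This is a minimal illustration under assumptions. `sync_checkpoint_then_artifacts` and the two callables are hypothetical and do not correspond to existing Tune functions.

```python
from typing import Callable, Dict


def sync_checkpoint_then_artifacts(
    upload_checkpoint: Callable[[], None],
    upload_artifacts: Callable[[], None],
) -> Dict[str, object]:
    """Upload the checkpoint fully first, then try the artifacts.

    Errors from the checkpoint upload propagate to the caller.
    Errors from the artifact upload are captured in the returned status.
    """
    status: Dict[str, object] = {"checkpoint": False, "artifacts": False}

    # The checkpoint matters most for resume and downstream tasks.
    upload_checkpoint()
    status["checkpoint"] = True

    # Artifacts are secondary: a failure here leaves the checkpoint usable.
    try:
        upload_artifacts()
        status["artifacts"] = True
    except Exception as exc:
        status["artifacts_error"] = repr(exc)

    return status
```

With this ordering, a crash or error during artifact upload leaves a complete checkpoint in place, at the cost of artifacts that may lag behind it.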

Remove this TODO after this is resolved:

# TODO(ml-team): Compare to latest checkpoint only after

@justinvyu justinvyu added tune Tune-related issues P2 Important issue, but not time-critical air labels Feb 24, 2023
@Yard1 Yard1 added the ray-team-created Ray Team created label Mar 22, 2023
@anyscalesam anyscalesam removed the air label Oct 28, 2023