[Tune] Sync trial artifacts to cloud #32334
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…/sync_artifacts_to_cloud
To get around this, we recommend saving trial artifacts as separate files with unique filenames.

For example, instead of doing this:

.. code-block:: python

    def appending_train_fn(config):
        for i in range(config["num_epochs"]):
            with open("./artifact.txt", "a") as f:
                f.write(f"Some data about iteration {i}\n")
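
A minimal sketch of the recommended alternative to the appending example above, assuming one file per iteration is acceptable. The function name `unique_files_train_fn` and the `artifact_{i}.txt` naming scheme are illustrative choices, not necessarily what the PR's docs use:

```python
def unique_files_train_fn(config):
    # Write one file per iteration, opened in write mode, so no filename
    # is ever appended to or reused across iterations.
    for i in range(config["num_epochs"]):
        with open(f"./artifact_{i}.txt", "w") as f:
            f.write(f"Some data about iteration {i}\n")
```

Because every iteration produces a distinct filename, an older copy of one artifact cannot overwrite a newer one that happens to share its name.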
We don't want to recommend this generically? I think you want to say specifically that you will want to do this only when you need to append, but for example if the file is to be rewritten (like a checkpoint) it's probably ok right?
Actually, even if it's rewritten, as long as the filename is the same, it'll keep getting overwritten by the driver's old copy of the artifact. So, unique files are kind of needed for restore behavior to work.
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Why are these changes needed?
When a remote upload directory is specified, artifacts logged to the trial directory of remote workers are not uploaded. This PR enables artifact syncing to happen in a similar fashion as checkpoint syncing. Whenever a checkpoint is uploaded, also push artifacts to the cloud. Whenever a trial is restored from the cloud, also pull artifacts. When trials complete/pause, upload artifacts.
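The triggers described above can be summarized as a small mapping from trial events to sync directions. This is an illustrative model only: `TrialEvent` and `artifact_sync_action` are hypothetical names and do not correspond to Ray Tune internals.

```python
from enum import Enum


class TrialEvent(Enum):
    # Hypothetical event names, chosen to mirror the PR description.
    CHECKPOINT_UPLOADED = "checkpoint_uploaded"
    RESTORED_FROM_CLOUD = "restored_from_cloud"
    COMPLETED = "completed"
    PAUSED = "paused"


def artifact_sync_action(event):
    """Return the artifact sync direction that accompanies a trial event."""
    if event is TrialEvent.RESTORED_FROM_CLOUD:
        # Restoring a trial pulls its artifacts down from the cloud.
        return "download"
    # Checkpoint upload, completion, and pause all push artifacts up.
    return "upload"
```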
See user issues:
What is an artifact?
Artifacts include everything contained in the trial level directory that is logged by the trainable itself and is not a trial checkpoint.
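As a rough sketch of this definition, the helper below lists files in a trial directory while skipping checkpoint directories. It assumes checkpoint directories are recognizable by a `checkpoint_` name prefix, which is an assumption made for illustration, and `list_artifacts` is a hypothetical helper. It also cannot tell whether a file was written by the trainable or by the driver, so it only approximates the definition.

```python
from pathlib import Path


def list_artifacts(trial_dir, checkpoint_prefix="checkpoint_"):
    """List non-checkpoint files under trial_dir as relative POSIX paths."""
    trial_dir = Path(trial_dir)
    artifacts = []
    for path in sorted(trial_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(trial_dir)
        # Skip anything that lives under a top-level checkpoint directory
        # (the prefix is an assumed naming convention).
        if rel.parts[0].startswith(checkpoint_prefix):
            continue
        artifacts.append(rel.as_posix())
    return artifacts
```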
More Implementation Details
Where do we perform artifact uploading?
Where do we perform artifact downloading?
What about artifacts created by the driver? Who is responsible for uploading these?
Warn users if artifact syncing takes a long time. Tell them to consider reducing the number of saved artifacts, or disable artifact syncing.
What about Train workers that live on different nodes? This will be a follow-up PR.
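The slow-sync warning mentioned in the list above could look roughly like the following. This is a sketch under stated assumptions: `sync_with_warning`, its `warn_after_s` threshold, and the 10 second default are hypothetical and are not taken from the PR's implementation.

```python
import time
import warnings


def sync_with_warning(sync_fn, warn_after_s=10.0):
    """Run sync_fn and warn if it takes longer than warn_after_s seconds."""
    start = time.monotonic()
    result = sync_fn()
    elapsed = time.monotonic() - start
    if elapsed > warn_after_s:
        warnings.warn(
            f"Artifact syncing took {elapsed:.1f}s. Consider reducing the "
            "number of saved artifacts, or disable artifact syncing."
        )
    return result
```

Any callable that performs the upload or download can be passed as `sync_fn`; its return value is passed through unchanged.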
Known Limitations of this PR
sync_artifacts=False will only disable artifacts from being uploaded from worker nodes. The driver node will still upload any artifacts that exist in its trial dirs. See "[AIR][Tune] Make SyncConfig(sync_artifacts=False) work fully (for all syncing methods)" #32783.
Related issue number
Closes #30071
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.