[Tune] Sync trial artifacts to cloud #32334

Merged
merged 75 commits into ray-project:master on Feb 23, 2023

Conversation

justinvyu (Contributor) commented Feb 8, 2023

Why are these changes needed?

When a remote upload directory is specified, artifacts logged to the trial directories of remote workers are not uploaded. This PR enables artifact syncing to happen in the same fashion as checkpoint syncing: whenever a checkpoint is uploaded, artifacts are also pushed to the cloud; whenever a trial is restored from the cloud, artifacts are also pulled; and when a trial completes or pauses, its artifacts are uploaded.
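
For context, a minimal sketch of the kind of run this targets, assuming the tune.SyncConfig(upload_dir=...) / air.RunConfig API available around this release (the bucket path and training function are hypothetical):

from ray import air, tune
from ray.air import session

def train_fn(config):
    for i in range(3):
        # Anything written into the trial's working directory that is not a
        # checkpoint counts as an "artifact" and is what this PR syncs.
        with open(f"./artifact_{i}.txt", "w") as f:
            f.write(f"data from iteration {i}\n")
        session.report({"score": i})

tuner = tune.Tuner(
    train_fn,
    run_config=air.RunConfig(
        name="sync_artifacts_example",
        sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/experiments"),
    ),
)
tuner.fit()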

See user issues:

What is an artifact?

Artifacts include everything contained in the trial-level directory that is logged by the trainable itself and is not a trial checkpoint.

D = driver syncing
C = checkpoint syncing from trainable
A = artifact syncing from trainable

# Head node
experiment_dir/
	experiment-state.json [D]
	basic-variant-state.json [D]
	callback-state.pkl [D]
	tuner.pkl [D]
	trial_0/
		artifact.pkl [A]
		checkpoint_00000/ [C]
		progress.csv [D]
		results.json [D]
		params.json [D]
		params.pkl [D]
	trial_1/
		progress.csv [D]
		results.json [D]
		params.json [D]
		params.pkl [D]

# Worker node
experiment_dir/
	trial_1/
		artifact.pkl [A]
		checkpoint_00000/ [C]

More Implementation Details

Where do we perform artifact uploading?

  • Paused/stopped trial (Trainable.stop)
  • Checkpointing trial (Trainable.save)

Where do we perform artifact downloading?

  • Trainable.restore
  • Only download artifacts for a trial if we’re loading it from a checkpoint. (This handles the resume_unfinished and resume_errored cases.) A conceptual sketch of these upload/download hook points follows below.
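
A conceptual sketch of these hook points (this is not the actual Ray source; the _save/_load helper names are hypothetical and only mark where syncing happens in the trainable lifecycle):

class ArtifactSyncingTrainable:
    def _save_artifacts_to_cloud(self):
        # Hypothetical helper: push non-checkpoint files in the trial
        # directory to the configured cloud upload dir.
        pass

    def _load_artifacts_from_cloud(self):
        # Hypothetical helper: pull previously synced artifacts back into
        # the local trial directory.
        pass

    def save(self, checkpoint_dir=None):
        # ... create and persist the checkpoint as before ...
        # Whenever a checkpoint is uploaded, push artifacts too.
        self._save_artifacts_to_cloud()

    def stop(self):
        # Upload artifacts one last time when the trial pauses or completes.
        self._save_artifacts_to_cloud()

    def restore(self, checkpoint_path):
        # Pull artifacts only when the trial is loaded from a checkpoint
        # (covers the resume_unfinished / resume_errored cases).
        self._load_artifacts_from_cloud()
        # ... then restore trainable state from the checkpoint as before ...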

What about artifacts created by the driver? Who is responsible for uploading these?

  • For now: exclude the default logger artifacts, and accept double-uploading for user-defined callback artifacts.
  • Follow-up PR:
    • Callbacks should declare which artifacts (folders, files) they will write.
    • Trainables should exclude these driver callback artifacts from their upload, to prevent double-uploading of artifacts for workers on the head node.
    • User-defined callbacks will also need to supply this; if they don’t, we’ll do some double-uploading, which doesn’t break anything – it’s just inefficient.
  • In the future:
    • We can move responsibility for the logger artifacts to the trainable. The driver shouldn’t upload anything in the trial-level directories.
    • This requires more refactoring of callbacks, which is too wide a scope for this PR.

Warn users if artifact syncing takes a long time, and tell them to consider reducing the number of saved artifacts or disabling artifact syncing.

What about Train workers that live on different nodes? This will be a follow-up PR.

Known Limitations of this PR

  1. sync_artifacts=False only disables artifact uploads from worker nodes. The driver node will still upload any artifacts that exist in its trial dirs (see the configuration sketch after this list). See [AIR][Tune] Make SyncConfig(sync_artifacts=False) work fully (for all syncing methods) #32783.
  2. Restoring and continuing to append to an artifact isn't supported currently, and this is highlighted in the documentation. See [AIR][Tune] Enable trial artifacts to be restored from cloud and updated correctly #32782.
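
A hedged example of the setting referenced in limitation 1 (this assumes sync_artifacts is exposed on tune.SyncConfig as described above; the bucket path is hypothetical):

from ray import tune

# Only prevents worker nodes from uploading artifacts; per limitation 1,
# the driver still uploads any artifacts found in its local trial dirs.
sync_config = tune.SyncConfig(
    upload_dir="s3://my-bucket/experiments",
    sync_artifacts=False,
)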

Related issue number

Closes #30071

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Comment on lines 274 to 284
To get around this, we recommend saving trial artifacts as separate files with unique filenames.

For example, instead of doing this:

.. code-block:: python

    def appending_train_fn(config):
        for i in range(config["num_epochs"]):
            with open("./artifact.txt", "a") as f:
                f.write(f"Some data about iteration {i}\n")

Contributor

We don't want to recommend this generically? I think you want to say specifically that you will want to do this only when you need to append, but for example if the file is to be rewritten (like a checkpoint) it's probably ok right?

Contributor Author

Actually, even if it's rewritten, as long as the filename is the same, it'll keep getting overwritten by the driver's old copy of the artifact. So, unique files are kind of needed for restore behavior to work.
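
For illustration, a sketch of the unique-filename pattern under discussion (an assumption about what the recommended snippet looks like, not a verbatim excerpt from the docs):

def unique_artifact_train_fn(config):
    for i in range(config["num_epochs"]):
        # One file per iteration, so a restored trial never appends to or
        # overwrites an artifact already uploaded by an earlier run.
        with open(f"./artifact_{i}.txt", "w") as f:
            f.write(f"Some data about iteration {i}\n")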

justinvyu and others added 4 commits February 22, 2023 13:58
@richardliaw richardliaw merged commit 1e7d2da into ray-project:master Feb 23, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[tune] Sync artifacts in trial/ directories to cloud
6 participants