[1/n] Lightweight Ray AIR API refactor #36706
Conversation
Nice. Is this actually the bulk of the changes (aside from the checkpoint class)? I guess I can think of the following follow-ups:
- Make it so you can run any trainable func/class with Trainer.
- Update docs/deprecate old APIs.
After this is merged, I will start working on migrating our examples / deprecation, but I don't expect it to land until after the 2.6 branch cut :)
```python
usage_lib.record_library_usage("train")

__all__ = [
    "BackendConfig",
    "Checkpoint",
    "CheckpointConfig",
```
Do we want to avoid moving the Checkpoint class if we likely will remove it?
Alright, I will remove them.
python/ray/train/__init__.py (outdated)
"get_dataset_shard", | ||
"get_trial_resources", | ||
"get_world_rank", | ||
"get_world_size", |
Should these accessors be part of session only?
I want to get rid of the `session` namespace going forward. It serves no purpose.
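For concreteness, a minimal sketch of the two import styles under discussion. It assumes the accessors end up re-exported at the top level of `ray.train`, as in the `__all__` diff above; either way, the accessors have to be called from inside a training function.

```python
# Current style: per-worker accessors live in the ray.air session namespace.
from ray.air import session

def train_func_old():
    rank = session.get_world_rank()

# Style in this PR's diff: the same accessors re-exported from ray.train.
from ray.train import get_world_rank

def train_func_new():
    rank = get_world_rank()
```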
I feel like we should do that atomically then; it seems odd to move just a few specific methods initially. It is also a big change, and I'm not clear whether the pros outweigh the cons.
So I don't think `get_local_rank` and `get_local_world_size` are currently being used anywhere (they are internal APIs and are used to automatically set an environment variable for `TorchTrainer`), so I'd like to get rid of them. Should I be wrong, I'm happy to add them back later. The same goes for `get_node_rank` (which is also very confusing, btw). Btw, the docstrings for all of these except `get_local_rank` are wrong too.

For `get_trial_dir`, `get_trial_id`, `get_trial_name`, and `get_experiment_name`, I would first like to understand how your checkpointing changes are changing things, but I'm happy to add them back if you think they will keep being the same.

`get_checkpoint` will probably become irrelevant.

That's already all the methods in `session`.
So these are all important utility methods. The more general concern is that there will always be more and more of these small things; see `ray.get_runtime_context()` for example (I believe `session` should be renamed to something like `context`, IMO).

We should plan for a large number of utility methods existing here in the future, and I don't think it's great to have them directly scoped under `train`.
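For comparison, a small sketch of the Ray Core pattern being referenced, where per-process accessors hang off a single runtime-context object rather than being individual top-level functions:

```python
import ray

ray.init()

# Ray Core groups its per-process accessors under one context object
# instead of exposing each of them as a separate top-level function.
ctx = ray.get_runtime_context()
print(ctx.get_job_id())
print(ctx.get_node_id())
```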
Yeah, my ideal API for this would be:

```python
def train_loop_per_worker(context):
    # Use context.report(...), context.get_dataset_shard(...), etc.
    ...
```

Let me look into whether this can be made to happen in a compatible way.
+1 to renaming `Session` to `*Context`.

I don't really see a benefit of passing it into the `train_loop_per_worker` or `trainable`, so I would prefer not to unless there is a strong enough reason for this.
Alright, the plan changed to renaming `from ray.air import session` -> `from ray.train import context` :)
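A rough sketch of what that means for user code, based on the usage shown in the PR description below; the exact split between top-level functions and context methods was still being settled at this point, and the metric values are placeholders.

```python
from ray import train

def train_func():
    # Reporting and dataset access stay as top-level ray.train functions ...
    dataset_shard = train.get_dataset_shard("train")
    train.report({"loss": 0.0})  # placeholder metrics

    # ... while per-worker info is read off the context object.
    world_size = train.get_context().get_world_size()
    rank = train.get_context().get_world_rank()
```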
python/ray/train/__init__.py (outdated)
"TrainingIterator", | ||
"TRAIN_DATASET_KEY", | ||
"get_dataset_shard", | ||
"get_trial_resources", |
This is a Tune concept.
We're moving this to Train as well, right? You cannot launch trials without resources.
Yes, this is a Train concept and is being used in Train examples (e.g., to set the number of threads for trainers).
Maybe we can discuss this along with:

- Make it so you can run any trainable func/class with Trainer.

Right now the interfaces are:

`ScalingConfig` (Train) -> `PlacementGroupFactory` (Tune) -> `PlacementGroup` (Core)

I would say that the usage in Train examples is a misuse of the API and extremely confusing. For setting the number of threads, this is being used as a proxy when we actually just want an API to get the number of CPUs assigned to the `RayTrainWorker` actor.
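For illustration, a minimal sketch of the proxy pattern being called out, assuming the current `ray.air.session` API; the bundle indexing and the thread-count use are illustrative rather than copied from a specific example.

```python
from ray.air import session

def train_loop_per_worker(config):
    # The trial's resources come back as a Tune PlacementGroupFactory ...
    trial_resources = session.get_trial_resources()
    # ... and example code digs into its bundles to infer the CPUs per worker,
    num_cpus = int(trial_resources.bundles[-1].get("CPU", 1))
    # then uses that as a proxy for something like a thread-pool size, instead
    # of asking directly how many CPUs this RayTrainWorker actor was assigned.
    num_threads = num_cpus
```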
Yeah, the placement group stuff also needs to move in this case.
This PR removes some circularities in the Ray AIR import system so we can put the training-related functions into `ray.train`. It introduces a training context and makes `report`, `get_dataset_shard`, `Checkpoint`, `Result`, and the following configs available in `ray.train`:
- CheckpointConfig
- DataConfig
- FailureConfig
- RunConfig
- ScalingConfig

No user-facing changes yet; the old APIs still work. Going forward, it will be most consistent / symmetrical if these things are included in the following way:

```python
from ray import train, tune, serve  # Pick the subset that is needed

# Include what you need from the following:
from ray.train import CheckpointConfig, DataConfig, FailureConfig, RunConfig, ScalingConfig

# ...

def train_func():
    dataset_shard = train.get_dataset_shard("train")
    world_size = train.get_context().get_world_size()
    # ...
    train.report(...)

trainer = train.torch.TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

We have many examples in ray-project#37123 of how this looks in actual code.

Signed-off-by: Bhavpreet Singh <singh.bhavpreet00@gmail.com>
This PR migrates all the Train and Tune examples and docstrings to the new API convention; see https://github.com/ray-project/enhancements/

Continuation of #36706 and #37906

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
…#37023) Having multiple sessions floating around is confusing, and we are going to replace the session concept with a unified context object between Train and Tune going forward (see ray-project#36706). The changes in detail:
- Remove the `Session` interface class -- we are not planning to expose it to the user, and it just introduces an additional level of abstraction that is not needed / not aligned with the longer-term plan of having a unified context object between Train and Tune.
- Remove the `_TrainSessionImpl` and `_TuneSessionImpl` and instead push the functionality down into the `_StatusReporter` and the `_TrainSession` -- we might want to rename `_StatusReporter` to `_TuneSession` to be more consistent.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
This PR removes some circularities in the Ray AIR import system so we can put the training-related functions into `ray.train`. It introduces a training context and makes `report`, `get_dataset_shard`, `get_context`, `Checkpoint`, `Result`, and the following configs available in `ray.train`:
- CheckpointConfig
- DataConfig
- FailureConfig
- RunConfig
- ScalingConfig

No user-facing changes yet; the old APIs still work.

Going forward, it will be most consistent / symmetrical if these things are included in the following way:
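```python
from ray import train, tune, serve  # Pick the subset that is needed

# Include what you need from the following:
from ray.train import CheckpointConfig, DataConfig, FailureConfig, RunConfig, ScalingConfig

# ...

def train_func():
    dataset_shard = train.get_dataset_shard("train")
    world_size = train.get_context().get_world_size()
    # ...
    train.report(...)

trainer = train.torch.TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```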
We have many examples in #37123 of how this looks in actual code.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.