[1/n] Lightweight Ray AIR API refactor #36706
Conversation
Nice. Is this actually the bulk of the changes (aside from the checkpoint class)? I guess I can think of the following follow-ups:
- Make it so you can run any trainable func/class with Trainer.
- Update docs/deprecate old APIs.
After this is merged, I will start working on migrating our examples / deprecation, but I don't expect it to land until after the 2.6 branch cut :)
```python
usage_lib.record_library_usage("train")

__all__ = [
    "BackendConfig",
    "Checkpoint",
    "CheckpointConfig",
```
Do we want to avoid moving the Checkpoint class if we likely will remove it?
Alright, I will remove them.
python/ray/train/__init__.py (outdated)
"get_dataset_shard", | ||
"get_trial_resources", | ||
"get_world_rank", | ||
"get_world_size", |
Should these accessors be part of session only?
I want to get rid of the `session` namespace going forward. It serves no purpose.
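For concreteness, a minimal sketch of the two import styles under discussion. It assumes the accessors end up re-exported at the top level of `ray.train`, as in the `__all__` diff above; either way, the accessors have to be called from inside a training function.

```python
# Current style: per-worker accessors live in the ray.air session namespace.
from ray.air import session

def train_func_old():
    rank = session.get_world_rank()

# Style in this PR's diff: the same accessors re-exported from ray.train.
from ray.train import get_world_rank

def train_func_new():
    rank = get_world_rank()
```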
I feel like we should do that atomically then; it seems odd to move just a few specific methods initially. It is also a big change, and I'm not clear whether the pros outweigh the cons.
So I don't think `get_local_rank` and `get_local_world_size` are currently being used anywhere (they are internal APIs and are used to automatically set an environment variable for `TorchTrainer`), so I'd like to get rid of them. Should I be wrong, I'm happy to add them back later. The same goes for `get_node_rank` (which is also very confusing, btw). Btw, the docstrings for all of these except `get_local_rank` are wrong too.

For `get_trial_dir`, `get_trial_id`, `get_trial_name`, and `get_experiment_name`, I would first like to understand how your checkpointing changes are changing things, but I'm happy to add them back if you think they will keep being the same.

`get_checkpoint` will probably become irrelevant.

That's already all the methods in `session`.
So these are all important utility methods. The more general concern is that there will always be more and more of these small things; see `ray.get_runtime_context()` for example (I believe `session` should be renamed to something like `context`, IMO).

We should plan for a large number of utility methods existing here in the future, and I don't think it's great to have them directly scoped under `train`.
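For comparison, a small sketch of the Ray Core pattern being referenced, where per-process accessors hang off a single runtime-context object rather than being individual top-level functions:

```python
import ray

ray.init()

# Ray Core groups its per-process accessors under one context object
# instead of exposing each of them as a separate top-level function.
ctx = ray.get_runtime_context()
print(ctx.get_job_id())
print(ctx.get_node_id())
```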
Yeah, my ideal API for this would be:

```python
def train_loop_per_worker(context):
    # Use context.report(...), context.get_dataset_shard(...), etc.
    ...
```

Let me look into whether this can be made to happen in a compatible way.
+1 to renaming `Session` to `*Context`.

I don't really see a benefit of passing it into the `train_loop_per_worker` or `trainable`, so I would prefer not to unless there is a strong enough reason for this.
Alright, the plan changed to renaming `from ray.air import session` -> `from ray.train import context` :)
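A rough sketch of what that means for user code, based on the usage shown in the PR description below; the exact split between top-level functions and context methods was still being settled at this point, and the metric values are placeholders.

```python
from ray import train

def train_func():
    # Reporting and dataset access stay as top-level ray.train functions ...
    dataset_shard = train.get_dataset_shard("train")
    train.report({"loss": 0.0})  # placeholder metrics

    # ... while per-worker info is read off the context object.
    world_size = train.get_context().get_world_size()
    rank = train.get_context().get_world_rank()
```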
python/ray/train/__init__.py (outdated)
"TrainingIterator", | ||
"TRAIN_DATASET_KEY", | ||
"get_dataset_shard", | ||
"get_trial_resources", |
This is a Tune concept.
We're moving this to Train as well, right? You cannot launch trials without resources.
Yes, this is a Train concept and is being used in Train examples (e.g., to set the number of threads for trainers).
Maybe we can discuss this along with:

- Make it so you can run any trainable func/class with Trainer.

Right now the interfaces are:

`ScalingConfig` (Train) -> `PlacementGroupFactory` (Tune) -> `PlacementGroup` (Core)

I would say that the usage in Train examples is a misuse of the API and extremely confusing. For setting the number of threads, this is being used as a proxy when we actually just want an API to get the number of CPUs assigned to the `RayTrainWorker` actor.
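For illustration, a minimal sketch of the proxy pattern being called out, assuming the current `ray.air.session` API; the bundle indexing and the thread-count use are illustrative rather than copied from a specific example.

```python
from ray.air import session

def train_loop_per_worker(config):
    # The trial's resources come back as a Tune PlacementGroupFactory ...
    trial_resources = session.get_trial_resources()
    # ... and example code digs into its bundles to infer the CPUs per worker,
    num_cpus = int(trial_resources.bundles[-1].get("CPU", 1))
    # then uses that as a proxy for something like a thread-pool size, instead
    # of asking directly how many CPUs this RayTrainWorker actor was assigned.
    num_threads = num_cpus
```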
Yeah, the placement group stuff also needs to move in this case.
This PR removes some circularities in the Ray AIR import system so we can put the training-related functions into `ray.train`. It introduces a training context and makes `report`, `get_dataset_shard`, `Checkpoint`, `Result`, and the following configs available in `ray.train`:
- CheckpointConfig
- DataConfig
- FailureConfig
- RunConfig
- ScalingConfig

No user-facing changes yet; the old APIs still work. Going forward, it will be most consistent / symmetrical if these things are included in the following way:

```python
from ray import train, tune, serve  # Pick the subset that is needed

# Include what you need from the following:
from ray.train import CheckpointConfig, DataConfig, FailureConfig, RunConfig, ScalingConfig

# ...

def train_func():
    dataset_shard = train.get_dataset_shard("train")
    world_size = train.get_context().get_world_size()
    # ...
    train.report(...)

trainer = train.torch.TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

We have many examples in ray-project#37123 of how this looks in actual code.

Signed-off-by: Bhavpreet Singh <singh.bhavpreet00@gmail.com>
This PR migrates all the Train and Tune examples and docstrings to the new API convention; see https://github.com/ray-project/enhancements/

Continuation of #36706 and #37906

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
…#37023) Having multiple sessions floating around is confusing, and we are going to replace the session concept with a unified context object between Train and Tune going forward (see ray-project#36706). The changes in detail:
- Remove the `Session` interface class -- we are not planning to expose it to the user, and it just introduces an additional level of abstraction that is not needed / not aligned with the longer-term plan of having a unified context object between Train and Tune.
- Remove the `_TrainSessionImpl` and `_TuneSessionImpl` and instead push the functionality down into the `_StatusReporter` and the `_TrainSession` -- we might want to rename `_StatusReporter` to `_TuneSession` to be more consistent.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
This PR removes some circularities in the Ray AIR import system so we can put the training-related functions into `ray.train`. It introduces a training context and makes `report`, `get_dataset_shard`, `get_context`, `Checkpoint`, `Result`, and the following configs available in `ray.train`:
- CheckpointConfig
- DataConfig
- FailureConfig
- RunConfig
- ScalingConfig

No user-facing changes yet; the old APIs still work.

Going forward, it will be most consistent / symmetrical if these things are included in the following way:
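```python
from ray import train, tune, serve  # Pick the subset that is needed

# Include what you need from the following:
from ray.train import CheckpointConfig, DataConfig, FailureConfig, RunConfig, ScalingConfig

# ...

def train_func():
    dataset_shard = train.get_dataset_shard("train")
    world_size = train.get_context().get_world_size()
    # ...
    train.report(...)

trainer = train.torch.TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```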
We have many examples in #37123 of how this looks in actual code.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.