
[1/n] Lightweight Ray AIR API refactor #36706

Merged

Conversation

pcmoritz
Contributor

@pcmoritz pcmoritz commented Jun 22, 2023

Why are these changes needed?

This PR removes some circularities in the Ray AIR import system so we can put the training-related functions into `ray.train`. It introduces a training context and makes report, get_dataset_shard, get_context, Checkpoint, Result, and the following configs:

  • CheckpointConfig
  • DataConfig
  • FailureConfig
  • RunConfig
  • ScalingConfig

available in `ray.train`. There are no user-facing changes yet; the old APIs still work.

Going forward, it will be most consistent and symmetrical if these things are imported in the following way:

```python
from ray import train, tune, serve # Pick the subset that is needed
# Include what you need from the following:
from ray.train import CheckpointConfig, DataConfig, FailureConfig, RunConfig, ScalingConfig

# ...

def train_func():
    dataset_shard = train.get_dataset_shard("train")
    world_size = train.get_context().get_world_size()
    # ...
    train.report(...)

trainer = train.torch.TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

We have many examples in #37123 of what this looks like in actual code.
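
For comparison, here is a rough sketch of the same loop written against the current `ray.air.session` API, which keeps working unchanged after this PR (imports as used in existing examples; the reported metrics dict is a placeholder):

```python
# Rough sketch of the pre-refactor equivalent using the ray.air session API
# (these imports still work after this PR; nothing is deprecated yet).
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    dataset_shard = session.get_dataset_shard("train")
    world_size = session.get_world_size()
    # ...
    session.report({"loss": 0.0})  # placeholder metrics

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```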

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pcmoritz pcmoritz changed the title [WIP] Lightweight Ray AIR API refactor [1/n] Lightweight Ray AIR API refactor Jun 26, 2023
Contributor

@ericl ericl left a comment

Nice. Is this actually the bulk of the changes (aside from the checkpoint class)? I guess I can think of the following follow-ups:

  • Make it so you can run any trainable func/class with Trainer.
  • Update docs/deprecate old APIs.

@pcmoritz
Contributor Author

pcmoritz commented Jun 27, 2023

This is what I have seen so far from Goku's course. There is also the Result type that is currently still in AIR.

> Make it so you can run any trainable func/class with Trainer.

Let's chat more about this, I'm actually not super sure what needs to be done here.

After this is merged, I will start working on migrating our examples / deprecation, but I don't expect it to land until after the 2.6 branch cut :)
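
For reference, a minimal sketch of how the Result type mentioned above is consumed today (attribute names per `ray.air.result.Result`; this is illustrative and not part of this PR):

```python
from ray.air.result import Result

def summarize(result: Result) -> None:
    # The pieces of a Result that examples typically read.
    print(result.metrics)     # last reported metrics dict
    print(result.checkpoint)  # Checkpoint from the run, or None
    print(result.error)       # exception if the run errored, else None
```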


```python
usage_lib.record_library_usage("train")

__all__ = [
    "BackendConfig",
    "Checkpoint",
    "CheckpointConfig",
```
Contributor

Do we want to avoid moving the Checkpoint class if we likely will remove it?

Contributor Author

Alright, I will remove them.

"get_dataset_shard",
"get_trial_resources",
"get_world_rank",
"get_world_size",
Contributor

Should these accessors be part of session only?

Contributor Author

I want to get rid of the session namespace going forward. It serves no purpose.

Contributor

I feel like we should do that atomically then; it seems odd to move just a few specific methods initially. It is also a big change, and I'm not sure the pros outweigh the cons.

Contributor Author

I don't think get_local_rank and get_local_world_size are currently being used anywhere (they are internal APIs, used to automatically set an environment variable for TorchTrainer), so I'd like to get rid of them. If I'm wrong, I'm happy to add them back later. The same goes for get_node_rank (which is also very confusing, by the way). Incidentally, the docstrings for all of these except get_local_rank are wrong too.

For get_trial_dir, get_trial_id, get_trial_name, and get_experiment_name, I would first like to understand how your checkpointing changes affect things, but I'm happy to add them back if you think they will stay the same.

get_checkpoint will probably become irrelevant.

That's already all the methods in session.

Contributor

So these are all important utility methods. The more general concern is that there will always be more and more of these small things; see ray.get_runtime_context() for example (I believe session should be renamed to something like context, IMO).

We should plan for a large number of utility methods existing here in the future, and I don't think it's great to have them directly scoped under train.

Contributor Author

@pcmoritz pcmoritz Jun 27, 2023

Yeah, my ideal API for this would be

```python
def train_loop_per_worker(context):
    # Use context.report(...), context.get_dataset_shard(...), etc.
    ...
```

Let me look into whether this can be made to happen in a compatible way.
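
One sketch of what a compatible path could look like (purely illustrative, not what this PR implements; the helper name and dispatch rules are assumptions): inspect the user function's signature and only pass the context when it declares a parameter for it.

```python
import inspect

def _call_train_loop(train_loop_per_worker, context, config=None):
    # Hypothetical dispatch helper: only pass the context when the user's
    # function declares a parameter for it, so existing signatures keep working.
    params = list(inspect.signature(train_loop_per_worker).parameters)
    if "context" in params:
        return train_loop_per_worker(context)
    if params and config is not None:
        # Existing convention: a single-argument train loop receives the config dict.
        return train_loop_per_worker(config)
    return train_loop_per_worker()
```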

Contributor

+1 to renaming Session to *Context.

I don't really see a benefit of passing it into the train_loop_per_worker or trainable, so I would prefer not to unless there is a strong enough reason for this.

Contributor Author

Alright, the plan changed to renaming `from ray.air import session` -> `from ray.train import context` :)
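
A small sketch of what that rename means for user code, using the get_context() accessor introduced in this PR (both calls are only valid inside a training function run by a Trainer):

```python
from ray import train
from ray.air import session

def train_func():
    # Before: the ray.air session namespace (still works after this PR)
    world_size = session.get_world_size()

    # After: the training context introduced by this PR
    world_size = train.get_context().get_world_size()
```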

"TrainingIterator",
"TRAIN_DATASET_KEY",
"get_dataset_shard",
"get_trial_resources",
Contributor

This is a Tune concept.

Contributor

We're moving this to Train as well, right? You cannot launch trials without resources.

Contributor Author

Yes, this is a Train concept and is being used in Train examples (e.g. to set the number of threads for trainers).

Contributor

Maybe we can discuss this along with:

> Make it so you can run any trainable func/class with Trainer.

Right now the interfaces are:

ScalingConfig (Train) -> PlacementGroupFactory (Tune) -> PlacementGroup (Core)

I would say that the usage in Train examples is a misuse of the API and extremely confusing. For setting the number of threads, this is being used as a proxy when we actually just want an API to get the number of CPUs assigned to the RayTrainWorker actor.
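
For concreteness, a sketch of the workaround being described (illustrative only; the bundle indexing and the commented-out accessor at the end are assumptions, not existing APIs):

```python
import torch
from ray import train

def train_func():
    # Today's workaround: derive a thread count from the trial's resource
    # request (a PlacementGroupFactory), since there is no direct accessor.
    resources = train.get_trial_resources()  # exported in this PR's __all__
    num_cpus = int(resources.bundles[-1].get("CPU", 1))  # illustrative bundle indexing
    torch.set_num_threads(num_cpus)

    # What the comment argues for instead: a direct accessor for the CPUs
    # assigned to this RayTrainWorker actor (hypothetical, does not exist).
    # num_cpus = train.get_context().get_assigned_cpus()
```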

Contributor

Yeah, the placement group stuff also needs to move in that case.

@pcmoritz pcmoritz merged commit dd029b6 into ray-project:master Jul 8, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 11, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 28, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 28, 2023
pcmoritz added a commit that referenced this pull request Aug 2, 2023
This PR migrates all the train and tune examples and docstrings to the new API convention, see https://github.com/ray-project/enhancements/

Continuation of #36706 and #37906

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…#37023)

Having multiple sessions floating around is confusing and we are going to replace the session concept with a unified context object between train and tune going forward (see ray-project#36706)

The changes in detail:

- Remove the `Session` interface class -- we are not planning to expose it to the user and it just introduces an additional level of abstraction that is not needed / not aligned with the longer term plan of having a unified context object between train and tune

- Remove the `_TrainSessionImpl` and `_TuneSessionImpl` and instead push the functionality down into the `_StatusReporter` and the `_TrainSession` -- we might want to rename `_StatusReporter` to `_TuneSession` to be more consistent.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023