
[Train] Implement AccelerateTrainer #33269

Merged: 43 commits from the accelerate_trainer_2 branch into ray-project:master on Mar 23, 2023

Conversation

@Yard1 (Member) commented Mar 13, 2023:

Why are these changes needed?

Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.
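
For context, here is a minimal sketch of the intended usage; the import path and argument names are best-effort assumptions based on the PR description and the signature discussed below, not copied from the diff:

from ray.air import ScalingConfig, session
from ray.train.huggingface import AccelerateTrainer  # assumed import path

def train_func(config):
    # Runs on each Ray Train worker. Accelerator() picks up the distributed
    # environment that AccelerateTrainer sets up for the worker group.
    from accelerate import Accelerator

    accelerator = Accelerator()
    # ... build model/optimizer/dataloader, then wrap them:
    # model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    session.report({"done": 1})

trainer = AccelerateTrainer(
    train_loop_per_worker=train_func,
    accelerate_config=None,  # or a path to a YAML produced by `accelerate config`
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()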

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Yard1 added 7 commits March 13, 2023 20:27
Yard1 added 9 commits March 13, 2023 23:49
Yard1 added 10 commits March 14, 2023 20:53
@Yard1 marked this pull request as ready for review March 15, 2023 22:04
@Yard1 requested a review from richardliaw as a code owner March 15, 2023 22:04
Yard1 added 8 commits March 16, 2023 17:03
@@ -1,6 +1,7 @@
# Production requirements. This is what readthedocs.org picks up

# Python / ML libraries
accelerate>=0.17.0

Member:

why do we need the actual accelerate library for the docs?

Member Author (@Yard1):

This is because the python files are imported during the doc build. If a library is missing, the doc build fails. The alternative is to mock the module, but some modules like accelerate, which do unorthodox stuff on import, can still fail even with mocking.
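
For context, "mock the module" here refers to Sphinx's autodoc_mock_imports mechanism; a minimal sketch of what that alternative would look like (module list illustrative):

# conf.py (Sphinx configuration)
# Replace heavyweight imports with mock objects during the docs build,
# so autodoc can import modules that depend on them.
autodoc_mock_imports = ["accelerate", "deepspeed"]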

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3

Member:

turn these into flags?
alternatively, they should be ALL_CAPS if these are constants

Member Author (@Yard1):

this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py

Member:

so they are all wrong then: https://peps.python.org/pep-0008/#constants
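
For reference, the PEP 8-conformant spelling of the constants quoted above would be:

# PEP 8: module-level constants are named in ALL_CAPS with underscores.
INPUT_SIZE = 1
LAYER_SIZE = 15
OUTPUT_SIZE = 1
NUM_EPOCHS = 3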

print(f"epoch: {epoch}, loss: {loss.item()}")

session.report(
{},

Member:

can you add comments so we know what this empty dict is?
it's the metrics dict, right? maybe it's better to send the loss as a metric

Member Author (@Yard1):

this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py

Member:

wait, why do we want to carry "unclear" stuff into all new examples?
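
What the reviewer is suggesting would look roughly like this (a sketch; loss and epoch are the loop variables from the excerpt above):

# Report the training loss back to Ray Train as a metric
# instead of passing an empty metrics dict.
session.report({"loss": loss.item(), "epoch": epoch})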

doc/source/ray-air/trainers.rst (outdated conversation, resolved)
python/ray/train/BUILD (outdated conversation, resolved)
.. note::

You need to use ``session.report()`` to communicate results and checkpoints
back to Ray Train.

Member:

Ray Tune?

Member Author (@Yard1) commented Mar 21, 2023:

Those are Train docs :) The fact we are using Ray Tune is an implementation detail users don't need to know about here
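
For completeness, session.report() is also how checkpoints travel back from the workers; a hedged sketch against the AIR API of that era (the state-dict key is illustrative):

from ray.air import session
from ray.air.checkpoint import Checkpoint

# Metrics and an optional checkpoint are reported together; whether
# Ray Tune sits underneath is invisible to the training loop.
session.report(
    {"loss": loss.item()},
    checkpoint=Checkpoint.from_dict({"model_state": model.state_dict()}),
)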

layer_size = 32
output_size = 1
num_epochs = 200
num_workers = 3

Member:

same comment here. can we not duplicate this code?

Member Author (@Yard1):

This is the same as in the TorchTrainer docstring

Member:

can we come in and update all of them at once?

train_loop_per_worker: Union[Callable[[], None], Callable[[Dict], None]],
*,
train_loop_config: Optional[Dict] = None,
accelerate_config: Optional[Union[dict, str, Path, os.PathLike]] = None,

Member:

this looks a bit weird - why is the new config param placed second?
move it to the top?

Member Author (@Yard1):

to me it felt nice to have train_loop_per_worker and train_loop_config together, like in the other Trainers - those are all keyword-only arguments regardless (you can't pass them positionally)
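
Because of the bare * in the signature, everything after train_loop_per_worker has to be passed by keyword anyway, so its position in the signature is cosmetic. An illustrative call (the argument values are made up):

trainer = AccelerateTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"lr": 1e-3},              # hypothetical config
    accelerate_config="accelerate_config.yaml",  # hypothetical path
)
# AccelerateTrainer(train_func, {"lr": 1e-3}) would raise a TypeError:
# keyword-only arguments cannot be passed positionally.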

# If changing the version here, also change it in AccelerateTrainer docstring
accelerate==0.5.1; python_version <= '3.6'
accelerate==0.17.1; python_version > '3.6'
deepspeed; python_version > '3.6'

Member:

hmm, now deepspeed will always get installed regardless of whether users actually need it or not?

Member Author (@Yard1):

Only for CI - this should not be included in the Docker images.

It's actually very lightweight, as it does lazy compilation (the package by itself weighs almost nothing and has only a couple of dependencies).

@Yard1 (Member Author) commented Mar 21, 2023:

@gjoliver regarding the doc changes - since those would touch multiple docs & docstrings, how about we do that in a follow-up PR?

@gjoliver (Member):

> @gjoliver regarding the doc changes - since those would touch multiple docs & docstrings, how about we do that in a follow-up PR?

ah, ok, I read the comments in GitHub order, so I didn't see this until now.
yeah, a follow-up PR sounds good to me.

@gjoliver (Member) left a review comment:

ok, I think we need to do some cleanups for some of the examples,
but those can be done in a follow-up PR.

Yard1 added 3 commits March 21, 2023 23:04
@Yard1 merged commit 1ed7c5a into ray-project:master Mar 23, 2023
@Yard1 deleted the accelerate_trainer_2 branch March 23, 2023 00:19
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Labels: none yet · Projects: none yet · 6 participants