
[Train] Implement AccelerateTrainer #33269

Merged: 43 commits from the accelerate_trainer_2 branch into ray-project:master on Mar 23, 2023

Conversation

@Yard1 (Member) commented Mar 13, 2023:

Why are these changes needed?

Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.
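
For context, here is a minimal sketch of the intended usage; the import path and argument names are best-effort assumptions based on the PR description and the signature discussed below, not copied from the diff:

from ray.air import ScalingConfig, session
from ray.train.huggingface import AccelerateTrainer  # assumed import path

def train_func(config):
    # Runs on each Ray Train worker. Accelerator() picks up the distributed
    # environment that AccelerateTrainer sets up for the worker group.
    from accelerate import Accelerator

    accelerator = Accelerator()
    # ... build model/optimizer/dataloader, then wrap them:
    # model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    session.report({"done": 1})

trainer = AccelerateTrainer(
    train_loop_per_worker=train_func,
    accelerate_config=None,  # or a path to a YAML produced by `accelerate config`
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()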

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Yard1 added 7 commits March 13, 2023 20:27
Yard1 added 9 commits March 13, 2023 23:49
Yard1 added 10 commits March 14, 2023 20:53
@Yard1 marked this pull request as ready for review March 15, 2023 22:04
@Yard1 requested a review from richardliaw as a code owner March 15, 2023 22:04
Yard1 added 8 commits March 16, 2023 17:03
@@ -1,6 +1,7 @@
# Production requirements. This is what readthedocs.org picks up

# Python / ML libraries
accelerate>=0.17.0

Member:

why do we need the actual accelerate library for the docs?

Member Author (@Yard1):

This is because the python files are imported during the doc build. If a library is missing, the doc build fails. The alternative is to mock the module, but some modules like accelerate, which do unorthodox stuff on import, can still fail even with mocking.
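
For context, "mock the module" here refers to Sphinx's autodoc_mock_imports mechanism; a minimal sketch of what that alternative would look like (module list illustrative):

# conf.py (Sphinx configuration)
# Replace heavyweight imports with mock objects during the docs build,
# so autodoc can import modules that depend on them.
autodoc_mock_imports = ["accelerate", "deepspeed"]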

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3

Member:

turn these into flags?
alternatively, they should be ALL_CAPS if these are constants

Member Author (@Yard1):

this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py

Member:

so they are all wrong then: https://peps.python.org/pep-0008/#constants
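
For reference, the PEP 8-conformant spelling of the constants quoted above would be:

# PEP 8: module-level constants are named in ALL_CAPS with underscores.
INPUT_SIZE = 1
LAYER_SIZE = 15
OUTPUT_SIZE = 1
NUM_EPOCHS = 3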

print(f"epoch: {epoch}, loss: {loss.item()}")

session.report(
{},

Member:

can you add comments so we know what this empty dict is?
it's the metrics dict, right? maybe it's better to send the loss as a metric

Member Author (@Yard1):

this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py

Member:

wait, why do we want to carry "unclear" stuff into all new examples?
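
What the reviewer is suggesting would look roughly like this (a sketch; loss and epoch are the loop variables from the excerpt above):

# Report the training loss back to Ray Train as a metric
# instead of passing an empty metrics dict.
session.report({"loss": loss.item(), "epoch": epoch})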

doc/source/ray-air/trainers.rst (outdated conversation, resolved)
python/ray/train/BUILD (outdated conversation, resolved)
.. note::

You need to use ``session.report()`` to communicate results and checkpoints
back to Ray Train.

Member:

Ray Tune?

Member Author (@Yard1) commented Mar 21, 2023:

Those are Train docs :) The fact we are using Ray Tune is an implementation detail users don't need to know about here
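
For completeness, session.report() is also how checkpoints travel back from the workers; a hedged sketch against the AIR API of that era (the state-dict key is illustrative):

from ray.air import session
from ray.air.checkpoint import Checkpoint

# Metrics and an optional checkpoint are reported together; whether
# Ray Tune sits underneath is invisible to the training loop.
session.report(
    {"loss": loss.item()},
    checkpoint=Checkpoint.from_dict({"model_state": model.state_dict()}),
)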

layer_size = 32
output_size = 1
num_epochs = 200
num_workers = 3

Member:

same comment here. can we not duplicate this code?

Member Author (@Yard1):

This is the same as in the TorchTrainer docstring

Member:

can we come in and update all of them at once?

train_loop_per_worker: Union[Callable[[], None], Callable[[Dict], None]],
*,
train_loop_config: Optional[Dict] = None,
accelerate_config: Optional[Union[dict, str, Path, os.PathLike]] = None,

Member:

this looks a bit weird - why is the new config param placed second?
move it to the top?

Member Author (@Yard1):

to me it felt nice to have train_loop_per_worker and train_loop_config together, like in the other Trainers - those are all keyword-only arguments regardless (you can't pass them positionally)
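
Because of the bare * in the signature, everything after train_loop_per_worker has to be passed by keyword anyway, so its position in the signature is cosmetic. An illustrative call (the argument values are made up):

trainer = AccelerateTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"lr": 1e-3},              # hypothetical config
    accelerate_config="accelerate_config.yaml",  # hypothetical path
)
# AccelerateTrainer(train_func, {"lr": 1e-3}) would raise a TypeError:
# keyword-only arguments cannot be passed positionally.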

# If changing the version here, also change it in AccelerateTrainer docstring
accelerate==0.5.1; python_version <= '3.6'
accelerate==0.17.1; python_version > '3.6'
deepspeed; python_version > '3.6'

Member:

hmm, now deepspeed will always get installed regardless of whether users actually need it or not?

Member Author (@Yard1):

Only for CI - this should not be included in the Docker images.

It's actually very lightweight, as it does lazy compilation (the package by itself weighs almost nothing and has only a couple of dependencies).

@Yard1 (Member Author) commented Mar 21, 2023:

@gjoliver regarding the doc changes - since those would touch multiple docs & docstrings, how about we do that in a follow-up PR?

@gjoliver (Member):

> @gjoliver regarding the doc changes - since those would touch multiple docs & docstrings, how about we do that in a follow-up PR?

ah, ok, I read the comments in GitHub order, so I didn't see this until now.
yeah, a follow-up PR sounds good to me.

@gjoliver (Member) left a review comment:

ok, I think we need to do some cleanups for some of the examples,
but those can be done in a follow-up PR.

Yard1 added 3 commits March 21, 2023 23:04
@Yard1 merged commit 1ed7c5a into ray-project:master Mar 23, 2023
@Yard1 deleted the accelerate_trainer_2 branch March 23, 2023 00:19
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Labels: none yet · Projects: none yet · 6 participants