[Train] Implement AccelerateTrainer #33269

Conversation
@@ -1,6 +1,7 @@
# Production requirements. This is what readthedocs.org picks up

# Python / ML libraries
accelerate>=0.17.0
why do we need the actual accelerate library for the docs?
This is because the Python files are imported during the doc build; if a library is missing, the doc build fails. The alternative is to mock the module, but modules like accelerate, which do unorthodox things on import, can still fail even when mocked.
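For context, a minimal sketch of the mocking alternative mentioned above, using Sphinx's `autodoc_mock_imports` (the file path is illustrative, not taken from this PR):

```python
# doc/source/conf.py (illustrative path, not from this PR)
# Sphinx substitutes a stub for each listed module during the doc build,
# so the real package does not need to be installed. Packages that run
# nontrivial code at import time, as accelerate does, can still fail
# under a mock -- hence the real dependency is pinned instead.
autodoc_mock_imports = [
    "accelerate",
]
```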
input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3
turn these into flags?
alternatively, they should be ALL_CAPS if these are constants
this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py
so they are all wrong then: https://peps.python.org/pep-0008/#constants
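For concreteness, a minimal sketch of the two options raised above (the values come from the diff; the flag names are illustrative):

```python
# Option 1: PEP 8 constants -- module-level, ALL_CAPS.
INPUT_SIZE = 1
LAYER_SIZE = 15
OUTPUT_SIZE = 1
NUM_EPOCHS = 3

# Option 2: command-line flags instead of hard-coded values.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input-size", type=int, default=1)
parser.add_argument("--layer-size", type=int, default=15)
parser.add_argument("--output-size", type=int, default=1)
parser.add_argument("--num-epochs", type=int, default=3)
args = parser.parse_args()
```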
print(f"epoch: {epoch}, loss: {loss.item()}")

session.report(
    {},
can you add a comment so we know what this empty dict is?
it's the metrics dict, right? maybe it's better to send the loss as a metric
this is the same as in doc/source/ray-air/doc_code/torch_trainer.py and doc/source/ray-air/doc_code/hvd_trainer.py
wait, why do we want to carry "unclear" stuff into all new examples?
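A sketch of the reviewer's suggestion above, meant to replace the empty dict in the loop shown in the diff (`loss` comes from the surrounding training loop; the Ray AIR session import is the one used elsewhere in this PR):

```python
from ray.air import session

# Report the loss as a metric instead of an empty dict, so it is
# visible in the reported results. `loss` is computed by the
# surrounding training loop shown in the diff above.
session.report({"loss": loss.item()})
```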
.. note::

    You need to use ``session.report()`` to communicate results and checkpoints
    back to Ray Train.
Ray Tune?
Those are Train docs :) The fact that we use Ray Tune under the hood is an implementation detail users don't need to know about here
layer_size = 32
output_size = 1
num_epochs = 200
num_workers = 3
same comment here. can we not duplicate this code?
This is the same as in the TorchTrainer docstring
come in and update all of them at once?
train_loop_per_worker: Union[Callable[[], None], Callable[[Dict], None]],
*,
train_loop_config: Optional[Dict] = None,
accelerate_config: Optional[Union[dict, str, Path, os.PathLike]] = None,
this looks a bit weird - why is the new config param placed second?
move it to the top?
to me it felt nice to have train_loop_per_worker and train_loop_config together, like in the other Trainers - those are all keyword-only arguments regardless (you can't pass them positionally)
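To illustrate the keyword-only point, a generic sketch (not the PR's actual class):

```python
class ExampleTrainer:
    def __init__(
        self,
        train_loop_per_worker,
        *,  # everything after the bare `*` is keyword-only
        train_loop_config=None,
        accelerate_config=None,
    ):
        self.train_loop_per_worker = train_loop_per_worker
        self.train_loop_config = train_loop_config
        self.accelerate_config = accelerate_config


def loop():
    pass


# Keyword-only parameters must be passed by name, so their order in the
# signature is purely cosmetic for callers:
ExampleTrainer(loop, accelerate_config=None, train_loop_config={"lr": 1e-3})
# ExampleTrainer(loop, {"lr": 1e-3})  # TypeError: too many positional args
```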
# If changing the version here, also change it in AccelerateTrainer docstring
accelerate==0.5.1; python_version <= '3.6'
accelerate==0.17.1; python_version > '3.6'
deepspeed; python_version > '3.6'
hmm, now deepspeed will always get installed regardless of whether users actually need it or not?
Only for CI - this should not be included in the Docker images.
It's actually very lightweight, as it does lazy compilation (the package by itself weighs almost nothing and has about 2 dependencies)
@gjoliver regarding the doc changes, seeing as those would concern multiple docs & docstrings, how about we do that in a follow-up PR?
ah, ok, I read the comments in GitHub order, so I didn't see this until now.
ok, I think we need to do some cleanups for some of the examples, but those can be done as a follow-up PR.
Why are these changes needed?

Implements the AccelerateTrainer, providing an integration with Hugging Face Accelerate in Ray Train.
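For readers landing here, a minimal usage sketch (not taken from the PR; the import paths and the toy training loop are assumptions based on the signature shown in the diff above):

```python
import torch
from accelerate import Accelerator

from ray.air import ScalingConfig, session
from ray.train.huggingface import AccelerateTrainer


def train_loop_per_worker(config):
    accelerator = Accelerator()
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 1), torch.randn(64, 1))
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
    # Accelerate wraps model/optimizer/dataloader for the current device setup.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    for _ in range(config["num_epochs"]):
        for x, y in dataloader:
            loss = torch.nn.functional.mse_loss(model(x), y)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        # Report the loss as a metric, per the review discussion above.
        session.report({"loss": loss.item()})


trainer = AccelerateTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "num_epochs": 3},
    accelerate_config=None,  # or a dict / path to an `accelerate` config file
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```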
Related issue number

Checks

- I've signed off every commit (by using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.