# Add DeepSpeed Example with Pytorch Operator #2235
**Dockerfile**

```dockerfile
FROM deepspeed/deepspeed:v072_torch112_cu117

RUN apt update
RUN apt install -y ninja-build

WORKDIR /
COPY requirements.txt .
COPY train_bert_ds.py .

RUN pip install -r requirements.txt
RUN mkdir -p /root/deepspeed_data
```
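As a quick, optional sanity check after building, the image can be run locally to confirm the dependencies import; this is an illustrative sketch and assumes the base image's default `python` environment has torch and deepspeed available:

```shell
# Hypothetical smoke test: verify torch and deepspeed import inside the built image.
docker run --rm kubeflow/pytorch-deepspeed-demo:latest \
  python -c "import torch, deepspeed; print(torch.__version__, deepspeed.__version__)"
```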
## Training a Masked Language Model with PyTorch and DeepSpeed

This folder contains an example of training a Masked Language Model with PyTorch and DeepSpeed.

The Python script used here trains BERT with PyTorch and DeepSpeed. For more information, please refer to [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md).

DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate; see [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).

This guide shows how to deploy DeepSpeed with the `torchrun` launcher.
The simplest way to reproduce the following is to check out this DeepSpeedExamples commit:

```shell
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples
git checkout efacebb
```

The script `train_bert_ds.py` is located in the `DeepSpeedExamples/HelloDeepSpeed/` directory.
Since the script is not launched with the deepspeed launcher, it needs to read the local rank from the environment.
The following line has been added at line 670:

```python
local_rank = int(os.getenv('LOCAL_RANK', '-1'))
```
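For context, here is a minimal sketch of how a torchrun-provided `LOCAL_RANK` is typically consumed alongside DeepSpeed. It is illustrative only, not the exact code in `train_bert_ds.py`; `model` and `ds_config` are placeholders:

```python
import os

import torch
import deepspeed

# torchrun sets LOCAL_RANK for each process it spawns; -1 means "not distributed".
local_rank = int(os.getenv('LOCAL_RANK', '-1'))
if local_rank != -1:
    torch.cuda.set_device(local_rank)  # bind this process to its GPU

# deepspeed.initialize reads the remaining distributed settings
# (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) from the environment:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```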
### Build Image

The default image name and tag is `kubeflow/pytorch-deepspeed-demo:latest`.

```shell
docker build -f Dockerfile -t kubeflow/pytorch-deepspeed-demo:latest ./
```
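If the cluster nodes cannot pull a locally built image, you will typically need to push it to a registry the cluster can reach and point the manifest's `image` field at it. The registry name below is a placeholder, not part of this example:

```shell
docker tag kubeflow/pytorch-deepspeed-demo:latest <your-registry>/pytorch-deepspeed-demo:latest
docker push <your-registry>/pytorch-deepspeed-demo:latest
```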
### Create the PyTorchJob with DeepSpeed example

```shell
kubectl create -f pytorch_deepspeed_demo.yaml
```
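Once the job is created, its status and the training logs can be checked with standard kubectl commands. The label and pod names below assume the usual training-operator naming conventions (`<job-name>-master-0`, `<job-name>-worker-0`):

```shell
kubectl get pytorchjob pytorch-deepspeed-demo
kubectl get pods -l training.kubeflow.org/job-name=pytorch-deepspeed-demo
kubectl logs -f pytorch-deepspeed-demo-master-0
```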
**pytorch_deepspeed_demo.yaml**

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
```
Review thread on the `Master` replica:

> Why do you need a Master replica for this example?

> Actually, the complete command passes those values explicitly; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env), so the command can be simplified.

> Yeah, I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when a Master replica is set. Eventually, you don't need a dedicated Master replica if the PodTemplateSpec is the same between all nodes.

> Yes, so we need the Master replica for this example.
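The exact commands from this thread are not reproduced above. Purely as an illustration of the point being made, an explicit invocation versus the simplified one could look like the following; the node and process counts are assumptions, not values taken from this example:

```shell
# Explicit form: rendezvous settings passed as flags, taken from the operator-set env.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=$RANK \
  --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
  /train_bert_ds.py --checkpoint_dir /root/deepspeed_data

# Simplified form: rely on the environment variables injected by the training operator.
torchrun /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
```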
```yaml
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
```
Review thread on the `torchrun` arguments:

> Do we need to set the `nnodes` and `nproc_per_node` arguments for `torchrun` here?

> Yes, I didn't set them during my testing because there was only one GPU on the node. I have added them now.

> @Syulin7 No, actually, you don't need to set those parameters for `torchrun`. If you set them, the arguments will override the env, which will work without a doubt, but we don't encourage users to do it this way. Since we use the operator, we leave that stuff to the operator.

> @kuizhiqing @Syulin7 Aren't those ENV variables deprecated right now, and should we just use the `torchrun` arguments instead?

> @andreyvelich I'm not aware of any deprecation of ENV variable usage on the PyTorch side. As for the operator, I think it should take care of all that stuff for the user; that's one reason why people should use the operator. Otherwise, people can use a Job or even a StatefulSet or some other kind of interface to run pods in Kubernetes and then launch the distributed job with torchrun arguments. So, in my opinion, one can run a PyTorch distributed job simply with `torchrun` and the script, leaving the rest to the operator.

> Make sense @kuizhiqing.

> @andreyvelich Yes, you are right. I don't think there are docs that specifically explain the env usage, since arguments and envs are processed equally. Using envs is the right way for the operator to interact with the program, and it was the original design of the operator, c.f. #1573. It is indeed a little confusing for users to understand the meaning of all the envs, especially when it comes to the PET_ prefix and switching between frameworks, c.f. #1840. One more thing worth mentioning: the arguments cannot be changed dynamically, while the envs can (I mean by the operator, e.g. when restarting a pod; the env will not change after process creation).

> That makes sense.

> @kuizhiqing @andreyvelich Agree, we should rely on the Training Operator to set up the envs.
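To confirm which variables the operator injects, one can inspect the environment of a running pod; the pod name below assumes the standard `<job-name>-master-0` convention:

```shell
kubectl exec pytorch-deepspeed-demo-master-0 -- env | grep -E 'MASTER_ADDR|MASTER_PORT|WORLD_SIZE|RANK'
```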
```yaml
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
```
**requirements.txt**

```
datasets==1.13.3
transformers==4.5.1
fire==0.4.0
pytz==2021.1
loguru==0.5.3
sh==1.14.2
pytest==6.2.5
tqdm==4.62.3
```
Review discussion:

> Do we set the appropriate env variables for the `deepspeed` or `accelerate` launchers in PyTorchJob, or can only `torchrun` be used?
> When using the `deepspeed` launcher, it defaults to using pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode. The mpi-operator in the training operator executes through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables and reads the information from the hostfile.
> About the hostfile, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187
> I can add an example in mpi-operator (MPI v2) later.
> In PyTorchJob, `torchrun` and `accelerate` can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
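For reference, the hostfile mentioned above lists one host per line with a slot (GPU) count; the hostnames here are only illustrative:

```
worker-1 slots=4
worker-2 slots=4
```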
> Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
> So we can build a specific `deepspeed` runtime that leverages MPI orchestration to create hostfiles. As far as I know, accelerate is compatible with torchrun. However, it might have some additional parameters that `torchrun` doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22
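As a hypothetical illustration of that point, an `accelerate` launch can carry extra settings such as mixed precision that the plain `torchrun` command line in this example does not expose; whether this example's script accepts them is a separate question:

```shell
# Illustrative only: accelerate-specific knobs alongside the usual script arguments.
accelerate launch --multi_gpu --mixed_precision fp16 /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
```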
> `deepspeed` is already compatible with mpi-operator (the one outside of training-operator). Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610
> Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing
> I'm happy to accept a PR for this in the mpi-operator repo.
> I think once we merge this PR, we can refer to this training script in the MPI-Operator repo as well and add a simple YAML with an MPIJob.
> @Syulin7 Yes, thanks for your original work on the base image. The plan in kubeflow/mpi-operator#610 has somewhat stalled for some reason. You are very welcome to continue it.