Add DeepSpeed Example with Pytorch Operator #2235

Merged: 1 commit, merged on Oct 17, 2024
4 changes: 4 additions & 0 deletions .github/workflows/publish-example-images.yaml
@@ -73,3 +73,7 @@ jobs:
platforms: linux/amd64,linux/arm64
dockerfile: examples/jax/cpu-demo/Dockerfile
context: examples/jax/cpu-demo
- component-name: pytorch-deepspeed-demo
platforms: linux/amd64
dockerfile: examples/pytorch/deepspeed-demo/Dockerfile
context: examples/pytorch/deepspeed-demo
11 changes: 11 additions & 0 deletions examples/pytorch/deepspeed-demo/Dockerfile
@@ -0,0 +1,11 @@
FROM deepspeed/deepspeed:v072_torch112_cu117

RUN apt update
RUN apt install -y ninja-build

WORKDIR /
COPY requirements.txt .
COPY train_bert_ds.py .

RUN pip install -r requirements.txt
RUN mkdir -p /root/deepspeed_data
37 changes: 37 additions & 0 deletions examples/pytorch/deepspeed-demo/README.md
@@ -0,0 +1,37 @@
## Training a Masked Language Model with PyTorch and DeepSpeed

This folder contains an example of training a Masked Language Model with PyTorch and DeepSpeed.

The Python script is used to train BERT with PyTorch and DeepSpeed. For more information, please refer to [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md).

DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
Comment on lines +7 to +8
Member

Do we set the appropriate env variables for the deepspeed or accelerate launchers in PyTorchJob, or can only torchrun be used?

Contributor Author
@Syulin7 Sep 6, 2024

When using the deepspeed launcher, it defaults to using pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode.

The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables and reads information from the hostfile.

```shell
# deepspeed's --hostfile path defaults to /job/hostfile
deepspeed --hostfile=/etc/mpi/hostfile /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
```

About the hostfile, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187

```
# hostfile
worker-1 slots=4
worker-2 slots=4
```

I can add an example in mpi-operator (mpi v2) later.

In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.

Member

> The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate.

Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
So we can build a specific DeepSpeed runtime that will leverage MPI orchestration to create hostfiles.

> In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.

As far as I know, accelerate is compatible with torchrun. However, it might have some additional parameters that torchrun doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22


deepspeed is already compatible with mpi-operator (the one outside of training-operator)

Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610

Contributor Author

> deepspeed is already compatible with mpi-operator (the one outside of training-operator)
>
> Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610

Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing


I'm happy to accept a PR for this in the mpi-operator repo.

Member

I think once we merge this PR, we can reference this training script in the MPI-Operator repo as well and add a simple YAML with MPIJob.

Member

@Syulin7 Yes, thanks to your original work on the base image. The plan in kubeflow/mpi-operator#610 has stalled for some reason. You are very welcome to continue it.


This guide will show you how to deploy DeepSpeed with the `torchrun` launcher.
The simplest way to quickly reproduce the following is to check out this DeepSpeedExamples commit:
```shell
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples
git checkout efacebb
```

The script `train_bert_ds.py` is located in the `DeepSpeedExamples/HelloDeepSpeed/` directory.
Since the script is not launched using the deepspeed launcher, it needs to read the local rank from the environment.
The following line has been added at line 670:
```python
local_rank = int(os.getenv('LOCAL_RANK', '-1'))
```
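
As a rough illustration of why this matters (this is not the actual `train_bert_ds.py`; the tiny model and the DeepSpeed config dictionary below are placeholders), a torchrun-launched DeepSpeed script typically consumes `LOCAL_RANK` along these lines:

```python
# Minimal sketch (not the actual train_bert_ds.py): how a script launched by
# torchrun typically reads LOCAL_RANK and hands the model to DeepSpeed.
import os

import torch
import deepspeed

# torchrun exports LOCAL_RANK for every worker process; -1 means the script
# was started without a distributed launcher.
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
if local_rank >= 0:
    torch.cuda.set_device(local_rank)

# Placeholder model and DeepSpeed config, only to make the sketch self-contained.
model = torch.nn.Linear(128, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model into a DeepSpeed engine; rank and
# world-size information is taken from the environment set by torchrun.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```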

### Build Image

The default image name and tag is `kubeflow/pytorch-deepspeed-demo:latest`.

```shell
docker build -f Dockerfile -t kubeflow/pytorch-deepspeed-demo:latest ./
```

### Create the PyTorchJob with the DeepSpeed example

```shell
kubectl create -f pytorch_deepspeed_demo.yaml
```
38 changes: 38 additions & 0 deletions examples/pytorch/deepspeed-demo/pytorch_deepspeed_demo.yaml
@@ -0,0 +1,38 @@
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
Member

Why do you need a Master replica for this example?

Contributor Author

Actually, the complete command is as follows; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env):

```shell
# node1
torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>

# node2
torchrun --nproc_per_node=8 --nnode=2 --node_rank=1 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
```

So the command can be simplified as follows:

```shell
torchrun --nproc_per_node=8 --nnode=2 your_program.py <normal cl args>
```

See: https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#multi-node-deployment
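
To make the above concrete, here is a small illustrative snippet (not part of this PR) that prints the rendezvous variables a process would see inside a PyTorchJob pod; MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK are set by the training operator, and torchrun additionally exports LOCAL_RANK to each worker process it spawns.

```python
# Illustrative only: inspect the rendezvous variables inside a PyTorchJob pod.
# MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK come from the training operator;
# LOCAL_RANK is added per process by torchrun.
import os

for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var}={os.getenv(var, '<unset>')}")
```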

Member

Yeah, I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when the Master replica is set. Eventually, you don't need a dedicated Master replica if the PodTemplateSpec is the same between all nodes.

Contributor Author

> I think we have a problem with the V1 Training Operator in that we only set MASTER_PORT when the Master replica is set.

Yes, so we need the Master replica for this example.

      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
Member

Do we need to set the `--nnodes` and `--nproc_per_node` parameters for torchrun?

Contributor Author

Yes, I didn't set them during my testing because there was only one GPU on the node. I have added them now.

Member

@Syulin7 No, actually, you do not need to set those parameters for torchrun; setting the correct environment-related parameters (or using env in the operator case) is the responsibility of the operator.

If you set them, the parameters will overwrite the env, which will work no doubt, but we don't encourage our users to use it this way. Since we use the operator, we leave that stuff to the operator.

Member
@andreyvelich Oct 3, 2024

@kuizhiqing @Syulin7 Aren't those ENV variables deprecated right now, and should we just use the torchrun CLI arguments?
I remember that we discussed with @tenzen-y previously that PET_NPROC_PER_NODE and PET_NNODES are deprecated.

Member

@andreyvelich I'm not aware of any deprecation of ENV variable usage on the PyTorch side.
The link mentions that --use-env is deprecated for torchrun, meaning that usage of args like `torchrun --use-env` is no longer valid. Anyway, the CLI arguments and the environment are both valid and equivalent ways to pass environment-related information to the PyTorch framework.

For the operator, I think the operator should take care of all that kind of stuff for the user; that's one reason why people should use the operator. Otherwise, people could use a Job or even a StatefulSet or some other kind of interface to run pods in Kubernetes, and then launch the distributed job by passing arguments to torchrun.

So, in my opinion, to run a PyTorch distributed job one should be able to do it simply with `torchrun train.py`; the operator makes that happen.

Member

Makes sense, @kuizhiqing.
I see, I found how the PET_ env variables are used with torchrun:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/argparse_util.py#L44-L46
Do you know if they have any PyTorch docs which explain it?
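
For readers following the link: `argparse_util.py` is what lets torchrun fill its flags from `PET_`-prefixed environment variables. A simplified sketch of that pattern (not the actual PyTorch code; the `EnvDefault` class below is a hypothetical stand-in) could look like this:

```python
# Simplified sketch of the pattern in torch.distributed.argparse_util:
# an argparse action whose default falls back to an environment variable
# (torchrun uses PET_<DEST> names such as PET_NNODES, PET_NPROC_PER_NODE).
import argparse
import os


class EnvDefault(argparse.Action):
    def __init__(self, env_var, default=None, **kwargs):
        # Prefer the environment variable when it is present.
        default = os.environ.get(env_var, default)
        super().__init__(default=default, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)


parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", action=EnvDefault, env_var="PET_NNODES", default="1")
parser.add_argument("--nproc-per-node", action=EnvDefault, env_var="PET_NPROC_PER_NODE", default="1")

args = parser.parse_args([])  # no CLI flags: values come from PET_* env vars if set
print(args.nnodes, args.nproc_per_node)
```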

Member
@kuizhiqing Oct 5, 2024

@andreyvelich Yes, you are right. I don't think they have docs specifically explaining the env usage, since arguments and envs are processed equally.

Using env is the right way for the operator to interact with the program, and it was the original design of the operator, c.f. #1573.

Well, it is indeed a little bit confusing for users to understand the meaning of all the envs, especially when it comes to the PET_ prefix and switching between frameworks, c.f. #1840.

One more thing worth mentioning: the arguments cannot be changed dynamically (I mean by the operator, with a pod restart or something; they will not change after process creation), while env can.

Member
@andreyvelich Oct 7, 2024

That makes sense.
@Syulin7 In that case, let's rely on the Training Operator to set up those torchrun arguments based on the PyTorchJob specification.

Contributor Author

> So, in my opinion, to run a PyTorch distributed job one should be able to do it simply with `torchrun train.py`; the operator makes that happen.

@kuizhiqing @andreyvelich Agreed, we should rely on the Training Operator to set up the envs.

                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
8 changes: 8 additions & 0 deletions examples/pytorch/deepspeed-demo/requirements.txt
@@ -0,0 +1,8 @@
datasets==1.13.3
transformers==4.5.1
fire==0.4.0
pytz==2021.1
loguru==0.5.3
sh==1.14.2
pytest==6.2.5
tqdm==4.62.3