[Feature] Add multi machine dist_train #267

Merged: 1 commit, merged on Mar 23, 2022
54 changes: 37 additions & 17 deletions docs/en/quick_run.md
@@ -32,7 +32,7 @@ fake_imgs = sample_unconditional_model(model, 4)

In fact, we already provide a more user-friendly demo script. You can use [demo/unconditional_demo.py](https://github.com/open-mmlab/mmgeneration/tree/master/mmgen/demo/unconditional_demo.py) with the following command:

-```bash
+```shell
python demo/unconditional_demo.py \
${CONFIG_FILE} \
${CHECKPOINT} \
@@ -69,7 +69,7 @@ fake_imgs = sample_conditional_model(model, 4, label=[0, 1, 2, 3])

In fact, we already provide a more user-friendly demo script. You can use [demo/conditional_demo.py](https://github.com/open-mmlab/mmgeneration/tree/master/mmgen/demo/conditional_demo.py) with the following command:

-```bash
+```shell
python demo/conditional_demo.py \
${CONFIG_FILE} \
${CHECKPOINT} \
@@ -108,7 +108,7 @@ translated_image = sample_img2img_model(model, image_path, target_domain='photo'

In fact, we already provide a more user-friendly demo script. You can use [demo/translation_demo.py](https://github.com/open-mmlab/mmgeneration/tree/master/mmgen/demo/translation_demo.py) with the following command:

-```bash
+```shell
python demo/translation_demo.py \
${CONFIG_FILE} \
${CHECKPOINT} \
@@ -126,7 +126,7 @@ This section details how to prepare the dataset for MMGeneration and provides a

Preparing datasets for unconditional models is much easier. First, make a directory named `data` in the MMGeneration project root. After that, any dataset can be used by creating a symlink (soft link) to it.

-```bash
+```shell
mkdir data

ln -s absolute_path_to_dataset ./data/dataset_name
@@ -174,15 +174,15 @@ Here, we provide download links of datasets used in [Pix2Pix](http://efrosgans.e
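As a concrete sketch of the symlink setup described above (the dataset name and path here are made-up placeholders, and the work happens in a throwaway temporary directory):

```shell
# Create a scratch project root so nothing real is touched.
root=$(mktemp -d)
cd "$root"

# The pattern from the docs: a `data` directory holding symlinks to datasets.
mkdir data
# `ln -s` records the target path; it works even before the target exists.
ln -s /absolute/path/to/ffhq ./data/ffhq

# The link now resolves to the recorded absolute path.
readlink ./data/ffhq
```

Keeping the actual datasets outside the repository and linking them in means only the links need recreating when storage is mounted at a different path.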
Currently, we have tested all of the models with distributed training. Thus, we highly recommend adopting distributed training with our scripts. The basic usage is as follows:

```shell
-bash tools/dist_train.sh ${CONFIG_FILE} ${GPUS_NUMBER} \
+sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS_NUMBER} \
--work-dir ./work_dirs/experiments/experiments_name \
[optional arguments]
```

If you are using a slurm system, the following command can help you start training:

```shell
-bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${WORK_DIR} \
+sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${WORK_DIR} \
[optional arguments]
```

@@ -191,13 +191,33 @@ These two scripts wrap [tools/train.py](https://github.com/open-mmlab/mmgenerati
Note that `work_dirs` has already been added to our `.gitignore` file, so users can put any files there without worrying about git-tracked files. Here is an example command that we use to train our `1024x1024 StyleGAN2` model.

```shell
-bash tools/slurm_train.sh openmmlab-platform stylegan2-1024 \
+sh tools/slurm_train.sh openmmlab-platform stylegan2-1024 \
configs/styleganv2/stylegan2_c2_ffhq_1024_b4x8.py \
work_dirs/experiments/stylegan2_c2_ffhq_1024_b4x8
```

During training, log files and checkpoints will be saved to the working directory. Early in development, we evaluated our models only after training finished. Now, however, an evaluation hook is supported so that models can be evaluated during training. More details can be found in our tutorial on runtime configuration.

## Training with multiple machines

If you launch with multiple machines simply connected via Ethernet, you can run the following commands:

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```
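Internally, `tools/dist_train.sh` resolves these environment variables with shell default expansion, falling back to single-machine values when they are unset. A sketch using the defaults that appear in the script (the `unset` is only there to make the demo deterministic):

```shell
# Start from a clean slate so the fallbacks are what actually apply here.
unset NNODES NODE_RANK PORT MASTER_ADDR

# The ${VAR:-default} form keeps an exported value if one exists,
# otherwise substitutes the single-machine default.
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

echo "nnodes=$NNODES node_rank=$NODE_RANK master=$MASTER_ADDR:$PORT"
```

So the multi-machine launch above is just the single-machine launch with these four variables overridden per node.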

Training is usually slow if you do not have high-speed networking such as InfiniBand.

If you launch with slurm, the command is the same as for a single machine described above, but you need to refer to [slurm_train.sh](https://github.com/open-mmlab/mmgeneration/blob/master/tools/slurm_train.sh) to set the appropriate parameters and environment variables.

## Training on CPU

The process of training on the CPU is consistent with single-GPU training; we just need to disable the GPUs before training.
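One common way to do that (assuming your PyTorch build respects `CUDA_VISIBLE_DEVICES`; the config path is a placeholder) is to hide all GPUs from the process:

```shell
# Setting CUDA_VISIBLE_DEVICES to -1 exposes no CUDA devices to the process,
# so frameworks that honor it fall back to the CPU.
export CUDA_VISIBLE_DEVICES=-1

# Then launch training exactly as in the single-GPU case, e.g.:
# python tools/train.py ${CONFIG_FILE}

echo "$CUDA_VISIBLE_DEVICES"
```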
@@ -231,12 +251,12 @@ metrics = dict(
Then, users can use the evaluation script with the following command:

```shell
-bash eval.sh ${CONFIG_FILE} ${CKPT_FILE} --batch-size 10 --online
+sh eval.sh ${CONFIG_FILE} ${CKPT_FILE} --batch-size 10 --online
```
If you are in a slurm environment, please switch to [tools/slurm_eval.sh](https://github.com/open-mmlab/mmgeneration/tree/master/tools/slurm_eval.sh) using the following command:

```shell
-bash slurm_eval.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} \
+sh slurm_eval.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} \
--batch-size 10
--online
```
@@ -245,10 +265,10 @@ As you can see, we have provided two modes for evaluating your models, i.e., `on

```shell
# for general envs
-bash eval.sh ${CONFIG_FILE} ${CKPT_FILE} --eval none
+sh eval.sh ${CONFIG_FILE} ${CKPT_FILE} --eval none

# for slurm
-bash slurm_eval.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} \
+sh slurm_eval.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} \
--eval none
```

@@ -260,19 +280,19 @@ python tools/utils/translation_eval.py ${CONFIG_FILE} ${CKPT_FILE} --t ${target-
Note that, in the current version of MMGeneration, we support multi-GPU [FID](#fid) and [IS](#is) evaluation and image saving. You can use the following commands to use this feature:
```shell
# online evaluation
-bash dist_eval.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER} --batch-size 10 --online
+sh dist_eval.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER} --batch-size 10 --online
# online evaluation with slurm
-bash slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} --batch-size 10 --online
+sh slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} --batch-size 10 --online

# offline evaluation
-bash dist_eval.sh${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER}
+sh dist_eval.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER}
# offline evaluation with slurm
-bash slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE}
+sh slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE}

# image saving
-bash dist_eval.sh${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER} --eval none --samples-path ${SAMPLES_PATH}
+sh dist_eval.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPUS_NUMBER} --eval none --samples-path ${SAMPLES_PATH}
# image saving with slurm
-bash slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} --eval none --samples-path ${SAMPLES_PATH}
+sh slurm_eval_multi_gpu.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE} --eval none --samples-path ${SAMPLES_PATH}
```
In subsequent versions, multi-GPU evaluation for more metrics will be supported.

18 changes: 15 additions & 3 deletions tools/dist_eval.sh
@@ -1,10 +1,22 @@
#!/usr/bin/env bash

CONFIG=$1
-CKPT=$2
+CHECKPOINT=$2
GPUS=$3
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    tools/evaluation.py ${CONFIG} ${CKPT} --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/test.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
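The trailing `${@:4}` is what forwards any extra flags (such as `--batch-size 10 --online`) to the evaluation entry point: it expands to all positional arguments from the fourth one onward. A minimal bash sketch of that expansion:

```shell
# Simulate the script's arguments: config, checkpoint, GPU count, extra flags.
set -- config.py ckpt.pth 8 --batch-size 10 --online

# "${@:4}" is a bash slice: positional arguments 4 through the end.
echo "${@:4}"
```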
15 changes: 13 additions & 2 deletions tools/dist_train.sh
@@ -2,8 +2,19 @@

CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    tools/train.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/train.py \
+    $CONFIG \
+    --seed 0 \
+    --launcher pytorch ${@:3}