
Distributed computing (eg, multi-GPU) support #13

Open
erensezener opened this issue Aug 27, 2021 · 7 comments
Comments

@erensezener

I see that there is some code supporting multi-GPUs, e.g. here and here.

However, I don't see an option/flag to actually utilize distributed computing. Could you clarify?

Thank you.

@fabrahman

@erensezener Did you figure this out? :)

@fabrahman

@gizacard Would you mind providing some instructions on this? Which options should be set? Thanks

@fabrahman

@gizacard I wanted to train using multi-GPU (4 GPUs), so I used local_rank=0 and set the following env variables:

RANK=0
NGPU=4
WORLD_SIZE=4

Although I am not aiming for a Slurm job, the code here requires me to set MASTER_ADDR and MASTER_PORT as well. Why? Anyway, I set them to my server IP and a port.

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.

Can you tell me whether I am doing this correctly? Thanks
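
For what it's worth, a likely reason the training never starts with that setup: with init_method="env://", torch.distributed.init_process_group blocks until all WORLD_SIZE processes have joined the rendezvous, so a single process launched with RANK=0 and WORLD_SIZE=4 waits forever for the other three ranks. That is also why MASTER_ADDR and MASTER_PORT are needed even outside Slurm: they define the rendezvous point. A minimal sketch (not FiD code, single-node assumption):

    import os
    import torch.distributed as dist

    # Assumption: single node; any free port works for the rendezvous.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Blocks here until processes with ranks 0..WORLD_SIZE-1 have all connected.
    dist.init_process_group(
        backend="nccl",        # NCCL backend for multi-GPU training
        init_method="env://",  # reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

Launching one process per GPU (e.g. via torch.distributed.launch, as in the next comment) is what provides the missing ranks.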

@gowtham1997

NGPU=<num of gpus in one node> python -m torch.distributed.launch --nproc_per_node=<num of gpus in one node> train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size <bs> \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --name testing_base_model_nq \
        --checkpoint_dir pretrained_models \
        --accumulation_steps <steps>

Something like this worked for me
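
For reference, each of the nproc_per_node processes started by torch.distributed.launch typically runs something like the pattern below (a generic sketch, not the actual train_reader.py code); the launcher supplies --local_rank / LOCAL_RANK plus RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for you:

    import argparse
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    # Older torch.distributed.launch passes --local_rank as an argument;
    # newer versions export LOCAL_RANK instead.
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Placeholder model, just to show the DDP wrapping pattern.
    model = torch.nn.Linear(16, 16).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])

Note that per_gpu_batch_size is per process, so the effective batch size is roughly per_gpu_batch_size * nproc_per_node * accumulation_steps.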

@Duemoo

Duemoo commented Nov 27, 2022

@fabrahman Could you provide an update on this issue? I have exactly the same issue, and I found that the code freezes without any error message after executing line 194 of train_reader.py

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.
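
One way to narrow down a freeze like this (my own sketch, not code from the repo): run a tiny NCCL all-reduce with the same launcher. If this also hangs, the problem is in NCCL communication between the GPUs rather than in train_reader.py itself, which matches the NCCL_P2P_DISABLE workaround mentioned later in this thread.

    import os
    import torch
    import torch.distributed as dist

    # Assumes the launcher exports LOCAL_RANK (torch >= 1.9);
    # otherwise parse --local_rank as in the snippet above.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # hangs here if the inter-GPU (e.g. P2P) transport is broken
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")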

@szh-max

szh-max commented Jul 11, 2023

@Duemoo I also encountered this problem: using multiple GPUs, I found that the code freezes without any error message after executing line 194 of train_reader.py.
How did you solve it? Thanks!

@bobbyfyb

(quoting @gowtham1997's torch.distributed.launch command from above)

@szh-max Hi, I also encountered this problem and solved it by updating torch to torch==1.10.0 and setting export NCCL_P2P_DISABLE=1, following this thread:
https://discuss.pytorch.org/t/distributed-data-parallel-freezes-without-error-message/8009/29
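
In case it helps others: NCCL_P2P_DISABLE has to be visible before the NCCL communicator is created, so it is usually exported in the shell before launching, but it can also be set at the very top of the training script. An illustration (not FiD code):

    import os

    # Disable NCCL peer-to-peer transport (works around flaky GPU P2P paths).
    # Must be set before init_process_group / the first collective creates
    # the NCCL communicator.
    os.environ["NCCL_P2P_DISABLE"] = "1"

    import torch.distributed as dist
    dist.init_process_group(backend="nccl", init_method="env://")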
