
Distributed computing (eg, multi-GPU) support #13

Open
erensezener opened this issue Aug 27, 2021 · 7 comments
Comments

@erensezener

I see that there is some code supporting multi-GPUs, e.g. here and here.

However, I don't see an option/flag to actually utilize distributed computing. Could you clarify?

Thank you.

@fabrahman

@erensezener Did you figure this out? :)

@fabrahman

@gizacard Would you mind providing some instructions on this? Which options should be set? Thanks

@fabrahman

@gizacard I wanted to train using multi-GPU (4 GPUs), so I used local_rank=0 and set the following env variables:

RANK=0
NGPU=4
WORLD_SIZE=4

Although I am not aiming for a Slurm job, the code here requires me to set MASTER_ADDR and MASTER_PORT as well. Why? Anyway, I set them to my server IP and a port.

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.

Can you tell me whether I am doing this correctly? Thanks
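
For what it's worth, a likely reason the training never starts with that setup: with init_method="env://", torch.distributed.init_process_group blocks until all WORLD_SIZE processes have joined the rendezvous, so a single process launched with RANK=0 and WORLD_SIZE=4 waits forever for the other three ranks. That is also why MASTER_ADDR and MASTER_PORT are needed even outside Slurm: they define the rendezvous point. A minimal sketch (not FiD code, single-node assumption):

    import os
    import torch.distributed as dist

    # Assumption: single node; any free port works for the rendezvous.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Blocks here until processes with ranks 0..WORLD_SIZE-1 have all connected.
    dist.init_process_group(
        backend="nccl",        # NCCL backend for multi-GPU training
        init_method="env://",  # reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

Launching one process per GPU (e.g. via torch.distributed.launch, as in the next comment) is what provides the missing ranks.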

@gowtham1997

NGPU=<num of gpus in one node> python -m torch.distributed.launch --nproc_per_node=<num of gpus in one node> train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size <bs> \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --name testing_base_model_nq \
        --checkpoint_dir pretrained_models \
        --accumulation_steps <steps>

Something like this worked for me
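
For reference, each of the nproc_per_node processes started by torch.distributed.launch typically runs something like the pattern below (a generic sketch, not the actual train_reader.py code); the launcher supplies --local_rank / LOCAL_RANK plus RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for you:

    import argparse
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    # Older torch.distributed.launch passes --local_rank as an argument;
    # newer versions export LOCAL_RANK instead.
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Placeholder model, just to show the DDP wrapping pattern.
    model = torch.nn.Linear(16, 16).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])

Note that per_gpu_batch_size is per process, so the effective batch size is roughly per_gpu_batch_size * nproc_per_node * accumulation_steps.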

@Duemoo

Duemoo commented Nov 27, 2022

@fabrahman Could you provide an update on this issue? I have exactly the same issue, and I found that the code freezes without any error message after executing line 194 of train_reader.py

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.
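
One way to narrow down a freeze like this (my own sketch, not code from the repo): run a tiny NCCL all-reduce with the same launcher. If this also hangs, the problem is in NCCL communication between the GPUs rather than in train_reader.py itself, which matches the NCCL_P2P_DISABLE workaround mentioned later in this thread.

    import os
    import torch
    import torch.distributed as dist

    # Assumes the launcher exports LOCAL_RANK (torch >= 1.9);
    # otherwise parse --local_rank as in the snippet above.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # hangs here if the inter-GPU (e.g. P2P) transport is broken
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")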

@szh-max

szh-max commented Jul 11, 2023

@Duemoo I also encountered this problem: using multiple GPUs, I found that the code freezes without any error message after executing line 194 of train_reader.py.
How did you solve it? Thanks!

@bobbyfyb

(quoting @gowtham1997's torch.distributed.launch command from above)

@szh-max Hi, I also encountered this problem and solved it by updating torch to torch==1.10.0 and setting export NCCL_P2P_DISABLE=1, following this thread:
https://discuss.pytorch.org/t/distributed-data-parallel-freezes-without-error-message/8009/29
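
In case it helps others: NCCL_P2P_DISABLE has to be visible before the NCCL communicator is created, so it is usually exported in the shell before launching, but it can also be set at the very top of the training script. An illustration (not FiD code):

    import os

    # Disable NCCL peer-to-peer transport (works around flaky GPU P2P paths).
    # Must be set before init_process_group / the first collective creates
    # the NCCL communicator.
    os.environ["NCCL_P2P_DISABLE"] = "1"

    import torch.distributed as dist
    dist.init_process_group(backend="nccl", init_method="env://")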
