
Issues with training on a single node #101

Open
andics opened this issue Nov 19, 2023 · 0 comments

andics commented Nov 19, 2023

Hi all,

Essentially, I spent today trying to figure out why the code exits with the error message below when run with 8 GPUs on one node. This is the command I ran:

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir
```

```
TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 1.

Resource usage summary:

CPU time :                                   1294.38 sec.
Max Memory :                                 66465 MB
Average Memory :                             3062.34 MB
Total Requested Memory :                     256000.00 MB
Delta Memory :                               189535.00 MB
Max Swap :                                   -
Max Processes :                              35
Max Threads :                                2482
Run time :                                   482 sec.
Turnaround time :                            526 sec.

The output (if any) is above this job summary.
```

This is the exit output of the cluster I am running on. The thread limit was set very high, so that was not the issue. The problem seems to be with the DataLoader: the code creates far more threads than necessary, which trips the LSF limit. To fix this, add a `--num_workers 0` argument to the launch command (see the sketch below). Hope that helps someone.
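For context, here is a minimal sketch of how a `--num_workers` flag is typically wired into a PyTorch DataLoader and why setting it to 0 keeps the thread count down. This is not the repository's actual code; the flag name, dataset, and batch size here are assumptions for illustration only.

```python
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset

# Sketch only: parse a --num_workers flag the same way a training script might.
parser = argparse.ArgumentParser()
parser.add_argument("--num_workers", type=int, default=0)
args, _ = parser.parse_known_args()

# Dummy dataset standing in for the real GQA dataset used by main.py.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,)))

# With num_workers=0 the batches are loaded in the main process, so no extra
# worker processes are spawned and the per-job thread count stays low.
loader = DataLoader(dataset, batch_size=32, num_workers=args.num_workers)

for images, labels in loader:
    pass  # the training step would go here
```

With `num_workers > 0`, each worker runs as a separate process that can spawn its own threads (e.g. for OpenMP/MKL), which is likely what pushed the job over the LSF thread limit across 8 launched ranks.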
