Hi all,

I spent today trying to figure out why the code exits with the error below when run with 8 GPUs on one node. This is the command I ran:

`python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir`

This is the exit output of the cluster I am running this on:

`TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 1.
Resource usage summary:
CPU time : 1294.38 sec.
Max Memory : 66465 MB
Average Memory : 3062.34 MB
Total Requested Memory : 256000.00 MB
Delta Memory : 189535.00 MB
Max Swap : -
Max Processes : 35
Max Threads : 2482
Run time : 482 sec.
Turnaround time : 526 sec.
The output (if any) is above this job summary.`

The thread limit was very high, so the limit itself was not the issue. The issue seems to be with the DataLoader: the code seems to create far more threads than necessary (the summary above reports a maximum of 2482 threads across 35 processes), which triggers this error. To fix this, add a `--num_workers 0` argument. Hope that helps someone!
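To check whether the DataLoader workers really account for the thread count, one option is to log the number of OS threads per process, once with the default worker count and once with `--num_workers 0`. Below is a minimal sketch, assuming a Linux node with `/proc` mounted; `count_threads` is a hypothetical helper written for illustration, not part of the repository or of PyTorch.

```python
import os
import threading


def count_threads():
    """Return the number of threads in the current process.

    On Linux this reads the `Threads:` field of /proc/self/status, which
    counts all OS threads, including those started by native libraries.
    """
    status_path = "/proc/self/status"
    if os.path.exists(status_path):
        with open(status_path) as status_file:
            for line in status_file:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    # Assumption: without /proc, only threads started from Python are visible.
    return threading.active_count()


if __name__ == "__main__":
    print(f"pid {os.getpid()} currently has {count_threads()} thread(s)")
```

Calling this from each rank during training and comparing the numbers with the `Max Threads` line of the LSF summary should show whether the total actually drops with `--num_workers 0`.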