Understanding progress bar #4225
@Limtle could you please share a complete example to reproduce? Does it happen only on DDP? Are all 16 updated in parallel or just the last one?
Sorry, the code can't be shared, but the following is my log file.
@Borda I have another example with pytorch_lightning: using 1 node (8 GPUs) also gives the same situation.
Environment
Code
Slurm
Log
@Borda How can I check whether all 16 are updated in parallel or just the last one?
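One way to check (a minimal sketch, not from this thread) is to tag each bar with its DDP rank. It assumes a pytorch_lightning version where `LightningModule.get_progress_bar_dict()` can be overridden and `self.trainer.global_rank` is available; `RankTaggedModule` is a hypothetical name:

```python
import pytorch_lightning as pl


class RankTaggedModule(pl.LightningModule):
    # ... your usual __init__ / forward / training_step ...

    def get_progress_bar_dict(self):
        # Start from the default bar items (loss, v_num, ...) and add the
        # global DDP rank, so each of the 16 bars shows which process
        # printed it.
        items = super().get_progress_bar_dict()
        items["rank"] = self.trainer.global_rank
        return items
```

If every rank appears in the output, all 16 processes are writing bars to the same terminal in parallel; if only one rank appears, a single bar is being rewritten.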
Closed by #4437
When training or validating on 2 nodes (8 GPUs per node), Lightning shows 16 copies of the same progress bar, each with a different loss, like:
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=9.858, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.225, v_num=193413]
...
This means the output has 16 progress bars when training on 16 GPUs. I suppose the samples used for training differ on each GPU, which leads to the different progress bars. A similar situation also shows up during validation, so I am wondering whether different samples are distributed to each GPU during validation.
I use a map-style Dataset for training and an IterableDataset for validation.
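For context, a hedged sketch of why the samples differ: under DDP, Lightning wraps a map-style Dataset in a DistributedSampler by default (the `replace_sampler_ddp` Trainer flag in 1.x-era versions), so each GPU trains on a disjoint shard. An IterableDataset is not sharded automatically, so every process may iterate the full validation stream unless you shard by rank yourself; `RankShardedStream` below is a hypothetical example, not Lightning's own behavior:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset


class RankShardedStream(IterableDataset):
    """Hypothetical iterable validation set, sharded manually by DDP rank."""

    def __init__(self, items):
        self.items = items

    def __iter__(self):
        # Fall back to a single shard when not running under DDP.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        # Yield every `world`-th item, offset by this process's rank, so
        # the 16 GPUs validate on disjoint slices instead of each seeing
        # the full stream.
        for i, item in enumerate(self.items):
            if i % world == rank:
                yield item
```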