DistributedEvalSampler hangs at the end of the script when using DDP #44
Comments
Hi @vaseline555,
self.model.model.G.require_backward_grad_sync = False # compute without DDP sync
Best,
Dear @SeungjunNah,
Thank you for your detailed answers. Although my code has no typical synchronization operation, I was calling a barrier at the end of every iteration rather than at the end of every epoch. I fixed it by moving the barrier to the end of the epoch, and now things are going well. Have a nice day!
Sincerely,
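The barrier placement described above can be illustrated without GPUs. The following is a minimal sketch, not the code from this thread: `threading.Barrier` stands in for a collective barrier, the helper `run_ranks` and the per-rank batch counts are hypothetical, and a timeout is used so the simulation terminates where a real job would keep waiting.

```python
import threading


def run_ranks(batches_per_rank, barrier_every_iteration, timeout=1.0):
    """Simulate one evaluation epoch and return each rank's outcome.

    batches_per_rank: hypothetical number of batches each rank iterates over.
    barrier_every_iteration: if True, wait at the barrier after every batch;
        otherwise wait once after the loop (end of epoch).
    """
    world_size = len(batches_per_rank)
    barrier = threading.Barrier(world_size)  # stand-in for a collective barrier
    outcome = [None] * world_size

    def worker(rank):
        try:
            for _ in range(batches_per_rank[rank]):
                if barrier_every_iteration:
                    barrier.wait(timeout=timeout)
            if not barrier_every_iteration:
                barrier.wait(timeout=timeout)
            outcome[rank] = "finished"
        except threading.BrokenBarrierError:
            # No partner ever arrived: this is the rank that would hang.
            outcome[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    # Rank 0 has one more batch than the others (uneven split, no padding).
    print(run_ranks([4, 3, 3], barrier_every_iteration=True))
    print(run_ranks([4, 3, 3], barrier_every_iteration=False))
```

In this simulation, a per-iteration barrier leaves the rank with the extra batch waiting for partners that have already left the loop, while a single barrier after the loop is reached exactly once by every rank.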
How can I use DistributedEvalSampler when I have to use dist.all_gather() to collect results? Many thanks!
@DaoD In train.py, I compute the loss/metrics from the outputs inside the data loop with self.criterion(output, target). Outside the loop, I call self.criterion.normalize(), which is defined with dist.all_reduce inside. If you want to call
@SeungjunNah Thanks for your reply! I will try to use all_gather outside of the data loop.
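As a rough sketch of the "collect outside the data loop" idea: each rank could append its results to a local list during the loop and exchange the lists once afterwards, for example with `torch.distributed.all_gather_object`, which accepts picklable Python objects. The helper below, `merge_rank_results`, is hypothetical and only covers the step after the exchange. It assumes, without shuffling or padding, that rank r evaluated samples r, r + W, r + 2W, ... for W ranks; that assumption should be checked against the sampler actually in use.

```python
def merge_rank_results(per_rank):
    """Interleave per-rank result lists back into dataset order.

    per_rank: list of lists, one per rank, possibly of different lengths
        (e.g. what every rank holds after a single gather outside the loop).
    Assumption: rank r holds results for samples r, r + W, r + 2W, ...
    """
    world_size = len(per_rank)
    total = sum(len(results) for results in per_rank)
    merged = [None] * total
    for rank, results in enumerate(per_rank):
        for i, value in enumerate(results):
            merged[rank + i * world_size] = value
    return merged


if __name__ == "__main__":
    # Hypothetical gathered results for 10 samples on 3 ranks.
    gathered = [
        ["s0", "s3", "s6", "s9"],  # rank 0
        ["s1", "s4", "s7"],        # rank 1
        ["s2", "s5", "s8"],        # rank 2
    ]
    print(merge_rank_results(gathered))
```

Because the exchange happens once per rank after the loop, every rank takes part in the same number of collective calls even when the per-rank sample counts differ.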
Dear author,
First of all, thank you for your great work!
I am trying to use your implementation of DistributedEvalSampler for evaluation, jointly with DDP (with shuffle=False and without calling set_epoch(); after DistributedEvalSampler yields the test samples for evaluating the model, my program should finish).
At the end of the script, my program hangs with 100% GPU utilization on 2 of the 3 GPUs (the last device alone terminates with no errors).
When I replace it with DistributedSampler, this does not occur.
I suspected the cause was that logging (e.g., Wandb) happens on the rank 0 device, but that is not the root cause, as the hang still occurs when I turn off the logging tool.
I wonder if you could point out conditions that I missed, please?
Thank you in advance.
Best,
Adam
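One plausible reason the two samplers behave differently here is the number of samples each rank receives. The sketch below compares the counts under two assumptions: that an evaluation sampler splits indices with a stride and no padding, and that DistributedSampler with drop_last=False pads the index list so every rank gets the same count. The function names and the dataset size are hypothetical.

```python
import math


def eval_sampler_counts(dataset_len, world_size):
    """Per-rank sample counts for a strided split with no padding (assumed)."""
    return [len(range(rank, dataset_len, world_size)) for rank in range(world_size)]


def padded_sampler_counts(dataset_len, world_size):
    """Per-rank sample counts when indices are padded to an even split."""
    per_rank = math.ceil(dataset_len / world_size)
    return [per_rank] * world_size


if __name__ == "__main__":
    # Hypothetical test set of 10 samples on 3 GPUs.
    print(eval_sampler_counts(10, 3))
    print(padded_sampler_counts(10, 3))
```

When the counts differ across ranks, any collective operation called once per iteration is entered a different number of times on different ranks, which can leave some processes waiting at the end of the script.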