DistributedEvalSampler hangs at the end of the script when using DDP #44

Closed
vaseline555 opened this issue Apr 26, 2022 · 5 comments

@vaseline555

Dear author,
Thank you at first for your great work!

I am trying to use your implementation of DistributedEvalSampler for evaluation, jointly with DDP.
(with shuffle=False and without calling set_epoch(); after using DistributedEvalSampler to yield test samples for evaluating a model, my program should simply finish)

At the end of the script, my program hangs at 100% GPU utilization on 2 of the 3 GPUs.
(the last device terminates alone with no errors)
When DistributedEvalSampler is replaced with DistributedSampler, this does not occur.

I suspected it was because the logging (e.g., wandb) happens on the rank 0 device,
but that is not the root cause, as the hang still occurs when I turn off the logging tool.

Could you please point out any conditions I may have missed?
Thank you in advance.

Best,
Adam

@SeungjunNah
Copy link
Owner

Hi @vaseline555,

  1. Is your dataset size divisible by the number of GPUs?
    If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.

  2. Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation?
    DistributedEvalSampler does not require any communication between processes, and I don't think it is the source of the hang.
    However, if you are using other synchronization-based operations, they may expect the same dataset length per process.
    For example, if your total dataset size is 5 and you are using 3 processes, GPUs 0 and 1 will be processing the 2nd item while GPU 2 is already done after the 1st iteration (see the sketch below).
    If you are using a synchronization-based operation, GPUs 0 and 1 will be waiting for a response from GPU 2 that will never come.
    When I need to do backpropagation at test time for each item, I turn off synchronization:

self.model.model.G.require_backward_grad_sync = False   # compute without DDP sync
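
To make the uneven split concrete, here is a minimal sketch of how 5 indices are divided over 3 ranks without padding, versus the padded split a training sampler would produce (illustrative helper names only, not the actual DistributedEvalSampler code):

import math

def eval_split(num_samples, num_replicas):
    # each rank takes a strided slice of the indices; nothing is appended
    indices = list(range(num_samples))
    return [indices[rank:num_samples:num_replicas] for rank in range(num_replicas)]

def train_split(num_samples, num_replicas):
    # pad by repeating samples so that every rank gets the same length
    per_replica = math.ceil(num_samples / num_replicas)
    total_size = per_replica * num_replicas
    indices = list(range(num_samples))
    indices += indices[: total_size - num_samples]
    return [indices[rank:total_size:num_replicas] for rank in range(num_replicas)]

print(eval_split(5, 3))   # [[0, 3], [1, 4], [2]]   -> rank 2 runs one iteration fewer
print(train_split(5, 3))  # [[0, 3], [1, 4], [2, 0]] -> equal lengths, but sample 0 is evaluated twice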

Best,
Seungjun

@vaseline555
Author

Dear @SeungjunNah,

Thank you for your detailed answers.
As you presumed, it is exactly case 2 that I am facing: uneven inputs are provided across the ranks.

Although there is no typical synchronization operation like backward(), apart from .item() or .detach().cpu(),
the main problem is where I called torch.distributed.barrier()...

I called it at the end of every iteration, not at the end of every epoch.
Thus, when the rank with fewer inputs runs out of samples (and therefore runs fewer iterations than the others),
it exits the evaluation loop earlier, and the other ranks hang at the barrier...

I fixed it by moving the barrier to another position (i.e., the end of the epoch), and now things are going well.
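
For reference, a minimal sketch of the wrong vs. fixed barrier placement (just the pattern, not my actual evaluation script):

import torch.distributed as dist

def evaluate(model, loader):
    results = []
    for batch in loader:
        results.append(model(batch).detach().cpu())
        # WRONG: with uneven per-rank lengths, the rank with fewer batches
        # leaves the loop early and the remaining ranks wait here forever.
        # dist.barrier()
    # FIXED: synchronize once, after every rank has finished its own loop.
    dist.barrier()
    return results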
While Googling, I found that many people have trouble handling uneven inputs when using DDP
(FYI: pytorch/pytorch#38174; Lightning-AI/pytorch-lightning#3325; pytorch/pytorch#72423). Even though I tried using the DDP.join() context manager, yours finally worked as a solution. 👍
I would like to thank you again for sharing your implementation of DistributedEvalSampler.

Have a nice day!
Thank you.

Sincerely,
Adam

@DaoD

DaoD commented Jul 6, 2022

How can I use DistributedEvalSampler when I have to use dist.all_gather() to collect results? Many thx!

@SeungjunNah
Owner

@DaoD
I don't know where you want to call all_gather, but I do all_reduce outside the loop.
In my case, all processes are independent, and the communication is done after the loop to collect loss/metric statistics.

In train.py, I compute loss/metrics from the outputs here.

self.criterion(output, target)

Outside the loop, here, I call

self.criterion.normalize()

which is defined here with dist.all_reduce inside.

If you call all_gather inside the for loop, I think it will hang.
But then, that would be a case where you need all processes to work together, and that's not an expected use case of DistributedEvalSampler.
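
Roughly, the pattern looks like this (a sketch with illustrative names, not the repository's actual train.py):

import torch
import torch.distributed as dist

def reduce_mean_loss(per_rank_losses, device):
    # average a per-rank list of scalar losses across all DDP processes;
    # communication happens only here, after each rank finished its own loop
    loss_sum = torch.tensor([sum(per_rank_losses)], device=device)
    count = torch.tensor([float(len(per_rank_losses))], device=device)
    dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return (loss_sum / count).item()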

@DaoD

DaoD commented Jul 7, 2022

@SeungjunNah Thanks for your reply! I will try to use all_gather outside the data loop.
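
For anyone who finds this later, the sketch below is what I plan to try, assuming torch.distributed.all_gather_object (which can exchange per-rank lists of different lengths); it is untested against this repository:

import torch.distributed as dist

def gather_results(local_results):
    # collect each rank's (possibly differently sized) result list on every rank
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_results)
    return [item for rank_list in gathered for item in rank_list]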
