-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dist barrier instead of sleep with multi gpu test #666
Conversation
Thanks for your contribution, we will test it later. |
@@ -99,7 +99,7 @@ def multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=False): | |||
' Since tmpdir will be deleted after testing,', | |||
' please make sure you specify an empty one.')) | |||
prog_bar = mmcv.ProgressBar(len(dataset)) | |||
time.sleep(2) # This line can prevent deadlock problem in some cases. | |||
dist.barrier() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, I have checked the problem.
Here, the sleep(2)
is a method to prevent deadlock during switching dataloader. And according to our experience, dist.barrier
is not enough to solve it, which refers to open-mmlab/mmcv#1640
However, your problem is not about the deadlock problem. It's caused by the directory checking in the rank 0. And the dist.barrier
is necessary there.
in conclusion , it's better to reserve both lines.
dist.barrier() | |
time.sleep(2) | |
dist.barrier() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. I would update the PR later. thank you.
add time.sleep before dist.barrier
merge upstream
I have updated. please have a look. |
Codecov Report
@@ Coverage Diff @@
## master #666 +/- ##
==========================================
- Coverage 83.15% 83.14% -0.02%
==========================================
Files 126 126
Lines 7630 7631 +1
Branches 1332 1332
==========================================
Hits 6345 6345
- Misses 1095 1096 +1
Partials 190 190
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
…lab#666) * dist barrier instead of sleep * Update test.py add time.sleep before dist.barrier
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
I found out in my case
sleep(2)
is not enough,tmpdir
would be created before checking the existence.the safe way would be
dist.barrier
.Modification
change
time.sleep(2)
todist.barrier
BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
Checklist
Before PR:
After PR: