dist barrier instead of sleep with multi gpu test #666

Merged
merged 3 commits into from
Mar 4, 2022

Conversation

twmht
Contributor

@twmht twmht commented Jan 20, 2022

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

In my case, sleep(2) is not enough: tmpdir can already be created before rank 0 checks whether it exists.

The safe way is to use dist.barrier.

Modification

Change time.sleep(2) to dist.barrier().
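
A minimal sketch of why a barrier is safer than a fixed sleep for this check. It is only an illustration under stated assumptions: threads emulate the ranks, `threading.Barrier` stands in for `dist.barrier()`, and the names `run_ranks` and `worker` are hypothetical. It also assumes that a later stage (such as result collection) is what creates tmpdir.

```python
import os
import tempfile
import threading


def run_ranks(world_size=4):
    """Emulate ranks with threads; threading.Barrier stands in for dist.barrier()."""
    barrier = threading.Barrier(world_size)
    errors = []
    with tempfile.TemporaryDirectory() as root:
        # Path that must not exist yet when rank 0 performs its check.
        tmpdir = os.path.join(root, "collect")

        def worker(rank):
            if rank == 0:
                # Rank 0 validates tmpdir before any rank is allowed to create it.
                if os.path.exists(tmpdir):
                    errors.append(f"The tmpdir {tmpdir} already exists.")
            # No rank passes this point until every rank has reached it,
            # which includes rank 0 after it has finished the check.
            barrier.wait()
            # Assumed later stage: every rank may create the directory.
            os.makedirs(tmpdir, exist_ok=True)

        threads = [
            threading.Thread(target=worker, args=(rank,))
            for rank in range(world_size)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    return errors


if __name__ == "__main__":
    print(run_ranks())
```

With a sleep in place of the barrier, the outcome depends on how fast each rank runs; the barrier makes the ordering independent of timing.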

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

@CLAassistant

CLAassistant commented Jan 20, 2022

CLA assistant check
All committers have signed the CLA.

@mzr1996
Member

mzr1996 commented Jan 20, 2022

Thanks for your contribution, we will test it later.

@@ -99,7 +99,7 @@ def multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=False):
                            ' Since tmpdir will be deleted after testing,',
                            ' please make sure you specify an empty one.'))
         prog_bar = mmcv.ProgressBar(len(dataset))
-    time.sleep(2)  # This line can prevent deadlock problem in some cases.
+    dist.barrier()
Member

@mzr1996 mzr1996 Jan 28, 2022

Hello, I have looked into the problem.
Here, sleep(2) is a workaround to prevent a deadlock when switching dataloaders. In our experience, dist.barrier alone is not enough to solve that; see open-mmlab/mmcv#1640.
However, your problem is not the deadlock. It is caused by the directory check on rank 0, and dist.barrier is necessary there.
In conclusion, it's better to keep both lines.

Suggested change
- dist.barrier()
+ time.sleep(2)
+ dist.barrier()
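
The ordering in this suggestion could be sketched as a small helper. This is an illustration, not the mmcls implementation: `sync_before_test` is a hypothetical name, the existence check is an assumption reconstructed from the error message in the hunk, and the `barrier` and `sleep` callables are injected so the sketch runs without an initialized process group. In real use `barrier` would be `torch.distributed.barrier`.

```python
import os
import time


def sync_before_test(rank, tmpdir, gpu_collect, barrier, sleep=time.sleep):
    """Hypothetical helper: rank 0 checks tmpdir, then all ranks sleep and sync."""
    if rank == 0:
        # Assumed check: tmpdir only matters when results are collected on CPU.
        if (not gpu_collect) and tmpdir is not None and os.path.exists(tmpdir):
            raise OSError(
                f"The tmpdir {tmpdir} already exists. Since tmpdir will be "
                "deleted after testing, please make sure you specify an "
                "empty one."
            )
    # Kept from the original code as the workaround for the dataloader deadlock.
    sleep(2)
    # Added so no rank proceeds before rank 0 has finished the check above.
    barrier()
```

Keeping the two calls separate reflects that they address different problems: the sleep targets the deadlock, and the barrier targets the directory check.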

Contributor Author

Cool. I will update the PR later. Thank you.

add time.sleep before dist.barrier
@twmht
Contributor Author

twmht commented Feb 25, 2022

@mzr1996

I have updated the PR. Please have a look.

@codecov

codecov bot commented Mar 4, 2022

Codecov Report

Merging #666 (725a6ac) into master (a7f8e96) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #666      +/-   ##
==========================================
- Coverage   83.15%   83.14%   -0.02%     
==========================================
  Files         126      126              
  Lines        7630     7631       +1     
  Branches     1332     1332              
==========================================
  Hits         6345     6345              
- Misses       1095     1096       +1     
  Partials      190      190              
Flag         Coverage Δ
unittests    83.14% <0.00%> (-0.02%) ⬇️

Impacted Files        Coverage Δ
mmcls/apis/test.py    23.93% <0.00%> (-0.21%) ⬇️


@mzr1996 mzr1996 merged commit 701b426 into open-mmlab:master Mar 4, 2022
mzr1996 pushed a commit to mzr1996/mmpretrain that referenced this pull request Nov 24, 2022
…lab#666)

* dist barrier instead of sleep

* Update test.py

add time.sleep before dist.barrier

3 participants