Run CI unittests in parallel #3445
Conversation
This achieves the following speed-ups
While the macOS and Windows tests are running much faster with parallel tests, the Linux tests for Python 3.6–3.8 are much slower. Since the tests for Linux and Python 3.9 are also a lot faster, I suspect that job is skipping some tests that slow down the overall execution significantly. I'll investigate.
By fixing …
Codecov Report

```diff
@@           Coverage Diff           @@
##           master    #3445   +/-  ##
=======================================
  Coverage   76.00%   76.00%
=======================================
  Files         105      105
  Lines        9697     9697
  Branches     1556     1556
=======================================
  Hits         7370     7370
  Misses       1841     1841
  Partials      486      486
=======================================
```

Continue to review full report at Codecov.
|
That was not the problem. The slowdown occurred only in tests that made use of intra-op parallelism. To fix this we can simply limit the number of intra-op threads.
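A minimal sketch of that kind of fix, assuming the usual thread-count environment variables (`OMP_NUM_THREADS` and friends — these are common conventions, not taken verbatim from this PR; in torch one could equivalently call `torch.set_num_threads`):

```python
import os


def limit_intra_op_threads(n: int = 1) -> None:
    """Cap common intra-op thread pools so parallel test workers
    don't over-subscribe the machine's cores.

    Must run before the heavy libraries create their thread pools,
    i.e. before importing torch/numpy in the test process.
    """
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)


# With N xdist workers each pinned to 1 intra-op thread, total thread
# count stays roughly at N instead of N * num_cores.
limit_intra_op_threads(1)
```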
Changes look great to me, and the CI speedups are amazing!
I'd love to get @seemethere's eyes on this as well, in case I'm missing something.
I used to do this in torchaudio, but removed it. The reason is that when one of the processes behaved abnormally (segfault, hanging, etc.) there was no log I could look at from the browser, and I had to first disable xdist to debug what was going on, which was more time consuming. So be prepared if you proceed with this.
The thread over-subscription issue observed in #3445 (comment) is quite typical and it's likely to happen in other places. FWIW, we use a similar workaround elsewhere.
Can you find an old CI run where this came up? I currently can't really picture the problem.
It's been months so I cannot find the log, but here is the question: while the tests are being executed, can you see which tests are currently running? In the CI logs of this PR I see which tests have passed/skipped/failed, but the question is what is shown while a test is being run. Even if the log is updated as the test runner moves on, if it only shows the completed tests, then you do not know which test is being executed right now. In that case, if any test hangs, you cannot see from the log which one it is. Eventually the CI system will time out and kill these jobs, but the resulting log won't show which test caused the timeout, and whoever debugs it has to start by disabling xdist. If the log shows which test exhibited abnormal behavior/termination, that's good, but if not, it will be hard for other maintainers to look into the cause.
We may need to increase the `no-output-timeout` for conda builds, since the conda dependency resolver is extremely slow.
@mthrok There is a pytest plugin for that. See this for a sample output with tests that time out. CircleCI also seems to recognize it. @fmassa Given the valid concerns of @mthrok, I would additionally add a timeout.
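A sketch of how the two pieces discussed here could be combined in a pytest config (`-n` comes from pytest-xdist, `--timeout` from pytest-timeout; the 300-second value matches the "set timeout to 5 minutes" step in this PR's commit log, the rest is illustrative):

```ini
# Hypothetical pytest.ini sketch, not the actual config from this PR.
[pytest]
addopts =
    # spawn one worker per available CPU core
    -n auto
    # kill a test after 5 minutes so a hang names the culprit in the log
    --timeout=300
```

With a timeout in place, a hanging test is terminated and reported by name instead of silently stalling the job until the CI system kills it.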
LGTM, awesome changes @pmeier.
Only a couple of notes from me for future reference:
- Running tests in parallel will cause them to execute in unpredictable order. Given that we don't set the seed at the beginning of every test/class, we might see some flakiness in the future. We've observed similar issues in the past, but I don't think that's a reason not to merge this.
- In follow-up PRs it might be worth adding more control over which classes/tests should be parallelized and which should not.
You may consider switching to mamba if it's getting to the "this takes minutes" level? It can be many times faster in the resolve phase.
So you are saying that some tests are not independent of the others? If that is the case we should fix this ASAP. Since you mentioned seeding, do we have tests that rely on a specific random seed? If so, doesn't this mean that either our method of testing is not adequate or our code actually contains bugs that happen for some inputs?
What would be a reason to run a test not in parallel? The only thing I can think of are GPU tests that overflow the GPU memory. This is why I excluded GPU tests from the parallelization for now. Other than that I can't think of another reason.
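One simple way to get that kind of control is to partition tests into a parallel and a serial bucket before dispatching them. A purely illustrative sketch (the helper and the naming convention are hypothetical, not from this PR):

```python
def partition_tests(test_names):
    """Split test names into (parallel, serial) buckets.

    Tests whose names mention a GPU device are kept serial so that
    running them concurrently cannot exhaust GPU memory; everything
    else is safe to hand to parallel workers.
    """
    parallel, serial = [], []
    for name in test_names:
        bucket = serial if ("cuda" in name or "gpu" in name) else parallel
        bucket.append(name)
    return parallel, serial


# CPU-only tests land in the parallel bucket, CUDA tests stay serial.
parallel, serial = partition_tests(
    ["test_resize_cpu", "test_resnet_cuda", "test_crop_cpu"]
)
```

pytest-xdist itself also offers per-test grouping (e.g. its `xdist_group` mark with `--dist loadgroup`), which would be the more idiomatic route in a real setup.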
To follow up on what @rgommers said:
I'll work on that when this is merged. |
Let's give this a try
There is one failing test. This is most likely due to too-tight tolerances. I'll fix it in a follow-up PR. @datumbox is this one of the flaky tests you mentioned before?
Reverting this as some tests are now broken.
The tests are in principle independent, but some of them exhibit some flakiness, and running the tests in a different order can make us hit a "bad seed". See this old example: #3032 (comment)
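The order-dependence can be removed by re-seeding at the start of every test. A minimal sketch using only the stdlib (names are illustrative; in torchvision one would also seed `torch` and `numpy`):

```python
import random
import unittest


class SeededTestCase(unittest.TestCase):
    """Base class that re-seeds the RNG before every test, so results
    do not depend on the order in which tests happen to run."""

    SEED = 0

    def setUp(self):
        random.seed(self.SEED)


class TestExample(SeededTestCase):
    def test_draw_is_deterministic(self):
        first = random.random()
        # Re-seeding reproduces the exact same draw.
        random.seed(self.SEED)
        self.assertEqual(first, random.random())
```

Because every test starts from the same seed, shuffling the suite across xdist workers can no longer surface a "bad seed".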
Yes, that's what I had in mind as well. Parallelizing GPU tests for models should probably be avoided to prevent memory issues. On the other hand, running GPU tests related to transformations in parallel should be OK. Hence having some control over what's parallelized would be useful.
Summary:

* enable parallel tests
* disable parallelism for GPU tests
* [test] limit maximum processes on linux
* [debug] limit max processes even further
* [test] use subprocesses over threads
* [test] limit intra-op threads
* only limit intra op threads for CPU tests
* [poc] use low timeout for showcasing
* [poc] fix syntax
* set timeout to 5 minutes
* fix timeout on windows

Reviewed By: fmassa

Differential Revision: D26756257

fbshipit-source-id: f2fc4753a67a1505f01116119926eec365693ab9

Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Since our CI unittest machines are pretty beefy, we might be able to reduce the wall time significantly by running the tests in parallel.
Note that while this uses the `pytest-xdist` plugin, this does not make our tests dependent on `pytest`. They can still be run by `unittest`.