
Fixes [SUREFIRE-1516]: Poor performance in reuseForks=false #253

Closed · wants to merge 3 commits

Conversation

jon-bell (Contributor)

Hi,
This PR resolves the performance bug reported in SUREFIRE-1516, which appears when using the reuseForks=false configuration. The root cause of the observed slowdown is forked JVM teardown time.

The issue is that the forked JVM keeps blocking on a read from the host JVM even after it has sent BYE_ACK. Threads blocked on a read may not be interruptible until they poll for interrupts (every 350 msec for stdin), which introduces significant latency for short-lived processes such as Surefire's forked JVMs, each of which runs just a single test class at a time. That 350 msec of overhead adds up on projects with thousands of test classes, each of which might take only a few hundred msec to run.
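The interrupt-polling behaviour can be sketched in plain Java. This is an illustration of the mechanism, not Surefire's actual I/O code: the class and method names are invented, and the 350 msec constant simply mirrors the stdin poll interval described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a blocking read that only honours interrupts
// at a fixed polling interval, the way a console stdin read does.
public class PollingRead {
    static final long POLL_MS = 350; // approximate stdin interrupt-poll interval

    // Returns the next byte, or -1 if the thread was interrupted.
    // An interrupt that arrives mid-sleep is only noticed at the next
    // poll, so teardown of a short-lived process can lose up to POLL_MS.
    static int readWithPolling(InputStream in) throws IOException {
        while (true) {
            if (Thread.currentThread().isInterrupted()) {
                return -1; // interrupt noticed only at a poll boundary
            }
            if (in.available() > 0) {
                return in.read();
            }
            try {
                TimeUnit.MILLISECONDS.sleep(POLL_MS);
            } catch (InterruptedException e) {
                return -1; // best case: interrupt lands during the sleep
            }
        }
    }
}
```

A JVM that is otherwise finished but still has a thread inside a loop like this cannot exit until the next poll fires, which is exactly the per-fork latency measured below.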

To measure the scope of the problem and confirm its resolution, I created a simple benchmark test suite, which consists of 100 JUnit test classes, each with a single test that calls Thread.sleep(250). I instrumented the JVM to record the time that each JVM starts, the time that the test starts (as measured by JUnit), the time that the test ends (as measured by JUnit), and the time that the JVM terminates. For comparison, I also did the same experiment with ant and gradle.

The table below shows the results, which represent the average time for each test (over the 100 samples):

| Configuration | Time to start forked JVM (msec) | Time to run test (msec) | Time to tear down forked JVM (msec) |
|---------------|--------------------------------|------------------------|-------------------------------------|
| ant 1.10.6    | 250.42 | 252.81 | 8.75   |
| gradle 5.6.1  | 394.91 | 253.12 | 16.9   |
| mvn (b97df5a) | 250.21 | 252.59 | 358.59 |
| mvn (2fbe44f) | 216.66 | 252.32 | 16.9   |

Overall, most build systems took a similar amount of time to spin up the JVM, and all took the expected 250 msec to run each test. The current master version of Surefire (b97df5a) takes an unusually long time to tear down the forked JVM: roughly 350 msec more than the others, which is exactly the interval at which the JVM polls for interrupts in a thread reading from stdin (explained in this fantastic Stack Overflow post). The fix is easy: after receiving the BYE_ACK message, the forked JVM can stop reading commands from the main Surefire process, since it is shutting down. With this fix, the overhead goes away (as shown in 2fbe44f).
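The shape of the fix can be sketched as follows. This is a minimal model, not Surefire's real command-reader code: `Command`, the queue-based channel, and `readLoop` are all invented for illustration. The point is only that once BYE_ACK is seen, the loop returns instead of issuing another blocking read that nothing will ever satisfy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the forked JVM's command loop after the fix:
// stop consuming commands once BYE_ACK has been received, so the
// process never parks on a read while it is trying to shut down.
public class CommandLoopSketch {
    enum Command { RUN_CLASS, NOOP, BYE_ACK }

    // Drains commands from the channel, stopping at BYE_ACK.
    static List<Command> readLoop(Queue<Command> channel) {
        List<Command> handled = new ArrayList<>();
        Command cmd;
        while ((cmd = channel.poll()) != null) {
            handled.add(cmd);
            if (cmd == Command.BYE_ACK) {
                break; // the fix: shut down instead of reading again
            }
        }
        return handled;
    }
}
```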

Tibor17 (Contributor) commented Nov 12, 2019

The build fails on macOS.
ForkModeMultiModuleIT.testForkCountTwoNoReuse:70->doTest:137 » SurefireVerifier
Can you pls have a look?
https://github.com/apache/maven-surefire/pull/253/checks?check_run_id=298330219

@@ -412,6 +412,10 @@ public void run()
CommandReader.this.wakeupIterator();
callListeners( command );
break;
case BYE_ACK:
Contributor (review comment):

This is not a very elegant solution.
In ForkedBooter a BYE_ACK listener is already registered (`commandReader.addByeAckListener( new CommandListener()`, see line 356). See the next comment.

Contributor (review comment):

One thing I do not understand is why this change causes an IT failure on macOS.

Contributor (review comment):

It would be great to reproduce it with several runs of the build.

jon-bell (Contributor, Author)

It looks like several of the tests are flaky; when I try to reproduce locally on macOS, I get a different set of tests failing. Unfortunately I do not have time to debug these tests this week.

jon-bell (Contributor, Author)

@Tibor17 can you trigger the macOS build to run again? I just ran the test suite locally again on my Mac and this time none of the integration tests failed. I think there is definitely a flaky test here (and the failure is not due to my change).

jon-bell (Contributor, Author)

@Tibor17 please note that all of the tests are now passing.

Tibor17 (Contributor) commented Dec 17, 2019

@jon-bell I have triggered the build again.

eolivelli (Contributor) left a comment:

Very nice trick. Thank you for providing this enhancement.

I hope we can commit this patch soon; I left my comment.

Tibor17 (Contributor) commented Dec 17, 2019

The build failed again https://github.com/apache/maven-surefire/runs/352389395
@jon-bell What happens if you push a new dummy commit several times an hour? Pls try and we will investigate them then. Thx

jon-bell (Contributor, Author)

@Tibor17 I made a dummy commit and it looks like it passed: 06c74da

Tibor17 (Contributor) commented Dec 18, 2019

@jon-bell
ok, let's continue with a new commit for #253 (comment)
and then i will create a branch to run the build on our Jenkins.

Tibor17 (Contributor) commented Dec 29, 2019

@jon-bell Thx for contributing. I used your changes, updated IT 705, and closed the Jira issue as fixed.
Can you pls perform new measurements with master, to see how far we are from the last measurement?

@Tibor17 Tibor17 closed this Dec 29, 2019
@Tibor17 Tibor17 reopened this Dec 29, 2019
@Tibor17 Tibor17 closed this Dec 29, 2019
jon-bell (Contributor, Author)

@Tibor17 - Thanks! The numbers from current master (5148b02) look pretty much the same on my side (down to 16.9msec to tear down the process).
