test: avoid test-cluster-master-kill flakiness #6531
Removed reliance on worker exit before arbitrary timeout. Instead of killing the parent process, destroy the IPC pipe for the same effect, and wait for the worker's exit before also exiting the parent. Ensuring these steps are well-ordered removes the need for timeouts and reduces intermittent failures. In case of an actual hang in the child's exit code, the test harness global timeout will kick in, and the test will still fail.
/* Cluster.disconnect() will exercise a different 'graceful' shutdown path.
   From the perspective of the worker, closing the channel is equivalent
   to the parent calling process.exit(0). */
worker.process._channel.close();
I don't object, but this does use undocumented APIs, and the last test didn't. That can be annoying when trying to rearrange things that are supposed to be internal.
Why not just change the check at the end to not do a single check after some randomly chosen time interval, but to run the check every half-second, indefinitely, until it passes or the runner kills this?
@sam-github That would be another way to do this - I'm not a huge fan of polling but it may be the lesser evil. I can do another version with your suggestion. Thanks.
@sam-github I've adjusted the test as you suggested to simply loop indefinitely.
Maybe it makes sense applying the same changes to
@santigimeno Thanks for pointing that out - I've updated
LGTM.
Duplicate of #5056?
var pollWorker = function() {
  alive = isAlive(pid);
  if (alive) {
    setTimeout(pollWorker, 500);
I think you can safely set the timeout value lower, so we don't slow down the test by up to half a second.
@ncopa Updated with 50ms polling interval instead of 500ms. Thanks.
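For readers following the thread, here is a minimal sketch of the polling approach discussed above. It is not the code that landed. It assumes `isAlive` is implemented with `process.kill(pid, 0)`, which probes for a process without delivering a signal. `waitForExit` and `demo` are hypothetical names introduced only for illustration.

```javascript
const { spawn } = require('child_process');

// Signal 0 only checks for existence; nothing is delivered to the target.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to someone else.
    // Anything else (typically ESRCH) is treated as "gone".
    return err.code === 'EPERM';
  }
}

// Hypothetical helper: resolves once `pid` is gone. It has no deadline of
// its own, so a hang is left for the test harness's global timeout to catch.
function waitForExit(pid, intervalMs = 50) {
  return new Promise((resolve) => {
    (function poll() {
      if (isAlive(pid)) {
        setTimeout(poll, intervalMs);
      } else {
        resolve();
      }
    })();
  });
}

// Hypothetical usage: start a short-lived child and poll until it is gone.
async function demo() {
  const child = spawn(process.execPath, ['-e', 'setTimeout(() => {}, 100)'], {
    stdio: 'ignore',
  });
  await waitForExit(child.pid);
  return isAlive(child.pid);
}
```

The short interval only bounds how long a passing run waits after the process disappears; it does not reintroduce a failure deadline inside the test.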
Nice analysis. I too am much more comfortable with polling than with just closing the IPC. I remember there have been issues in the past where the disconnect event did not fire when the parent died. But it's a long time ago, I could be wrong. LGTM
One more CI run with the last change: https://ci.nodejs.org/job/node-test-pull-request/2496/ @ncopa After the CI run I'll merge this PR and close #5056 as they are largely equivalent.
Removed reliance on worker exit before arbitrary timeout. Instead of failing the test after 200 or 1000 ms, wait indefinitely for child process exit. If the test hangs, the test harness global timeout will kick in and fail the test. Note that if the orphaned children are not reaped correctly (in the absence of init, e.g. Docker), the test will hang and the harness will fail it.
PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Landed as fc66e55.
I wonder if this will fix #6193 as well.
@stefanmb lts?
@thealphanerd I think this is OK for LTS. Thanks for reminding me.
Checklist
Affected core subsystem(s)
test, cluster
Description of change
I've observed that test-cluster-master-kill fails intermittently on an AIX 6.1 machine (oslevel 6100-07-08-1339) because the timeout expires before the worker terminates. A previous PR (nodejs/node-v0.x-archive#9431) arbitrarily increased the timeout for AIX to 1 second, but this value is still a guess and appears to be insufficient. Arbitrary timeouts have also caused problems on other platforms; see #2891 (comment). They cannot compensate for external factors such as system load.
In this PR I propose removing the timeout mechanism entirely. Here is how the test currently works:
(*) Without this mechanism the worker would become an orphan child of init.
Step 6 is inherently flaky. The test was originally added as part of nodejs/node-v0.x-archive@94d337e#diff-0faa53fc02580d5de2ebb484c41d691cR498 where it specifically tested the new disconnect pathway.
The test boils down to the following actions:
Since step 2 does not actually require step 1, I propose the following alternate flow:
With this setup there is no need for the arbitrary wait time. The obvious problem is that if the child never exits, the test will hang. However, in that case the test will still be killed by the test harness's global timeout.
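As a rough illustration of this idea (close the IPC channel, then wait for the child's exit with no timer in the test), here is a sketch under stated assumptions. It is not the cluster test itself: it uses a plain `child_process.spawn` child with an `'ipc'` stdio slot and the documented `subprocess.disconnect()` in place of `worker.process._channel.close()`. `CHILD_SOURCE` and `closeChannelAndWait` are hypothetical names.

```javascript
const { spawn } = require('child_process');

// Hypothetical child script: reports readiness, stays alive on a timer,
// and exits once the parent's end of the IPC channel goes away.
const CHILD_SOURCE = `
  setInterval(() => {}, 1000);
  process.on('disconnect', () => process.exit(0));
  process.send('ready');
`;

// Hypothetical helper: resolves with the child's exit status. There is
// deliberately no timeout here; a hang is left to the harness's global one.
function closeChannelAndWait() {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', CHILD_SOURCE], {
      stdio: ['ignore', 'ignore', 'ignore', 'ipc'],
    });
    child.on('error', reject);
    child.on('exit', (code, signal) => resolve({ code, signal }));
    // Close the channel only after the child has installed its handler.
    child.once('message', () => child.disconnect());
  });
}
```

Because the parent resolves only on the `'exit'` event, the steps are ordered by events rather than by a guessed delay.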
I do believe a timeout mechanism is useful for detecting liveness issues, but arbitrary timeouts in the tests themselves should be minimized. The single timeout in the test harness suffices.
Any comments are appreciated, especially from @AndreasMadsen.