test: fix race condition in unrefd interval test #3550
Conversation
Before this commit, test-timers-unrefd-interval-still-fire required a 1ms interval to fire 5 times before a 100ms timeout in order to pass, which could cause intermittent failures in CI and in issue #1781. This commit gives the test up to 5x as long to complete, while still allowing it to complete as quickly as before when possible.
I'm not too fond of this way of doing things. I suggest re-writing the test to be more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js. In particular, it is more reliable to depend on the order of timer execution rather than on arbitrary timeouts. A master timeout is, however, necessary to keep the process open. That being said, the last unrefed timeout should be able to tell when it has succeeded and cancel the master timeout.
i.e.
setTimeout(b, 2)
setTimeout(a, 1)
Actually, that exact order may not be correct for "regular" timers; in my test it applied to
I made some changes to look more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js, mainly so that the master timeout is cleared as soon as the last unrefd interval fires.
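For reference, a rough sketch of the structure being described, assembled from the snippets quoted in the review below; it is not necessarily the exact contents of the commit, and the names (nbIntervalFired, keepOpen, TEST_DURATION) are taken from those snippets:

```js
'use strict';
// Sketch only: an unrefd 1ms interval must still fire N times while a refd
// "master" timeout keeps the event loop open; the master timeout is cleared
// as soon as the last interval callback runs.
const common = require('../common');
const assert = require('assert');

const N = 5;
let nbIntervalFired = 0;

const TEST_DURATION = common.platformTimeout(500);

// Keeps the process alive; if it ever fires, the interval was too slow.
const keepOpen = setTimeout(() => {
  assert.strictEqual(nbIntervalFired, N);
}, TEST_DURATION);

const timer = setInterval(() => {
  if (++nbIntervalFired === N) {
    clearInterval(timer);
    clearTimeout(keepOpen); // done early: let the process exit right away
  }
}, 1);
timer.unref();
```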
var N = 5;
const TEST_DURATION = common.platformTimeout(500);
500 seems awfully large with scaling.
What number is reasonable here?
100, although really you could probably get away with something smaller. If this is taking more than 50 or so ms, you've got some big issues.
It seems like 100ms was causing the intermittent failure in CI in the first place, unless that can be handled in platformTimeout.
cc @mhdawson
Hmmm, looks like we didn't do that for AIX yet? https://github.com/nodejs/node/blob/master/test/common.js#L232-L240 In this case I suggest setting the timeout to 500 only if common.isAix is true.
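A minimal sketch of that suggestion (not the code that was ultimately landed), assuming the common.isAix flag and common.platformTimeout() helper from test/common.js referenced above:

```js
// Sketch: use the larger 500ms budget only on AIX, keep 100ms elsewhere.
const TEST_DURATION = common.isAix ? 500 : common.platformTimeout(100);
```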
@mhdawson what sort of specs does an AIX machine have? I am highly curious how 5 1ms timers could take more than 100ms.
I think I may have gotten it wrong and it's zLinux, as opposed to AIX. I think the main issue is that we have a good number of VMs sharing the same hardware. It's not that it fails all the time, only that we see intermittent failures. @thinkingdust can you talk to Joran and provide more details if you have them.
In terms of platform timeout I don't want to extend the length of all AIX tests because we see intermittent failures on a few tests.
The original failure was on zLinux, so a platform-specific change won't do.
LGTM, though I'd really like to know if @joransiu or anyone else has any insight as to how that could be happening.
I think the failure was caused by other processes running on the CI machine. I can replicate the failure with the bash script below. The failure rate can be controlled by changing the number of background processes.
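The script itself was not captured in this thread. Purely as an illustration of the described approach (a JavaScript sketch rather than the original bash script; the test path and process counts are assumptions), the idea is to saturate the CPUs with busy background processes and then run the test repeatedly:

```js
'use strict';
// Hypothetical reproduction sketch; not the original script from this thread.
const { spawn, spawnSync } = require('child_process');

const LOAD_PROCS = 8;   // roughly one busy loop per available CPU (assumed)
const RUNS = 100;

// Background load to simulate a busy CI machine.
const load = [];
for (let i = 0; i < LOAD_PROCS; i++)
  load.push(spawn(process.execPath, ['-e', 'while (true) {}']));

let failures = 0;
for (let i = 0; i < RUNS; i++) {
  // Path assumed from the test name in the PR description.
  const result = spawnSync(process.execPath,
    ['test/parallel/test-timers-unrefd-interval-still-fire.js']);
  if (result.status !== 0)
    failures++;
}

load.forEach((child) => child.kill());
console.log(`failures: ${failures}/${RUNS}`);
```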
@thinkingdust Any idea how many processor-level threads that machine has? Like, are you literally just clogging everything up so much that the timer doesn't get run?
I had suggested a few experiments with @thinkingdust to get a better gauge on various timeout values for our Linux on Z configuration. The machine we are running on is a guest Linux system running with 8 CPs on zEC12 @ 5.5GHz, so 8 processor-level threads are available. With relatively little load on the system, our very informal measurements:
Now, if we ran 8 instances in parallel... at a 100ms timeout, we observe a failure rate of about 1%. I'd say given these stats, 100ms is probably reasonable.
The keep-open timeout has been changed back to 100ms.
var timer = setInterval(function() {
…
const keepOpen = setTimeout(() => {
  assert.strictEqual(nbIntervalFired, N);
It's probably more informative to just throw like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js#L14
That should never fail this. Is it possible libuv isn't scheduling these efficiently on AIX?
LGTM minus the comment.
I made the change to throw the same error as https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js#L14. I think it was slightly useful to print nbIntervalFired to see if the interval callback ran at all. Is it a good idea to include that?
@thinkingdust sure.
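Something along these lines would match that discussion (a sketch only; the exact message in the landed test may differ), assuming the nbIntervalFired, N, and TEST_DURATION names from the snippets above:

```js
// Throw from the keep-open timeout instead of asserting, and include the
// counter so a failure shows whether the interval ever fired at all.
const keepOpen = setTimeout(() => {
  throw new Error(`Unrefd interval fired only ${nbIntervalFired} of ${N} ` +
                  'times before the keep-open timeout expired');
}, TEST_DURATION);
```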
@Fishrock123 Is this format okay?
Was going to land squashed as
@thinkingdust But I think I actually missed something, sorry; we should harden the test a bit more.
clearInterval(timer);
clearTimeout(keepOpen);
In here, can you set timer._onTimeout to a function that throws an error if it is called again? (That will ensure it does not get called more than 5 times.)
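A sketch of that hardening, under the assumption that it sits inside the interval callback next to the clearInterval/clearTimeout calls quoted above (timer._onTimeout is an internal property of Node's timer objects, so this is test-only code):

```js
// After the N-th firing: stop the interval, cancel the keep-open timeout,
// and make any further invocation of the interval's callback fatal.
if (++nbIntervalFired === N) {
  clearInterval(timer);
  clearTimeout(keepOpen);
  timer._onTimeout = () => {
    throw new Error('Unrefd interval fired after being cleared');
  };
}
```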
@Fishrock123 Any more suggestions for this PR?
Rely more on timers implementation rather than arbitrary timeouts.

Refs: nodejs#1781
PR-URL: nodejs#3550
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Thanks, landed in 6de82c6
(Squashed and updated description)
I don't think I understand why 6de82c6 would fix the race condition and make the test fail less often in the situation described in the first comment of this PR. Am I missing something?
Ugh, guess I should have updated the description. It doesn't fix it per se, but it does refactor the test to work in a more reliable fashion. Ultimately if your 5 1ms timers aren't firing in under like 25ms you've probably got serious hardware problems, or armv6.
One way to fix the flakiness of this test would be to rely on the test suite's per-test timeout (which is currently set to 60 seconds). Instead of adding a timer to hold the loop open (which would require us to hardcode this 60-second test suite timeout somewhere else, or refactor the test suite driver so that it can communicate that value to JS tests), we could add another type of handle that holds the loop open indefinitely. When the fifth timer's callback is called, we can close that handle and the test would exit as soon as possible in most cases, or the test suite would make it time out in case it failed. But we would get rid of almost all false positives.
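One possible reading of that suggestion, sketched with a listening socket as a stand-in for "another type of handle" (this is an illustration, not code proposed in the PR; the counter and N are carried over from the test):

```js
const net = require('net');

const N = 5;
let nbIntervalFired = 0;

// A refed, non-timer handle keeps the event loop open indefinitely; the test
// runner's own 60-second per-test timeout becomes the only backstop.
const keepOpen = net.createServer().listen(0);

const timer = setInterval(() => {
  if (++nbIntervalFired === N) {
    clearInterval(timer);
    keepOpen.close();  // loop can drain now; no hardcoded expiry needed
  }
}, 1);
timer.unref();
```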
That's what happens now. If 5 1ms timers don't fire within 100ms, there are issues; this isn't just flakiness. As it stands, this patch appears to have corrected the original issue.
It seems that what happens now is that the test will fail if the 5 timers' callbacks haven't run within 100ms. According to @mhdawson and @joransiu it is possible that, on systems with high load, 5 timers' callbacks take longer than 100ms to run, which seems entirely plausible. That doesn't mean that unrefed timers are broken, but it still seems that this change would make the test fail in these situations. That would make the test flaky.

What I was describing is a way to give much more time for these timers to fire, not just 100ms, without making the best cases take longer, and without hardcoding the expiry delay for the handle that keeps the loop open. I believe that would make the test non-flaky.
Do you mean that it fixes #1781, or that it fixes the problems reported by @mhdawson, or both? If I'm not missing anything, 6de82c6 has been pushed today, and #1781 was closed on July 28th, so I don't know how we can determine that this change fixed that issue.