test: fix race condition in unrefd interval test #3550
Conversation
Before this commit, test-timers-unrefd-interval-still-fire required a 1ms interval to fire 5 times before a 100ms timeout in order to pass, which could cause intermittent failures in CI and in issue #1781. This commit gives the test up to 5x as long to complete, while still allowing it to complete as quickly as before when possible.
I'm not too fond of this way of doing things. I suggest re-writing the test to be more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js. In particular, it is more reliable to depend on the order of timer execution rather than on arbitrary timeouts. A master timeout is, however, necessary to keep the process open. That being said, the last unrefed timeout should be able to tell when it has succeeded and cancel the master timeout.
i.e.
setTimeout(b, 2)
setTimeout(a, 1)
Actually, that exact order may not be correct for "regular" timers; in my test it applied to
I made some changes to look more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js, mainly so that the master timeout is cleared as soon as the last unrefd interval fires.
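For reference, a rough sketch of the structure being described, assembled from the snippets quoted in the review below; it is not necessarily the exact contents of the commit, and the names (nbIntervalFired, keepOpen, TEST_DURATION) are taken from those snippets:

```js
'use strict';
// Sketch only: an unrefd 1ms interval must still fire N times while a refd
// "master" timeout keeps the event loop open; the master timeout is cleared
// as soon as the last interval callback runs.
const common = require('../common');
const assert = require('assert');

const N = 5;
let nbIntervalFired = 0;

const TEST_DURATION = common.platformTimeout(500);

// Keeps the process alive; if it ever fires, the interval was too slow.
const keepOpen = setTimeout(() => {
  assert.strictEqual(nbIntervalFired, N);
}, TEST_DURATION);

const timer = setInterval(() => {
  if (++nbIntervalFired === N) {
    clearInterval(timer);
    clearTimeout(keepOpen); // done early: let the process exit right away
  }
}, 1);
timer.unref();
```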
var N = 5;
const TEST_DURATION = common.platformTimeout(500);
500 seems awfully large with scaling.
What number is reasonable here?
100, although really you could probably get away with something smaller. If this is taking more than 50 or so ms, you've got some big issues.
It seems like 100ms was causing the intermittent failure in CI in the first place, unless that can be handled in platformTimeout.
cc @mhdawson
Hmmm, looks like we didn't do that for AIX yet? https://github.com/nodejs/node/blob/master/test/common.js#L232-L240 In this case I suggest setting the timeout to 500 only if common.isAix is true.
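A minimal sketch of that suggestion (not the code that was ultimately landed), assuming the common.isAix flag and common.platformTimeout() helper from test/common.js referenced above:

```js
// Sketch: use the larger 500ms budget only on AIX, keep 100ms elsewhere.
const TEST_DURATION = common.isAix ? 500 : common.platformTimeout(100);
```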
@mhdawson what sort of specs does an AIX machine have? I am highly curious how 5 1ms timers could take more than 100ms.
I think I may have gotten it wrong and it's zLinux, as opposed to AIX. I think the main issue is that we have a good number of VMs sharing the same hardware. It's not that it fails all the time, only that we see intermittent failures. @thinkingdust can you talk to Joran and provide more details if you have them.
In terms of platform timeout I don't want to extend the length of all AIX tests because we see intermittent failures on a few tests.
The original failure was on zLinux, so a platform-specific change won't do.
LGTM, though I'd really like to know if @joransiu or anyone else has any insight as to how that could be happening.
I think the failure was caused by other processes running on the CI machine. I can replicate the failure with the bash script below. The failure rate can be controlled by changing the number of background processes.
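The script itself was not captured in this thread. Purely as an illustration of the described approach (a JavaScript sketch rather than the original bash script; the test path and process counts are assumptions), the idea is to saturate the CPUs with busy background processes and then run the test repeatedly:

```js
'use strict';
// Hypothetical reproduction sketch; not the original script from this thread.
const { spawn, spawnSync } = require('child_process');

const LOAD_PROCS = 8;   // roughly one busy loop per available CPU (assumed)
const RUNS = 100;

// Background load to simulate a busy CI machine.
const load = [];
for (let i = 0; i < LOAD_PROCS; i++)
  load.push(spawn(process.execPath, ['-e', 'while (true) {}']));

let failures = 0;
for (let i = 0; i < RUNS; i++) {
  // Path assumed from the test name in the PR description.
  const result = spawnSync(process.execPath,
    ['test/parallel/test-timers-unrefd-interval-still-fire.js']);
  if (result.status !== 0)
    failures++;
}

load.forEach((child) => child.kill());
console.log(`failures: ${failures}/${RUNS}`);
```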
@thinkingdust Any idea how many processor-level threads that machine has? Like, are you literally just clogging everything up so much that the timer doesn't get run?
I had suggested a few experiments with @thinkingdust to get a better gauge on various timeout values for our Linux on Z configuration. The machine we are running on is a guest Linux system running with 8 CPs on zEC12 @ 5.5GHz, so 8 processor-level threads are available. With relatively little load on the system, our very informal measurements:
Now, if we ran 8 instances in parallel... at a 100ms timeout, we observe a failure rate of about 1%. I'd say given these stats, 100ms is probably reasonable.
The keep-open timeout has been changed back to 100ms.
var timer = setInterval(function() {
…
const keepOpen = setTimeout(() => {
  assert.strictEqual(nbIntervalFired, N);
It's probably more informative to just throw like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js#L14
That should never fail this. Is it possible libuv isn't scheduling these efficiently on AIX?
LGTM minus the comment.
I made the change to throw the same error as https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js#L14. I think it was slightly useful to print nbIntervalFired to see if the interval callback ran at all. Is it a good idea to include that?
@thinkingdust sure.
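Something along these lines would match that discussion (a sketch only; the exact message in the landed test may differ), assuming the nbIntervalFired, N, and TEST_DURATION names from the snippets above:

```js
// Throw from the keep-open timeout instead of asserting, and include the
// counter so a failure shows whether the interval ever fired at all.
const keepOpen = setTimeout(() => {
  throw new Error(`Unrefd interval fired only ${nbIntervalFired} of ${N} ` +
                  'times before the keep-open timeout expired');
}, TEST_DURATION);
```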
@Fishrock123 Is this format okay?
Was going to land squashed as
@thinkingdust But I think I actually missed something, sorry; we should harden the test a bit more.
clearInterval(timer);
clearTimeout(keepOpen);
In here, can you set timer._onTimeout to a function that throws an error if it is called again? (That will ensure it does not get called more than 5 times.)
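A sketch of that hardening, under the assumption that it sits inside the interval callback next to the clearInterval/clearTimeout calls quoted above (timer._onTimeout is an internal property of Node's timer objects, so this is test-only code):

```js
// After the N-th firing: stop the interval, cancel the keep-open timeout,
// and make any further invocation of the interval's callback fatal.
if (++nbIntervalFired === N) {
  clearInterval(timer);
  clearTimeout(keepOpen);
  timer._onTimeout = () => {
    throw new Error('Unrefd interval fired after being cleared');
  };
}
```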
@Fishrock123 Any more suggestions for this PR?
Rely more on timers implementation rather than arbitrary timeouts.

Refs: nodejs#1781
PR-URL: nodejs#3550
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Thanks, landed in 6de82c6
(Squashed and updated description)
I don't think I understand why 6de82c6 would fix the race condition and make the test fail less often in the situation described in the first comment of this PR. Am I missing something?
Ugh, guess I should have updated the description. It doesn't fix it per se, but it does refactor the test to work in a more reliable fashion. Ultimately if your 5 1ms timers aren't firing in under like 25ms you've probably got serious hardware problems, or armv6.
One way to fix the flakiness of this test would be to rely on the test suite's per-test timeout (which is currently set to 60 seconds). Instead of adding a timer to hold the loop open (which would require us to hardcode this 60-second test suite timeout somewhere else, or refactor the test suite driver so that it can communicate that value to JS tests), we could add another type of handle that holds the loop open indefinitely. When the fifth timer's callback is called, we can close that handle and the test would exit as soon as possible in most cases, or the test suite would make it time out in case it failed. But we would get rid of almost all false positives.
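One possible reading of that suggestion, sketched with a listening socket as a stand-in for "another type of handle" (this is an illustration, not code proposed in the PR; the counter and N are carried over from the test):

```js
const net = require('net');

const N = 5;
let nbIntervalFired = 0;

// A refed, non-timer handle keeps the event loop open indefinitely; the test
// runner's own 60-second per-test timeout becomes the only backstop.
const keepOpen = net.createServer().listen(0);

const timer = setInterval(() => {
  if (++nbIntervalFired === N) {
    clearInterval(timer);
    keepOpen.close();  // loop can drain now; no hardcoded expiry needed
  }
}, 1);
timer.unref();
```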
That's what happens now. If 5 1ms timers don't fire within 100ms, there are issues; this isn't just flakiness. As it stands, this patch appears to have corrected the original issue.
It seems that what happens now is that the test will fail if the 5 timers' callbacks haven't run within 100ms. According to @mhdawson and @joransiu it is possible that, on systems with high load, 5 timers' callbacks take longer than 100ms to run, which seems entirely plausible. That doesn't mean that unrefed timers are broken, but it still seems that this change would make the test fail in these situations. That would make the test flaky.

What I was describing is a way to give much more time for these timers to fire, not just 100ms, without making the best cases take longer, and without hardcoding the expiry delay for the handle that keeps the loop open. I believe that would make the test non-flaky.
Do you mean that it fixes #1781, or that it fixes the problems reported by @mhdawson, or both? If I'm not missing anything, 6de82c6 has been pushed today, and #1781 was closed on July 28th, so I don't know how we can determine that this change fixed that issue.