cluster: fix broken/hanging tests #1934

Olegas · 2015-06-10T07:21:42Z

Fixes: #1933

Olegas · 2015-06-10T08:37:28Z

sam-github · 2015-06-10T13:54:45Z

test/parallel/test-cluster-worker-wait-server-close.js

-      setTimeout(function() {
-        socket.write('.');
-        connectionDone = true;
+      setTimeout(function(){


wfm on linux, what system is this hanging on? why is this timeout needed? putting random timeouts in to solve race conditions is fragile, and in this case, I don't see what is being waited for... the worker is listening, the tcp connection is established... what's the pause for?

Please, see this comment: #1400 (comment)

ADD: This is because of internal implementation details of SCHED_RR scheduling policy.

Again, this is not an imaginary problem. See: #1896 (comment)

And I can reproduce this with current master, at least on my Mac OS machine.

@Olegas I didn't say it was imaginary! I said that inserting arbitrary timeouts to solve undescribed race conditions is poor practice. You didn't answer my question: what is the race condition?

Can you describe what ordering of scheduling events is causing this to break on OS X? Usually, rather than timeouts, it's possible to explicitly wait for precise events to avoid race conditions and timeouts. If node lacks such events, then timeouts are our last resort.
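
For illustration only, here is a minimal sketch of what waiting on a precise event can look like instead of guessing a delay. The helper name connectWhenListening, the plain EventEmitter standing in for a cluster worker, and the port number are all assumptions made for this sketch; none of them come from the test under discussion.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper: run `connect` once the worker reports it is listening,
// instead of guessing a delay with setTimeout.
function connectWhenListening(worker, connect) {
  worker.once('listening', (address) => connect(address));
}

// A plain EventEmitter stands in for a cluster worker here, so the sketch
// runs without forking; the address object is made up for illustration.
const fakeWorker = new EventEmitter();
connectWhenListening(fakeWorker, (address) => {
  console.log('connecting to port', address.port);
});
fakeWorker.emit('listening', { port: 12345 });
```

The ordering is enforced by the event itself, so the sketch behaves the same however the scheduler interleaves the two sides.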

Removed my previous comment. I found an error in my explanation.

jbergstroem · 2015-06-10T23:47:07Z

This test has been timing out on multiple buildbots on several occasions as of late.

sam-github · 2015-06-11T03:45:59Z

@jbergstroem Any idea how to convince ci to build a PR? I tried putting the pr number in the parameterized build, and I tried removing master, and putting the PR number in, and it always seems to be the master that is tested.

sam-github · 2015-06-11T03:49:07Z

https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/802/, for example.

Anyhow, I'm fine with merging this, but since I didn't run it through CI last time, maybe I should this time.

Thoughts, @Fishrock123 @jbergstroem ?

jbergstroem · 2015-06-11T03:49:19Z

@sam-github I always go for user/branch when setting up the parameterised build -- in this case "Olegas" as user and "fix-cluster-test" for branch.

Edit: cancel the current run and get a new one going with the above suggestion. Let's see how that one pans out!

sam-github · 2015-06-11T04:20:55Z

OK, figured it out, test running here: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/806/

jbergstroem · 2015-06-11T04:33:05Z

I got a lint issue:

./iojs tools/eslint/bin/eslint.js src lib test --reset --quiet

test/parallel/test-cluster-worker-wait-server-close.js
  25:27  error  Missing space before opening brace  space-before-blocks

✖ 1 problem (1 error, 0 warnings)

Edit: some of the bots still seem to time out (freebsd101-32, osx1010, freebsd101-64, centos5-32, smartos14-32, smartos14-64)

Fixes: nodejs#1933

trevnorris · 2015-06-11T05:06:26Z

test/parallel/test-cluster-worker-wait-server-close.js

      setTimeout(function() {
-        socket.write('.');
-        connectionDone = true;
+        worker.disconnect();


Just curious, what's the difference between using setTimeout vs passing a callback to .disconnect()?

basically have it how it was originally except put the call into the .disconnect(). if that callback is never called then there must be an issue with the patch itself on some platforms.

I see the issue with my approach. still working through it.

@trevnorris the disconnect callback will never be called, not while there is an open connection to the worker. the test calls disconnect, and waits one second to confirm that disconnect did NOT in fact, cause disconnection... only when the socket is closed, triggered by the write('.'), does the disconnect proceed.

sam-github · 2015-06-11T05:22:14Z

@jbergstroem I think we should revert 9c0a1b8 on master, and re-merge this PR when CI is good. Is that the right thing here?

sam-github · 2015-06-11T05:23:39Z

@Olegas I cleaned up the lint when I merged the first version, you should run make lint on your PRs.

trevnorris · 2015-06-11T05:29:22Z

I have an alternative test that gets around the need for the setTimeout. I'm going to land it in a branch and run it against CI. give me a minute and I'll post the build number.

jbergstroem · 2015-06-11T05:30:41Z

@trevnorris sounds good.

trevnorris · 2015-06-11T05:31:19Z

@sam-github that is normally the correct thing to do. unless it's critical (e.g. a patch of mine recently caused a segfault when using one module) it's not abnormal that we give the author of the patch warning and a day or so to figure it out. not uncommon that a small thing was missed and can be easily fixed.

trevnorris · 2015-06-11T05:34:07Z

CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/808/

Feel free to review my changes to the test: ~~trevnorris@0e6d146~~

Change: trevnorris@768e6b9

trevnorris · 2015-06-11T05:35:46Z

crap. scrap that one. missed a '.' in a path. canceling last one, and might have brought down jenkins in the process...

jbergstroem · 2015-06-11T05:37:54Z

@trevnorris investigating jenkins - stalling over here as well. Unfortunately I lack access to the host machine though (for now).

trevnorris · 2015-06-11T05:41:26Z

Thanks. Working again.

New CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/809/

sam-github · 2015-06-11T05:42:59Z

@trevnorris it's late, I must be misreading your version: no writes occur except in on data handlers... which is a chicken or the egg scenario, no one will ever write. The essential test script is this:

1. worker is listening
2. something connects to worker
3. master calls worker.disconnect()
4. worker does NOT disconnect
5. wait a bit... to make sure worker did NOT disconnect
6. close the tcp connection
7. worker disconnects now that it doesn't have a connection

So, some kind of timeout is necessary... unless you wait a bit, how do you know something has not/will not happen?
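
As a rough sketch of the "wait a bit to make sure it did NOT happen" step, the check can be written as a promise that reports whether an event stayed silent for a given window. The helper name staysQuiet and the 50 ms window are assumptions for illustration, and a plain EventEmitter stands in for the cluster worker so the snippet runs without forking. This is not the code from the actual test.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper (not part of the io.js test suite): resolves to true if
// `eventName` stays silent for `ms` milliseconds, false if it fires first.
function staysQuiet(emitter, eventName, ms) {
  return new Promise((resolve) => {
    const onEvent = () => {
      clearTimeout(timer);
      resolve(false);
    };
    const timer = setTimeout(() => {
      emitter.removeListener(eventName, onEvent);
      resolve(true);
    }, ms);
    emitter.once(eventName, onEvent);
  });
}

// Stand-in for the cluster worker; a plain EventEmitter keeps the sketch
// runnable without forking a process.
const worker = new EventEmitter();

staysQuiet(worker, 'disconnect', 50).then((quiet) => {
  console.log('worker stayed connected during the wait:', quiet);
  // Only now would the test close the TCP connection and expect 'disconnect'.
});
```

The timeout here is not papering over a race; it is the assertion itself, since absence of an event can only be observed over some window of time.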

jbergstroem · 2015-06-11T05:52:58Z

Still seems to be failing [by timeout], unfortunately. Output:

105 - test-cluster-worker-wait-server-close.js  
duration_ms  60.9
# master connecting to worker server
# master connected to worker server

trevnorris · 2015-06-11T05:54:49Z

yup. didn't commit a line I changed. this is what I get for coding so late. have a fix. one min.

@sam-github you are correct.

This reverts commit 9c0a1b8. CI is timing out, work is continuing in #1934

trevnorris · 2015-06-11T06:01:57Z

Alright, one last time: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/810/

Olegas · 2015-06-11T07:33:37Z

@sam-github @trevnorris I don't understand the difference (

And it seems we have another problem here... I've picked @trevnorris 's test and run it - everything is OK, it passed.

Then I commented out ALL console.log calls, and the test no longer passes.

I've started investigating, and now I can leave all console.log calls untouched and comment out just the one console.log('master connecting to worker server'), and the test starts hanging again.

trevnorris · 2015-06-11T15:48:09Z

@Olegas what platform?

kkoopa · 2015-06-11T16:29:50Z

Is there any native code involved here? That kind of problem usually shows up when doing Function.Call directly from C++ instead of going through MakeCallback.

trevnorris · 2015-06-11T17:34:22Z

@kkoopa no native code. here's some interesting output showing the test passing on freebsd when run directly, but failing when run through test.py: https://gist.github.com/jbergstroem/9b9f11f0e43e21cf5cb7

I'm running another slight variation of the test now to get a better idea of where it's failing.

Olegas · 2015-06-11T18:24:24Z

@trevnorris Mac OS X

trevnorris · 2015-06-11T18:28:16Z

@Olegas Thanks. Here's the latest run from my test: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/812/nodes=centos5-32/tapTestReport/test.tap-105/

So it seems there's an issue with the server accepting the client's connection. I'm not sure where to go from there in terms of troubleshooting.

/cc @jbergstroem

sam-github · 2015-06-11T18:33:17Z

I'm testing @Olegas 's theory that the connection is accepted by the cluster master, but the sockfd is not sent to the worker yet: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/813/

sam-github · 2015-06-11T18:39:39Z

My branch passed on osx 1010! That looks promising, but I have to go catch the bus to nodeconf now.

trevnorris · 2015-06-11T18:51:30Z

Will the "onconnection" callback for the server not fire in that case?

Olegas · 2015-06-11T19:40:17Z

I've added console.logs to lib/cluster.js and lib/internal/child_process.js and can see that the newconn and disconnect events are sometimes delivered in reverse order.
The worker closes its handle and performs process.disconnect(), so the IPC pipe is closed. At that very moment the master tries to send the newconn event and never gets called back (the handle transmission is never ack'd, so we can never get here: https://github.com/nodejs/io.js/blob/master/lib/cluster.js#L181)

Next, I can't see the disconnect event on the worker because its IPC pipe is waiting for the handle ACK.
It stops here: https://github.com/nodejs/io.js/blob/master/lib/internal/child_process.js#L588

this._handleQueue is an empty array created here https://github.com/nodejs/io.js/blob/master/lib/internal/child_process.js#L554 when the newconn message is sent with a handle attached.

Olegas · 2015-06-11T19:45:36Z

So, I think there are two problems:

1. The newconn and disconnect messages swap order.
2. Deadlock while waiting for the handle ACK, so the master process never sees the disconnect event for the worker, which leaves the test hanging.
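
To make the second point concrete, here is a toy model of a channel that holds back later messages while a handle ACK is outstanding. The ToyChannel class, its field names, and the message strings are simplified assumptions for illustration, not the actual code in lib/internal/child_process.js.

```javascript
'use strict';

// Toy model only: names and behavior are simplified assumptions, not the
// actual code in lib/internal/child_process.js.
class ToyChannel {
  constructor() {
    this.delivered = [];     // messages that reached the other side
    this.handleQueue = null; // non-null while a handle ACK is outstanding
  }

  send(message, handle) {
    if (this.handleQueue !== null) {
      // A handle is in flight: hold everything until it is acknowledged.
      this.handleQueue.push(message);
      return;
    }
    if (handle !== undefined) this.handleQueue = [];
    this.delivered.push(message);
  }

  ack() {
    // The ACK releases whatever was queued behind the handle.
    const queued = this.handleQueue || [];
    this.handleQueue = null;
    for (const message of queued) this.send(message);
  }
}

const channel = new ToyChannel();
channel.send('newconn', { fd: 'socket' }); // handle attached, ACK now pending
channel.send('disconnect');                // queued behind the pending ACK
console.log(channel.delivered);            // [ 'newconn' ]
channel.ack();
console.log(channel.delivered);            // [ 'newconn', 'disconnect' ]
```

In this model, if ack() is never called (for example because the other end already closed its side of the pipe), the queued disconnect is never delivered, which matches the hang described above.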

trevnorris · 2015-06-11T20:06:37Z

My test shows the server never actually accepts the connection. Otherwise messages would have been passed between them. I don't see how that relates to (2) and unsure if it could relate to (1).

sam-github · 2015-06-11T21:09:49Z

It sounds like the bug is hitting problems with robustness of master/worker protocol by doing a disconnect during a connection.

jbergstroem · 2015-06-11T22:57:53Z

Can confirm that the commit in sam-github/pr-1934-redux fixes the test as shown in jenkins and my own tests.

sam-github · 2015-06-12T04:57:57Z

replaced by #1953

Olegas added the cluster and test labels Jun 10, 2015

sam-github reviewed Jun 10, 2015

jbergstroem mentioned this pull request Jun 11, 2015

Release proposal: 2.3.0 #1939

Closed

cluster: fix broken/hanging tests

c2b44a4

Fixes: nodejs#1933

Olegas force-pushed the fix-cluster-test branch from da82f00 to c2b44a4 Compare June 11, 2015 05:00

trevnorris reviewed Jun 11, 2015

sam-github added a commit that referenced this pull request Jun 11, 2015

Revert "cluster: wait on servers closing before disconnect"

e194480

This reverts commit 9c0a1b8. CI is timing out, work is continuing in #1934

sam-github mentioned this pull request Jun 11, 2015

Revert "cluster: wait on servers closing before disconnect" #1945

Closed

sam-github mentioned this pull request Jun 11, 2015

fix cluster test for disconnect with open connections #1953

Closed

sam-github closed this Jun 12, 2015
