Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: fix race condition setting suicide prop #4349

Closed
wants to merge 1 commit into from

Conversation

santigimeno
Copy link
Member

There is no guarantee that the suicide property of a worker in the master
process is going to be set when the disconnect and exit events are emitted.
To fix it, wait for the ACK of the suicide message from the master before
disconnecting the worker.
Modify test-regress-GH-3238 so it checks both the kill and disconnect
cases. Also take into account that the disconnect event may be received after
the exit event.

I discovered this because I was sometimes getting this error:

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: false === true
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/parallel/test-regress-GH-3238.js:17:12)
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/common.js:401:15)
    at emitTwo (events.js:88:13)
    at Worker.emit (events.js:173:7)
    at ChildProcess.<anonymous> (cluster.js:361:14)
    at ChildProcess.g (events.js:264:16)
    at emitTwo (events.js:88:13)
    at ChildProcess.emit (events.js:173:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)

I don't know if it's the proper fix though

@mscdex mscdex added the cluster Issues and PRs related to the cluster subsystem. label Dec 18, 2015
const worker = cluster.fork();
let disconnected = false;
function forkWorker(action) {
const worker = cluster.fork({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make this: const worker = cluster.fork({ action });

@cjihrig
Copy link
Contributor

cjihrig commented Dec 22, 2015

LGTM.

CI seems like it might not be in great shape right now, but trying anyway: https://ci.nodejs.org/job/node-test-pull-request/1045/

@santigimeno
Copy link
Member Author

@cjihrig updated. Thanks!

@cjihrig
Copy link
Contributor

cjihrig commented Dec 22, 2015

@bnoordhuis what is your take on the semver-ness of this change. I don't think it should break anyone.

@Trott
Copy link
Member

Trott commented Dec 22, 2015

Any idea if this fixes #4205? That would be awesome.

@Trott
Copy link
Member

Trott commented Dec 23, 2015

Following up on my own question about how this affects #4205:

On OS X, the code below exits normally with current master. With the change proposed in this PR, the code does not exit.

On Linux (ubuntu 14.04), the following code throws an AssertionError (which is bug #4205) with current master. With the change proposed here, the AssertionError goes away but the code does not exit.

Is the fact that this code does not exit a bug that needs to be addressed? Or is it expected behavior?

'use strict';
const net = require('net');
const cluster = require('cluster');
cluster.schedulingPolicy = cluster.SCHED_NONE;

if (cluster.isMaster) {
  var worker1, worker2;

  worker1 = cluster.fork();
  worker1.on('message', function() {
    worker2 = cluster.fork();
    worker1.disconnect();
    worker2.on('online', worker2.disconnect);
  });

  return;
}

var server = net.createServer();

server.listen(12346, function() {
  process.send('listening');
});

@santigimeno
Copy link
Member Author

@Trott Changing the behavior of disconnect to synchronous when called from the parent makes your test code fail as described in #4205. I'll apply that change.

@santigimeno
Copy link
Member Author

Updated

@Trott
Copy link
Member

Trott commented Dec 23, 2015

OK, so it doesn't fix #4205 but it also doesn't change the nature of that problem. And it gives me a pretty good idea of where to look to potentially figure out what the real issue is there. Thanks.

@cjihrig Does your previous LGTM stand with the additional modification that's been made?

@@ -658,15 +663,21 @@ function workerInit() {
if (accepted) server.onconnection(0, handle);
}

Worker.prototype.disconnect = function() {
Worker.prototype.disconnect = function(fromParent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you name this something like masterInitiated?

@cjihrig
Copy link
Contributor

cjihrig commented Dec 23, 2015

Yes, still LGTM with comments.

send({ act: 'suicide' });
process.disconnect();
// if called from worker, wait for ack to be sure suicide property is
// sent in the master
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should that be set rather than sent?

@Trott
Copy link
Member

Trott commented Dec 24, 2015

This would be semver-minor because it adds a parameter to worker.disconnect(), right? And, in that case, it would need a doc update as well.

Might it be better to do it in a way that allows the information to be sent when disconnect() is being called internally within cluster.js but not in a way that exposes a new API interface to the end user? This is a function parameter that no end user would ever need, right?

If we are going to add a parameter, I'd be more inclined to add an options object and check for options.fromParent (or whatever property name is appropriate) rather than a fromParent function parameter.

@Trott
Copy link
Member

Trott commented Dec 24, 2015

Maybe have a private function (say, _disconnect()) that takes the parameter so that it's available to cluster.js itself, and have Worker.prototype.disconnect() call _disconnect()?

@Trott
Copy link
Member

Trott commented Dec 24, 2015

When I run your modified version of the test file with the current master, it does not fail. Are you able to modify it so that it fails every time with the current master and succeeds every time with the patch here? (Or do I need to run it on a particular OS? I'm running it on OS X.) Otherwise, if the bug gets reintroduced, it's probably more likely that the test will be marked as flaky than it will be considered a bug in the code, as it probably won't fail on the commit that introduces the bug but rather on a subsequent unrelated commit.

@Trott
Copy link
Member

Trott commented Dec 24, 2015

I'm trying to figure out if #4418 would fix the issue you are experiencing or not. I was going to find out by simply seeing if that code passes the test, but since current master passes the test, ¯\_(ツ)_/¯. Any ideas? Maybe I'll run some stress tests...

Thanks, by the way, for all you do in trying to squash flaky tests. It is really valuable work and challenging in a pull-your-hair-out-and-bang-your-head-on-the-table kind of way. (EDIT: In case my hyphenated phrase is unclear, people don't always appreciate just how agonizing it can be to try to squash these bugs, so I want to say how much I appreciate you doing so much lately to do exactly that.)

@Trott
Copy link
Member

Trott commented Dec 25, 2015

Stress test with current master: https://ci.nodejs.org/job/node-stress-single-test/198/nodes=ubuntu1404-64/console

No failures on test-regress-GH-3238 after 9999 consecutive test runs on 64-bit ubuntu 14. What operating system are you seeing these failures on?

@santigimeno
Copy link
Member Author

@Trott IIRC the test was failing to me in OS X. It only failed when the exit event was received before the disconnect event. I'm not sure there's an easy way to simulate that scenario in a consistent way.
BTW thank for your kind words :)

@santigimeno
Copy link
Member Author

@Trott @cjihrig I have updated the PR. I think I have addressed all your comments. Let me know what you think. Thanks!

@Trott
Copy link
Member

Trott commented Dec 26, 2015

@santigimeno Judging from the stack trace you posted, it sure does seem like either exit fires before disconnect or else disconnect does not fire at all. But I am unable to reproduce this problem. I'm still trying, though. (I'm running a stress test on OS X right now.)

I'd be reluctant to make a change like this without having a test that fails without the change and passes with it.

@@ -523,6 +528,35 @@ function workerInit() {
process: process,
state: 'online'
});

worker._disconnect = function(masterInitiated) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still be exposed to the end user. Can we make it private to this module rather than exposing it or is there a reason it cannot be done that way?

@santigimeno
Copy link
Member Author

@Trott I'm able to reproduce the error consistently in my OS X computer by running multiple test-regress-GH-3238 tests in parallel like this:

while true; do /usr/bin/python tools/test.py --mode=release parallel/test-regress-GH-3238* -J; if [ $? -ne 0 ]; then break; fi; done

For example, having 16 copies of the test I'm getting the error before the tenth iteration most of the time:

while true; do /usr/bin/python tools/test.py --mode=release parallel/test-regress-GH-3238* -J; if [ $? -ne 0 ]; then break; fi; done
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
=== release test-regress-GH-3238-6 ===                     
Path: parallel/test-regress-GH-3238-6
assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: false === true
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/parallel/test-regress-GH-3238-6.js:16:12)
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/common.js:401:15)
    at emitTwo (events.js:88:13)
    at Worker.emit (events.js:173:7)
    at ChildProcess.<anonymous> (cluster.js:361:14)
    at ChildProcess.g (events.js:264:16)
    at emitTwo (events.js:88:13)
    at ChildProcess.emit (events.js:173:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
Command: out/Release/node /Users/sgimeno/node/node/test/parallel/test-regress-GH-3238-6.js
[00:00|% 100|+  15|-   1]: Done  

I don't know if it'll be reproducible in other computers as I'm suspecting there's something in my computer that makes flaky tests fail more easily.

@Trott
Copy link
Member

Trott commented Jan 8, 2016

@santigimeno
Copy link
Member Author

Finally all is green :D

@Trott
Copy link
Member

Trott commented Jan 8, 2016

OK! Since it has gone through so many iterations since the last LGTM, it is probably good to ask reviewers to take one last look to make sure it is still OK by them.

Does this still look good to you, @cjihrig ?

@Trott
Copy link
Member

Trott commented Jan 8, 2016

Oh, and since there are changes to the previously existing regression test, we should double-check that it still fires on the regression it was written to catch.

assert.strictEqual(code, 0, 'worker exited with error');
}, tries * 2));

for (let i = 0; i < tries; ++ i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary space between ++ and i.

There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.
To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.
Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.
Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.
@santigimeno
Copy link
Member Author

PR updated. Thanks!

@cjihrig
Copy link
Contributor

cjihrig commented Jan 13, 2016

The changes to lib/cluster.js and the existing test LGTM.

Could you clarify the purpose of test-cluster-disconnect-suicide-race.js please. It looks a lot like the other test, but with the addition of tries, which looks somewhat arbitrary.

@santigimeno
Copy link
Member Author

@cjihrig the new test forks lots of workers so the issue this PR tries to solve happens consistently. With the original test it barely did.

@cjihrig
Copy link
Contributor

cjihrig commented Jan 13, 2016

When you say consistently, is it reproduced 100% of the time?

@santigimeno
Copy link
Member Author

At least the last time I checked (it's been a few days) it did in the platforms I tried: jessie 64, OS X and FreeBSD. I just tested in a jessie 64 box and the test fails in current master.

@cjihrig
Copy link
Contributor

cjihrig commented Jan 13, 2016

OK, well unless anyone objects to this, I'm cool with it.

@rvagg
Copy link
Member

rvagg commented Jan 14, 2016

@Trott did you delete your testing rant from here? I was going to say that you need to turn it into a skeleton of a "how to write tests for core" doc for @nodejs/testing to iterate on and attach to the contributor guidelines. We need a rulebook we can follow when reviewing tests.

@jasnell
Copy link
Member

jasnell commented Jan 14, 2016

+1 that would be quite valuable
On Jan 13, 2016 4:30 PM, "Rod Vagg" notifications@github.com wrote:

@Trott https://github.com/Trott did you delete your testing rant from
here? I was going to say that you need to turn it into a skeleton of a "how
to write tests for core" doc for @nodejs/testing
https://github.com/orgs/nodejs/teams/testing to iterate on and attach
to the contributor guidelines. We need a rulebook we can follow when
reviewing tests.


Reply to this email directly or view it on GitHub
#4349 (comment).

@Trott
Copy link
Member

Trott commented Jan 14, 2016

@rvagg @jasnell I deleted it from GitHub seconds after posting it, but not before saving it off somewhere to use later. Creating guidelines for tests is one item on a very long list of things I'd like to see the Testing WG do. First meeting is Friday. Check us out. nodejs/testing#1

@Trott
Copy link
Member

Trott commented Jan 14, 2016

Confirmed that the test still fires an AssertionError in Node.js 4.1.2 so we're all set, I think. One final CI run because I don't ever need more than a flimsy excuse to do a CI run.

https://ci.nodejs.org/job/node-test-pull-request/1228/

There are two tests on Windows that are failing on master and will continue to fail until #4679 lands (or some alternate fix). As long as nothing else is up with the CI, will land. Thanks for your persistence on this, @santigimeno!

Trott pushed a commit to Trott/io.js that referenced this pull request Jan 14, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: nodejs#4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@Trott
Copy link
Member

Trott commented Jan 14, 2016

Landed in 9571be1. (Nice detailed git commit message, too, @santigimeno!)

@Trott Trott closed this Jan 14, 2016
rvagg pushed a commit that referenced this pull request Jan 14, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-GH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: #4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jan 28, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-GH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: #4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 11, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-GH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: #4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit to MylesBorins/node that referenced this pull request Feb 11, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: nodejs#4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Feb 11, 2016
MylesBorins pushed a commit to MylesBorins/node that referenced this pull request Feb 15, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: nodejs#4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
scovetta pushed a commit to scovetta/node that referenced this pull request Apr 2, 2016
There is no guarantee that the `suicide` property of a worker in the
master process is going to be set when the `disconnect` and `exit`
events are emitted.

To fix it, wait for the ACK of the suicide message from the master
before disconnecting the worker. Also, there's no need to send the
suicide message from the worker if the disconnection has been
initiated in the master.

Add `test-cluster-disconnect-suicide-race` that forks a lot of workers
to consistently reproduce the issue this patch tries to solve.

Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and
`disconnect` cases. Also take into account that the `disconnect` event
may be received after the `exit` event.

PR-URL: nodejs#4349
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants