-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster: fix race condition setting suicide prop #4349
Conversation
const worker = cluster.fork(); | ||
let disconnected = false; | ||
function forkWorker(action) { | ||
const worker = cluster.fork({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please make this: const worker = cluster.fork({ action });
LGTM. CI seems like it might not be in great shape right now, but trying anyway: https://ci.nodejs.org/job/node-test-pull-request/1045/ |
acb7566
to
982acc4
Compare
@cjihrig updated. Thanks! |
@bnoordhuis what is your take on the semver-ness of this change. I don't think it should break anyone. |
Any idea if this fixes #4205? That would be awesome. |
Following up on my own question about how this affects #4205: On OS X, the code below exits normally with current master. With the change proposed in this PR, the code does not exit. On Linux (ubuntu 14.04), the following code throws an Is the fact that this code does not exit a bug that needs to be addressed? Or is it expected behavior?
|
982acc4
to
be062b8
Compare
Updated |
@@ -658,15 +663,21 @@ function workerInit() { | |||
if (accepted) server.onconnection(0, handle); | |||
} | |||
|
|||
Worker.prototype.disconnect = function() { | |||
Worker.prototype.disconnect = function(fromParent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you name this something like masterInitiated
?
Yes, still LGTM with comments. |
send({ act: 'suicide' }); | ||
process.disconnect(); | ||
// if called from worker, wait for ack to be sure suicide property is | ||
// sent in the master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should that be set
rather than sent
?
This would be Might it be better to do it in a way that allows the information to be sent when If we are going to add a parameter, I'd be more inclined to add an |
Maybe have a private function (say, |
When I run your modified version of the test file with the current master, it does not fail. Are you able to modify it so that it fails every time with the current master and succeeds every time with the patch here? (Or do I need to run it on a particular OS? I'm running it on OS X.) Otherwise, if the bug gets reintroduced, it's probably more likely that the test will be marked as flaky than it will be considered a bug in the code, as it probably won't fail on the commit that introduces the bug but rather on a subsequent unrelated commit. |
I'm trying to figure out if #4418 would fix the issue you are experiencing or not. I was going to find out by simply seeing if that code passes the test, but since current master passes the test, Thanks, by the way, for all you do in trying to squash flaky tests. It is really valuable work and challenging in a pull-your-hair-out-and-bang-your-head-on-the-table kind of way. (EDIT: In case my hyphenated phrase is unclear, people don't always appreciate just how agonizing it can be to try to squash these bugs, so I want to say how much I appreciate you doing so much lately to do exactly that.) |
Stress test with current master: https://ci.nodejs.org/job/node-stress-single-test/198/nodes=ubuntu1404-64/console No failures on test-regress-GH-3238 after 9999 consecutive test runs on 64-bit ubuntu 14. What operating system are you seeing these failures on? |
@Trott IIRC the test was failing to me in |
be062b8
to
d83edc3
Compare
@santigimeno Judging from the stack trace you posted, it sure does seem like either I'd be reluctant to make a change like this without having a test that fails without the change and passes with it. |
@@ -523,6 +528,35 @@ function workerInit() { | |||
process: process, | |||
state: 'online' | |||
}); | |||
|
|||
worker._disconnect = function(masterInitiated) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will still be exposed to the end user. Can we make it private to this module rather than exposing it or is there a reason it cannot be done that way?
d83edc3
to
f3a3370
Compare
@Trott I'm able to reproduce the error consistently in my
For example, having 16 copies of the test I'm getting the error before the tenth iteration most of the time:
I don't know if it'll be reproducible in other computers as I'm suspecting there's something in my computer that makes flaky tests fail more easily. |
Finally all is green :D |
OK! Since it has gone through so many iterations since the last Does this still look good to you, @cjihrig ? |
Oh, and since there are changes to the previously existing regression test, we should double-check that it still fires on the regression it was written to catch. |
assert.strictEqual(code, 0, 'worker exited with error'); | ||
}, tries * 2)); | ||
|
||
for (let i = 0; i < tries; ++ i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary space between ++
and i
.
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event.
f63b35c
to
0fb3ec9
Compare
PR updated. Thanks! |
The changes to Could you clarify the purpose of |
@cjihrig the new test forks lots of workers so the issue this PR tries to solve happens consistently. With the original test it barely did. |
When you say consistently, is it reproduced 100% of the time? |
At least the last time I checked (it's been a few days) it did in the platforms I tried: |
OK, well unless anyone objects to this, I'm cool with it. |
@Trott did you delete your testing rant from here? I was going to say that you need to turn it into a skeleton of a "how to write tests for core" doc for @nodejs/testing to iterate on and attach to the contributor guidelines. We need a rulebook we can follow when reviewing tests. |
+1 that would be quite valuable
|
@rvagg @jasnell I deleted it from GitHub seconds after posting it, but not before saving it off somewhere to use later. Creating guidelines for tests is one item on a very long list of things I'd like to see the Testing WG do. First meeting is Friday. Check us out. nodejs/testing#1 |
Confirmed that the test still fires an https://ci.nodejs.org/job/node-test-pull-request/1228/ There are two tests on Windows that are failing on master and will continue to fail until #4679 lands (or some alternate fix). As long as nothing else is up with the CI, will land. Thanks for your persistence on this, @santigimeno! |
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
Landed in 9571be1. (Nice detailed git commit message, too, @santigimeno!) |
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-GH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: #4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-GH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: #4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-GH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: #4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
There is no guarantee that the
suicide
property of a worker in the masterprocess is going to be set when the
disconnect
andexit
events are emitted.To fix it, wait for the ACK of the suicide message from the master before
disconnecting the worker.
Modify
test-regress-GH-3238
so it checks both thekill
anddisconnect
cases. Also take into account that the
disconnect
event may be received afterthe
exit
event.I discovered this because I was sometimes getting this error:
I don't know if it's the proper fix though