Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: fix broken/hanging tests #1934

Closed
wants to merge 1 commit into from

Conversation

Olegas
Copy link
Contributor

@Olegas Olegas commented Jun 10, 2015

Fixes: #1933

@Olegas Olegas added cluster Issues and PRs related to the cluster subsystem. test Issues and PRs related to the tests. labels Jun 10, 2015
@Olegas
Copy link
Contributor Author

Olegas commented Jun 10, 2015

/cc @sam-github @Fishrock123

setTimeout(function() {
socket.write('.');
connectionDone = true;
setTimeout(function(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wfm on linux, what system is this hanging on? why is this timeout needed? putting random timeouts in to solve race conditions is fragile, and in this case, I don't see what is being waited for... the worker is listening, the tcp connection is established... what's the pause for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, see this comment: #1400 (comment)

ADD: This is because of internal implementation details of SCHED_RR scheduling policy.

Again, this is not imaginary problem. See: #1896 (comment)

And I can reproduce this at least in my Mac OS from current master

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Olegas I didn't say it was imaginary! I said that inserting arbitrary timeouts to solve undescribed race conditions is poor practice. You didn't answer my question: what is the race condition?

Can you describe what ordering of scheduling events is causing this to break on OS X? Usually, rather than timeouts, its possible to explicitly wait for precise events to avoid race conditions and timeouts. If node lacks such events, then timeouts are our last resort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed my previous comment. I found an error in my explanation.

@jbergstroem
Copy link
Member

This test has been timing out on multiple buildbots on several occasions as of late.

@sam-github
Copy link
Contributor

@jbergstroem Any idea how to convince ci to build a PR? I tried putting the pr number in the paramaterized build, and I tried removing master, and putting the PR number in, and it always seems to be the master that is tested.

@sam-github
Copy link
Contributor

https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/802/, for example.

Anyhow, I'm fine with merging this, but since I didn't run it through CI last time, maybe I should this time.

Thoughts, @Fishrock123 @jbergstroem ?

@jbergstroem
Copy link
Member

@sam-github I always go for user/branch when setting up the parameterised build -- in this case "Olegas" as user and "fix-cluster-test" for branch.

Edit: cancel the current run and get a new one going with above suggestion. Lets see how that one pans out!

@sam-github
Copy link
Contributor

OK, figured it out, test running here: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/806/

@jbergstroem
Copy link
Member

I got a lint issue:

./iojs tools/eslint/bin/eslint.js src lib test --reset --quiet

test/parallel/test-cluster-worker-wait-server-close.js
  25:27  error  Missing space before opening brace  space-before-blocks

✖ 1 problem (1 error, 0 warnings)

Edit: some of the bots seems to still time out (freebsd101-32, osx1010, freebsd101-64, centos5-32, smartos14-32, smartos14-64)

setTimeout(function() {
socket.write('.');
connectionDone = true;
worker.disconnect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what's the difference between using setTimeout vs passing a callback to .disconnect()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically have it how it was originally except put the call into the .disconnect(). if that callback is never called then there must be an issue with the patch itself on some platforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the issue with my approach. still working through it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trevnorris the disconnect callback will never be called, not while there is an open connection to the worker. the test calls disconnect, and waits one second to confirm that disconnect did NOT in fact, cause disconnection... only when the socket is closed, triggered by the write('.'), does the disconnect proceed.

@sam-github
Copy link
Contributor

@jbergstroem I think we should revert 9c0a1b8 on master, and re-merge this PR when CI is good. Is that the right thing here?

@sam-github
Copy link
Contributor

@Olegas I cleaned up the lint when I merged the first version, you should make lint on your PRs.

@trevnorris
Copy link
Contributor

I have an alternative test that gets around the need for the setTimeout. I'm going to land it in a branch and run it against CI. give me a minute and I'll post the build number.

@jbergstroem
Copy link
Member

@trevnorris sounds good.

@trevnorris
Copy link
Contributor

@sam-github that is normally the correct thing to do. unless it's critical (e.g. a patch of mine recently caused a segfault when using one module) it's not abnormal that we give the author of the patch warning and a day or so to figure it out. not uncommon that a small thing was missed and can be easily fixed.

@trevnorris
Copy link
Contributor

CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/808/

Feel free to review my changes to the test: trevnorris@0e6d146

Change: trevnorris@768e6b9

@trevnorris
Copy link
Contributor

crap. scrap that one. missed a '.' in a path. canceling last one, and might have brought down jenkins in the process...

@jbergstroem
Copy link
Member

@trevnorris investigating jenkins - stalling over here as well. Unfortunately lack access to the host machine though (for now).

@trevnorris
Copy link
Contributor

Thanks. Working again.

New CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/809/

@sam-github
Copy link
Contributor

@trevnorris its late, I must be misreading your version: no writes occur except in on data handlers... which is a chicken or the egg scenario, noone will ever write. The essential test script is this:

  1. worker is listening
  2. something connects to worker
  3. master calls worker.disconnect()
  4. worker does NOT disconnect
  5. wait a bit... to make sure worker did NOT disconnect
  6. close the tcp connection
  7. worker disconnects now that it doesn't have a connection

So, some kind of timeout is necessary... unless you wait a bit, how do you know something has not/will not happen?

@jbergstroem
Copy link
Member

Still seems to be failing [by timeout], unfortunately. Output:

105 - test-cluster-worker-wait-server-close.js  
duration_ms  60.9
# master connecting to worker server
# master connected to worker server

@trevnorris
Copy link
Contributor

yup. didn't commit a line I changed. this is what I get for coding so late. have a fix. one min.

@sam-github you are correct.

sam-github added a commit that referenced this pull request Jun 11, 2015
This reverts commit 9c0a1b8.

CI is timing out, work is continuing in
#1934
@trevnorris
Copy link
Contributor

@Olegas
Copy link
Contributor Author

Olegas commented Jun 11, 2015

@sam-github @trevnorris I don't understand the difference (

And it seems we have another problem here... I've picked @trevnorris 's test and run it - evething is OK, it passed.

Then I just comment out ALL console.log and test is not passing now.

I've started investigation and now I can leave all console.log's untouched and comment just one console.log('master connecting to worker server') and test start hanging again.

@trevnorris
Copy link
Contributor

@Olegas what platform?

@kkoopa
Copy link

kkoopa commented Jun 11, 2015

Is there any native code involved here? That kind of problem usually shows up when doing Function.Call directly from C++ instead of going through MakeCallback.

@trevnorris
Copy link
Contributor

@kkoopa no native code. here's some interesting output showing the test pass on freebsd when run directly, but then fails running it through test.py: https://gist.github.com/jbergstroem/9b9f11f0e43e21cf5cb7

I'm running another slight variation of the test now to get a better idea of where it's failing.

@Olegas
Copy link
Contributor Author

Olegas commented Jun 11, 2015

@trevnorris Mac OS X

@trevnorris
Copy link
Contributor

@Olegas Thanks. Here's the latest run from my test: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/812/nodes=centos5-32/tapTestReport/test.tap-105/

So it seems there's an issue with the server accepting the client's connection. I'm not sure where to go from there in terms of troubleshooting.

/cc @jbergstroem

@sam-github
Copy link
Contributor

I'm testing @Olegas 's theory that the connection is accepted by the cluster master, but the sockfd is not sent to the worker yet: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/813/

@sam-github
Copy link
Contributor

My branch passed on osx 1010! That looks promising, but I have to go catch the bus to nodeconf now.

@trevnorris
Copy link
Contributor

Will the "onconnection" callback for the server not fire in that case?

@Olegas
Copy link
Contributor Author

Olegas commented Jun 11, 2015

I've added console.logs to lib/cluster.js and lib/internal/child-process.js and can see sometimes newconn and disconnect events delivered in reverse order.
Worker closes it's handle and perform process.disconnect(), IPC pipe is getting closed. And in this single moment master is trying to send newconn event and never get called back (handle transmission is not get ack'd so we can't ever get here: https://github.com/nodejs/io.js/blob/master/lib/cluster.js#L181)

Next, I can't see the disconnect event on worker cause' it's IPC pipe is waiting for handle ACK
It stops here: https://github.com/nodejs/io.js/blob/master/lib/internal/child_process.js#L588

this._handleQueue is an empty array created here https://github.com/nodejs/io.js/blob/master/lib/internal/child_process.js#L554 when newconn message is sent with handle attached.

@Olegas
Copy link
Contributor Author

Olegas commented Jun 11, 2015

So, I think there are two problems:

  1. newconn and disconnect messages swap
  2. Deadlock while waiting for handle ack, thus no disconnect event on worker inside master process what leads to test hanged test

@trevnorris
Copy link
Contributor

My test shows the server never actually accepts the connection. Otherwise messages would have been passed between them. I don't see how that relates to (2) and unsure if it could relate to (1).

@sam-github
Copy link
Contributor

It sounds like the bug is hitting problems with robustness of master/worker protocol by doing a disconnect during a connection.

@jbergstroem
Copy link
Member

Can confirm that the commit in sam-github/pr-1934-redux fixes the test as shown in jenkins and my own tests.

@sam-github
Copy link
Contributor

replaced by #1953

@sam-github sam-github closed this Jun 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants