Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-child-process-fork-regr-gh-2847 - fails on pLinux BE and AIX #3245

Closed
mhdawson opened this issue Oct 7, 2015 · 12 comments
Closed

test-child-process-fork-regr-gh-2847 - fails on pLinux BE and AIX #3245

mhdawson opened this issue Oct 7, 2015 · 12 comments
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.

Comments

@mhdawson
Copy link
Member

mhdawson commented Oct 7, 2015

https://ci.nodejs.org/job/iojs+pr+ppc/nodes=ppcbe-fedora20/21/console

Failure on pLinux in CI https://ci.nodejs.org/job/iojs+pr+ppc/nodes=ppcbe-fedora20/21/console:

not ok 43 test-child-process-fork-regr-gh-2847.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: connect ECONNREFUSED 127.0.0.1:12346
#    at Object.exports._errnoException (util.js:874:11)
#    at exports._exceptionWithHostPort (util.js:897:20)
#    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1060:14)
  ---
  duration_ms: 2.612

I see the AIX failures in our internal builds but not the one on pLinux

not ok 43 test-child-process-fork-regr-gh-2847.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: connect ECONNREFUSED 127.0.0.1:13000
#    at Object.exports._errnoException (util.js:874:11)
#    at exports._exceptionWithHostPort (util.js:897:20)
#    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1060:14)
  ---
  duration_ms: 2.460

@mscdex mscdex added child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. labels Oct 8, 2015
@jbergstroem
Copy link
Member

@mhdawson you've probably already checked -- but in case you didn't, have a look and see if there are trailing processes from prior runs. It's the most common cause of ECONNREFUSED.

@jbergstroem
Copy link
Member

@gireeshpunathil you'd see both; one being workers not being able to connect and one being master not being able to spawn. Not saying this is the issue here though.

@gireeshpunathil
Copy link
Member

@jbergstroem, I believe trailing processes from prior runs would case EADDRINUSE, not ECONNREFUSED.

The scenario this test case covers is a cluster topology, wherein the worker exits on the first message from the master, and the master, sends a dummy message to the worker after connectivity is established with a server, which is done 100 times back-to-back. The intention is to make sure that the queued up data in the channel causes the process.disconnect() to wait until they are flushed.

I believe there are two issues with the test case:

  1. It assumes that all 100 messaged are dispatched before the worker recieves its first message. If this does not happen due to timing differences, worker.send() call will fail, as the worker is dead when some messages are still being dispatched:

#Error: channel closed
at ChildProcess.target.send (internal/child_process.js:510:16)
at Worker.send (cluster.js:47:21)
at Socket. (test/parallel/test-child-process-fork-regr-gh-2847.js:27:14)
at Socket.g (events.js:260:16)
at emitNone (events.js:67:13)
at Socket.emit (events.js:166:7)
at TCPConnectWrap.afterConnect as oncomplete

2.It assumes that when the first message is received in the worker and subsequently it's close callback is invoked in the master, next message is already dispatched by the master. If this does not happen due to timing differences, the server gets closed, and net.connect() will fail, as the server is already closed:

Error: connect ECONNREFUSED 127.0.0.1:12346
at Object.exports._errnoException (util.js:837:11)
at exports._exceptionWithHostPort (util.js:860:20)
at TCPConnectWrap.afterConnect as oncomplete

@gireeshpunathil
Copy link
Member

@jbergstroem, thanks for the quick reply. I confirmed that is not the case here with a clean run, that too with new, free port number.

Also, with a bit of timing change (changing the for loop to a setInterval or delaying the second worker.send call etc.) I am able to reproduce these two errors in Linux IA32 as well.

@mhdawson
Copy link
Member Author

Gireesh could you add an update of your investigation so far

@gireeshpunathil
Copy link
Member

Further debugging revealed the following:

  1. AIX loopback socket is slower than that of Linux. Only a handful of requests are made before the worker gets the first message, and subsequently shutdown itself.
  2. If I use IP address instead of localhost, the test passes consistently. All the 100 requests are dispatched before the worker gets the first message.
  3. I am not able to reproduce the error in 1000 iterations in PLinux. I will run for some more time to make any inference.
  4. However, as mentioned in the previous comment, the test case depends on strict timing assumptions based on Linux behavior, pLinux failure can be inferred as occurred under heavy load.
  5. If I change the tight loop to setInterval, the ECONNREFUSED failure is reproducible in all platforms.
  6. If I add a slight delay inside the send function for subsequent sends, the "channel closed" error is reproducible in all platforms.

I propose these changes to the test:

  1. Install an error handler on the connected socket, and ignore error if any.
  2. Implement a callback for the message send function, and absorb error if any.
  3. To make sure at least initial few connections go through successfully, use a counter.

Please let me know about this approach. If agree, I will come up with a PR.
@indutny , request you to share your thoughts.

@mhdawson
Copy link
Member Author

It seems to fail on PPC BE consistently in the CI

https://ci.nodejs.org/job/node-test-commit-plinux/17/nodes=ppcbe-fedora20/console

not ok 45 test-child-process-fork-regr-gh-2847.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: connect ECONNREFUSED 127.0.0.1:12346
#    at Object.exports._errnoException (util.js:889:11)
#    at exports._exceptionWithHostPort (util.js:912:20)
#    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1063:14)
  ---
  duration_ms: 2.712

@mhdawson
Copy link
Member Author

The suggested approach seems reasonable. Validating that at least some number of request get through will catch the case when we might otherwise ignore real errors.

Gireesh if you can put together a pull request or even just a branch that includes your changes we can start node-test-commit-plinux pointed at that in order to see if it resolves the problems seen on PPC in the CI runs

@gireeshpunathil
Copy link
Member

@mhdawson , thanks. Here is the test in a branch with the proposed modification for your validation in CI. A pull request is on its way.
https://github.com/gireeshpunathil/node/blob/clustermasterworkersync/test/parallel/test-child-process-fork-regr-gh-2847.js
/cc @indutny

@mhdawson
Copy link
Member Author

CI run here L https://ci.nodejs.org/job/node-test-commit/883/

Shows that failure is resolved for PPC, there are some issues with formatting reported by the linter though. You should be able to catch these with a complete make test run and fix up when putting together the PR

The other CI failures look unrelated to the test change to me.

@mhdawson
Copy link
Member Author

Since Gireesh is on vacation this week went ahead and resolved lint issues and created PR #3459

@mhdawson
Copy link
Member Author

PR closed, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

4 participants