Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: ignore spurious 'EMFILE' #12698

Merged
merged 1 commit into from
May 19, 2017
Merged

test: ignore spurious 'EMFILE' #12698

merged 1 commit into from
May 19, 2017

Conversation

refack
Copy link
Contributor

@refack refack commented Apr 27, 2017

As discussed in most recently on #10286 test/parallel/test-child-process-fork-regr-gh-2847.js is a little flaky
Last sighting was

c:\workspace\node-test-binary-windows\RUN_SUBSET\1\VS_VERSION\vcbt2015\label\win10\test\parallel\test-child-process-fork-regr-gh-2847.js:61
            throw err;
            ^

Error: write EMFILE
    at exports._errnoException (util.js:1026:11)
    at ChildProcess.target._send (internal/child_process.js:663:20)
    at ChildProcess.target.send (internal/child_process.js:547:19)
    at Worker.send (internal/cluster/worker.js:54:28)
    at Socket.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\1\VS_VERSION\vcbt2015\label\win10\test\parallel\test-child-process-fork-regr-gh-2847.js:33:14)
    at Object.onceWrapper (events.js:312:19)
    at emitNone (events.js:105:13)
    at Socket.emit (events.js:207:7)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1102:10)

Possible explanation #3635 (comment)
So as a follow up to #5422 I suggest we also ignore EMFILE raised from the connection.

Ref: #3635
Ref: #5178
Ref: #5179
Ref: #4005
Ref: #5121
Ref: #5422
Ref: #12621 (comment)
Fixes: #10286

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Apr 27, 2017
@refack
Copy link
Contributor Author

refack commented Apr 27, 2017

@Trott
Copy link
Member

Trott commented Apr 27, 2017

@nodejs/testing

Trott
Trott previously requested changes Apr 27, 2017
Copy link
Member

@Trott Trott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a reasonable explanation as to why EMFILE is being raised, I'm not thrilled with the idea of adding it just to make the test pass. The other errors listed make sense to me based on the comment above, but EMFILE seems surprising. This could just be my own ignorance....

@refack
Copy link
Contributor Author

refack commented Apr 27, 2017

Sorry, I missed adding that reference: #3635 (comment)
P.S. The test passes most of the time without this change, so IMHO it's not forcing it to pass, but making is less flaky.

Copy link
Member

@gibfahn gibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc/ @nodejs/platform-windows

@@ -37,7 +37,8 @@ const server = net.createServer(function(s) {
// is still happening while the server has been closed.
s.on('error', function(err) {
if ((err.code !== 'ECONNRESET') &&
((err.code !== 'ECONNREFUSED'))) {
(err.code !== 'ECONNREFUSED') &&
(err.code !== 'EMFILE')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@refack could you put a brief two line explanation of why this error occurs in a comment above? When reading through tests mysteriously ignored errors are always a bit of code smell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack.
PTAL

@mscdex mscdex added the child_process Issues and PRs related to the child_process subsystem. label Apr 27, 2017
@refack
Copy link
Contributor Author

refack commented Apr 27, 2017

@Trott @gibfahn
I tried to document both the original purpose of this test, and the possible causes for the "EMFILE".
test/parallel/test-child-process-fork-regr-gh-2847.js has a very long history of being flaky.
As far as I can tell it's original purpose is to generate a very specific condition that pre #2847 used to cause a segfault, because this is based on asynchronous tasks, sometimes we just hit spurious connection errors.
It specific error was last seen on Windows, but I don't know if this is a Windows only scenario.

@refack
Copy link
Contributor Author

refack commented Apr 27, 2017

This exclusion would be a continuation to #5422
/cc @santigimeno

@Trott
Copy link
Member

Trott commented Apr 27, 2017

Let's see if a stress test can reproduce the error (or other errors) and if load on the machine matters at all. If the answer here is "move to sequential", I'd rather do that than have a bunch of exceptions in the code. Maybe we can even remove an exception or two if we go that route. Or not. Let's see...

Stress test on win10 with just one test running at a time:
https://ci.nodejs.org/job/node-stress-single-test/1180/nodes=win10/console

Stress test on win10 with 100 tests running at a time:
https://ci.nodejs.org/job/node-stress-single-test/1181/nodes=win10/console

@Trott
Copy link
Member

Trott commented Apr 28, 2017

Stress tests seem to show that EMFILE happens under load. Looks like moving the test to sequential may be the way to go therefore? Although I'd be interested in understanding why load causes EMFILE here. Maybe it's obvious once I look at the test, but I gotta log off right now....

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

Maybe windows is actually running out of file handles?
More probably high load causes the scheduler to be more aggressive, hence races.
Is there a way to test is sequentially but under external synthetic load?

IMHO if we want to just keep the original intention of the test (protect from regressing #2847) we could ignore all errors, except a SEGFAULT. If you think this test is useful for covering other cases, then it's worth the extra time...

@Trott
Copy link
Member

Trott commented Apr 28, 2017

Hope I'm not being presumptuous or anything, but on my own fork/branch: I removed all the guard code except the one that I"m sure is needed, moved to sequential, opened #12713 but labeled it in progress and started a CI stress test using the ALL label at https://ci.nodejs.org/job/node-stress-single-test/1182/ to see if any other guard code needs to be restored.

I'm in favor of having this be specific to that regression. I like my tests narrowly focused, especially ones designed for oddball regressions. :-D

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

Why not have them both, one sequential and unfiltered, and one parallel with the filters? All well documented of course.

@gibfahn
Copy link
Member

gibfahn commented Apr 28, 2017

Why not have them both, one sequential and unfiltered, and one parallel with the filters? All well documented of course.

Why would we? Moving a test to sequential just means it can't be run in parallel, which is a minor annoyance as it makes our test runs slightly slower. Duplicating the test seems unnecessary.

Is there a way to test is sequentially but under external synthetic load?

I don't think we have a built in way to do that, I normally just build Node in the background, seems to do the trick (not the Node I'm testing obviously 😁 ). Maybe running benchmarks at the same time would provide enough load?

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

@gibfahn sorry if I'm not RTFM but how do I trigger a stress test (like on the CI) locally?

@gibfahn
Copy link
Member

gibfahn commented Apr 28, 2017

@refack tools/test.py --repeat 1000 parallel will run the parallel test suite 1000 times.

Probably worth doing something like:

tools/test.py --repeat 1000 -p tap parallel/test-child-process-fork-regr-gh-2847 | tee out.tap

and then grepping for not ok.

You could use Powershell's Tee-Object on Windows.

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

Why would we? Moving a test to sequential just means it can't be run in parallel, which is a minor annoyance as it makes our test runs slightly slower. Duplicating the test seems unnecessary.

@gibfahn: @Trott's premise is that in sequential mode the exception exclusions are unnecessary (#12713). Although they are obviously needed in parallel mode. So having them tested in both modes tests this scenario under two different conditions. "Can't have too much testing"?
Obviously these reasoning should be well documented within the test.

P.S. @Trott you had a typo (sequential}/test-child-process-fork-regr-gh-2847) so I'm re-runing: https://ci.nodejs.org/job/node-stress-single-test/1186/

@gibfahn
Copy link
Member

gibfahn commented Apr 28, 2017

"Can't have too much testing"?

I disagree. Every test should be testing something useful, otherwise you end up with a huge backlog of tests, and it's not sure what a test is actually for or why it failed.

So having them tested in both modes tests this scenario under two different conditions.

But EMFILE is only thrown very occasionally, and by ignoring this error as well we're not actually testing the EMFILE exception. If we think we should see this EMFILE error under some conditions, then we should try to get a test that reliably reproduces it, and the test for that probably doesn't need to also test this regression.

@Trott
Copy link
Member

Trott commented Apr 28, 2017

@refack tools/test.py --repeat 1000 parallel will run the parallel test suite 1000 times.

To stress test a parallel test locally, I also use -j to specify how many tests to run simultaneously.

$ tools/test.py --repeat=92 -j 92 parallel/test-foo-bar

@Trott Trott dismissed their stale review April 28, 2017 14:53

Just got EMFILE in a sequential run, and reading the explanation linked above, it seems to be an unexpected if infrequent thing to happen in this test. Clearing my objection.

@Trott
Copy link
Member

Trott commented Apr 28, 2017

Argh, can't edit the above (I don't think) but I meant to say expected if infrequent rather than unexpected if infrequent.

@@ -35,9 +40,14 @@ const server = net.createServer(function(s) {

// Errors can happen if this connection
// is still happening while the server has been closed.
// https://github.com/nodejs/node/issues/3635#issuecomment-157714683
// If a previous request caused the worker __and__ the server to close
// a following "send" request could emit EMFILE, this is a sporius
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sporius -> spurious

@@ -35,9 +40,14 @@ const server = net.createServer(function(s) {

// Errors can happen if this connection
Copy link
Member

@gibfahn gibfahn Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:
- ECONNREFUSED or ECONNRESET errors can happen if this connection

@@ -35,9 +40,14 @@ const server = net.createServer(function(s) {

// Errors can happen if this connection
// is still happening while the server has been closed.
// https://github.com/nodejs/node/issues/3635#issuecomment-157714683
// If a previous request caused the worker __and__ the server to close
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- If a previous request...

Basically can it be really clear which explanation is for which errors?

Copy link
Member

@gibfahn gibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM modulo nits

@@ -1,3 +1,7 @@
/* Description:
* Before #2847 a child process trying (asynchronously) to use a the closed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2847 -> https://github.com/nodejs/node/pull/2847

a the -> a

@@ -1,3 +1,7 @@
/* Description:
* Before #2847 a child process trying (asynchronously) to use a the closed
* channel to it's creator, would have caused a segfault.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to connect to ?

@@ -1,3 +1,7 @@
/* Description:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need the Description:

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

@gibfahn, commands addressed PTAL.

Copy link
Member

@santigimeno santigimeno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one nit

// is still happening while the server has been closed.
// https://github.com/nodejs/node/issues/3635#issuecomment-157714683
// ECONNREFUSED or ECONNRESET errors can happen if this connection can
// happen if this connection is still happening while the server has closed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can happen if this connection is repeated twice

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM with a couple of nits

* Before https://github.com/nodejs/node/pull/2847 a child process
* trying (asynchronously) to use the closed channel to it's creator
* would have caused a segfault.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment block should use the typical // style.

@@ -6,6 +11,7 @@ const assert = require('assert');
const cluster = require('cluster');
const net = require('net');


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary whitespace change

@refack
Copy link
Contributor Author

refack commented Apr 28, 2017

Comments fixed, PTAL.

[joke]
You are all a bunch of nit-pickers
[/joke]
JK. Love you all 💌

Copy link
Member

@gibfahn gibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are all a bunch of nit-pickers

Speaking of which...

(I see it as a game, can you minimize the number of nits you'll get per PR)

@@ -1,3 +1,5 @@
// Before https://github.com/nodejs/node/pull/2847 a child process trying
// (asynchronously) to use the closed channel to it's creator caused a segfault.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's -> its

@@ -1,3 +1,5 @@
// Before https://github.com/nodejs/node/pull/2847 a child process trying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should be an issue, so:

https://github.com/nodejs/node/pull/2847->https://github.com/nodejs/node/issues/2847

Also can you order this as in the guide? Should look like:

'use strict';
const common = require('../common');

// Before https://github.com/nodejs/node/issues/2847 a child process trying
// (asynchronously) to use the closed channel to it's creator caused a segfault.
 
const assert = require('assert');

@refack refack force-pushed the more-for-10286 branch 3 times, most recently from ab5f12c to b372ff6 Compare April 28, 2017 23:48
According to the explanation in nodejs#3635#issuecomment-157714683
And as a continuation to nodejs#5422 we also ignore EMFILE
"No more file descriptors are available,so no more files can be opened"

PR-URL: nodejs#12698
Fixes: nodejs#10286
Refs: nodejs#3635 (comment)
Refs: nodejs#5178
Refs: nodejs#5179
Refs: nodejs#4005
Refs: nodejs#5121
Refs: nodejs#5422
Refs: nodejs#12621 (comment)
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@refack refack merged commit 6f21671 into nodejs:master May 19, 2017
@refack
Copy link
Contributor Author

refack commented May 19, 2017

Landed in 6f21671

@refack
Copy link
Contributor Author

refack commented May 19, 2017

@refack refack deleted the more-for-10286 branch May 19, 2017 19:33
@jasnell jasnell mentioned this pull request May 28, 2017
@gibfahn gibfahn mentioned this pull request Jun 15, 2017
3 tasks
MylesBorins pushed a commit that referenced this pull request Jul 17, 2017
According to the explanation in #3635#issuecomment-157714683
And as a continuation to #5422 we also ignore EMFILE
"No more file descriptors are available,so no more files can be opened"

PR-URL: #12698
Fixes: #10286
Refs: #3635 (comment)
Refs: #5178
Refs: #5179
Refs: #4005
Refs: #5121
Refs: #5422
Refs: #12621 (comment)
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@MylesBorins
Copy link
Contributor

I've landed this on v6.x... please lmk if this was a mistake

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intermittent parallel/test-http-regr-gh-2821 failure on V6.9.2 (Windows)
9 participants