-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster crash in main process, callback undefined when handling internal messages #3072
Comments
More details: This crash has happened twice yesterday and three times today, on the main server of Pokémon Showdown, which gets approximately 100,000 users/day and 10,000 concurrent users at peak. The crashes don't happen at peak, though, they seem to happen randomly (I don't know if there's a DoS attacker involved). On the other hand, our smaller server (which gets closer to 1000 users/day and 100 concurrent users) hasn't crashed a single time, so this is most likely an issue that only occurs at scale. The architecture uses |
/cc @bnoordhuis, blame suggests 41b75ca as a starting point for this |
If it works with v2.3.3, then it's unlikely that 41b75ca is the culprit, that dates back to v0.11. I don't spot any glaring bugs in lib/cluster.js. If you have a test case or some way for me to reproduce, that would certainly help. |
Needless to say, I've not seen this crash outside of production, where the server with 10k concurrent users had five crashes at seemingly-random times, while the server with 100 concurrent users did not crash at all. I don't have any more information to offer you, although I would be willing to add whatever debug code you'd like to the production server and wait for it to crash again. |
Maybe this? diff --git a/lib/cluster.js b/lib/cluster.js
index 602cc8d..bc1bfa1 100644
--- a/lib/cluster.js
+++ b/lib/cluster.js
@@ -690,7 +690,7 @@ var callbacks = {};
function sendHelper(proc, message, handle, cb) {
// Mark message as internal. See INTERNAL_PREFIX in lib/child_process.js
message = util._extend({ cmd: 'NODE_CLUSTER' }, message);
- if (cb) callbacks[seq] = cb;
+ if (cb) callbacks[seq] = cb, console.log('insert', seq);
message.seq = seq;
seq += 1;
proc.send(message, handle);
@@ -706,7 +706,9 @@ function internal(worker, cb) {
if (message.ack !== undefined) {
fn = callbacks[message.ack];
delete callbacks[message.ack];
+ console.log('delete', message.ack);
}
+ if (typeof fn !== 'function') console.trace(arguments);
fn.apply(worker, arguments);
};
} |
stdout:
(I have a crashguard running) stderr:
|
/cc @indutny - see the assert in src/stream_base.h. Sounds like cluster is just surfacing it. |
The TypeError and the assert in stream_base seem new. I'm using I'm still willing to add whatever other debug code you'd like to the production server and wait for it to crash again. |
@bnoordhuis this assertion can happen if cluster is emitting the same handle object twice, and the JS error, is rather odd too. |
True, but this change is not that major too. It looks like there is a socket created by a cluster, that does not have a handle on it. Before Rephrasing this a bit, I may be wrong, but it looks like 291b310 just revealed existing problem, not introduced a new one. |
I don't want to sound impatient or anything, but this bug is forcing me to stay on a really old version of iojs (even iojs 3.x crashes, although it's a different crash). I'm still willing to add whatever other debug code you'd like to the production server and wait for it to crash again. |
@bnoordhuis perhaps there might be a reason to investigate it further? |
Most certainly. How though? |
@bnoordhuis oops, didn't see your SCHED_POLICY comment. |
Using Crashed, after 18 hours (way more uptime than before, but I don't think that's relevant).
|
Just in case it's helpful - I've had this error in node versions 4.11, 4.12, 4.20 |
Got a similar (though different) error yesterday. Not sure if related. Please see: #3395 |
@bnoordhuis I've found a way to provoke this error consistently - Just start two instances of a node app on the same port (an app using cluster). Along with the expected EADDRINUSE error, this error appears. |
Haven't forgotten about this issue, it's next on my list. I don't think #3677 addresses this but it wouldn't hurt to try it if you have some spare cycles. |
I tried 5.1.0 (I assume that one has #3677) in it, and I also tried spawning more cluster processes (4 instead of 1, so ~3000 connections/process instead of ~12000), and it's still crashing for me. It managed to last multiple days of uptime instead of multiple hours, though! I think the process count had something to do with it. |
This worries me: https://nodejs.org/en/blog/vulnerability/cve-2015-8027_cve-2015-6764/ Is it possible to release a 2.x release that fixes this vulnerability? Newer versions crash Pokémon Showdown, and older versions are incompatible. |
@Zarel The v2.x release branch is abandoned. I'll be happy to point you to the commits after the v4.x / v5.x release so you can try to back-port them but that's where the supports ends. |
I was afraid I'd have to backport it... yes, that's probably the best solution in that case. |
Ping me after the release and I'll link you to the commits. |
@petermp The file descriptor limits on our servers are set to 200000 (hard and soft), and we are still regularly encountering this bug, so I don't think this is the cause. Perhaps a lower limit increases the chances of it happening. |
@bgSosh Yes, I recognized that. My encounter is one of the scenarios. I think the statement leading to the crash indicates the error leading to '_externalStream' crash. In my case, it was: _http_server.js:340 I remember running into the same type of crash with different leading error. |
getting quite a few of these as well
Not running with cluster module but am running pm2 in cluster mode. |
sometime have too this error on hi-load nodes with http and net
|
just in case I'll say that I stumbled upon the error:
while was emitting:
and the reason was circular data in variable "data". I found it by trying JSON.stringify() resulted in error. Fixed the problem by removing circular data. |
That's interesting. If you have a standalone test case (one that doesn't depend on third-party modules like socket.io), I'll be happy to take a look. |
it's a big project, I can't. Really sorry about that.
data here is a socket. Either Websocket (library "ws") or Socket (from callback of net.createServer()) |
I think this error is happening if you use worker.send with big messages under high load, at least it is gone since we stoped using the messaging part of cluster. |
I'm experiencing this crash reliably when opening thousands of connections to a Faye server: https://faye.jcoglan.com/ Edit: I run two faye servers combined with pm2. Edit2:
|
Same here on node v7.4.0 during highload tests with 24K connections
|
@nekrasoft Does it also happen with the latest v7? |
@bnoordhuis unfortunately, yes Exactly the same error,
|
I am not sure if this is the same bug, but hope it's gonna help somehow. I have tested it on "xnu-3789.51.2~3/RELEASE_X86_64" with node v6.10.2 and v7.9.0 with the same result. Stack trace:
Here is a code: Test With just one request from testClient.js it works fine. |
@Cmato I've been looking to your test case. The problem is that the undefined handle is indeed closed when sent to the worker. The reason may not seem evident but it's happening because how the the passing of handles is implemented: when sending multiple handles from one process to another, these are sent sequentially, so until a handle is not successfully sent to the target process (and this means sending the handle and receiving an acknowledgement), the next handle is not sent. In your test case, when the time comes for a specific handle (never the first, usually 3rd or the 4th) to be sent to the worker, it has already been closed thus the undefined handle. So in your code, in the message handler, you should take that case into account and check that the handle is indeed defined before trying to access to it. |
@santigimeno, it seems like the error message is still wrong in this case. If it's possible for a connection to have an undefined handle, it seems like something should be testing for that before it gets to the crash there. |
@Zarel @santigimeno a fix for the problem is here. I'm unsure about it because while it reduces crashes, it also makes failures more silent. The reason, as @santigimeno said, is that the connections are closing while waiting in the queue to be sent to the worker. That commit, combined with exposing |
@cjihrig Yes, I've already been running something similar on my server for years, but with more diagnostic information. I've been manually merging it into each version of Node since 4.1. Specifically: diff --git a/lib/_http_server.js b/lib/_http_server.js
index 46d0dbc..4fafa0f 100644
--- a/lib/_http_server.js
+++ b/lib/_http_server.js
@@ -321,11 +321,15 @@ function connectionListener(socket) {
socket.on = socketOnWrap;
// We only consume the socket if it has never been consumed before.
- var external = socket._handle._externalStream;
- if (!socket._handle._consumed && external) {
- parser._consumed = true;
- socket._handle._consumed = true;
- parser.consume(external);
+ if (socket._handle) {
+ var external = socket._handle._externalStream;
+ if (!socket._handle._consumed && external) {
+ parser._consumed = true;
+ socket._handle._consumed = true;
+ parser.consume(external);
+ }
+ } else {
+ console.trace("socket._handle was " + socket._handle);
}
parser[kOnExecute] =
onParserExecute.bind(undefined, this, socket, parser, state);
diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index da6ed28..79283ee 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -473,10 +473,14 @@ function setupChannel(target, channel) {
// There will be at most one NODE_HANDLE message in every chunk we
// read because SCM_RIGHTS messages don't get coalesced. Make sure
// that we deliver the handle with the right message however.
- if (message && message.cmd === 'NODE_HANDLE')
+ if (message && message.cmd === 'NODE_HANDLE') {
+ if (!recvHandle) {
+ console.error(new Error('recvHandle === ' + recvHandle).stack);
+ }
handleMessage(target, message, recvHandle);
- else
+ } else {
handleMessage(target, message, undefined);
+ }
}
jsonBuffer = incompleteChunk;
this.buffering = jsonBuffer.length !== 0;
diff --git a/lib/internal/cluster/utils.js b/lib/internal/cluster/utils.js
index ba72ff9..50b1c93 100644
--- a/lib/internal/cluster/utils.js
+++ b/lib/internal/cluster/utils.js
@@ -39,6 +39,10 @@ function internal(worker, cb) {
delete callbacks[message.ack];
}
- fn.apply(worker, arguments);
+ if (fn) {
+ fn.apply(worker, arguments);
+ } else {
+ console.error("Duplicate ack for " + JSON.stringify(message));
+ }
};
} This doesn't actually fix the bug, though; it only prevents it from crashing the server. |
@cjihrig Why is a fix needed? Wouldn't checking the existence of the handle on reception suffice? |
@santigimeno yea, it does suffice. It still seems like buggy behavior though. |
Can somebody check if #13235 fixes the issue? |
Hi, i have this error (like errors described above):
Im doing a stress test to my node app, trying to open 100K concurrent connections (using external clients from another servers) but this error occurs with 60K connection and my app crash. Im using cluster module and use fork method. The super weird thing is... this mistake started today ... What can i do? Help. |
Try to increase the max number of file descriptors in your system to higher than 60K to see if it goes away. |
@petermp Thanks!, thats works. |
I'm fairly sure #13235 fixed this. I'll close out the issue. |
🎉 I haven't encountered this error in a while. I'm glad it's finally solved. |
Found this issue in production on 4.1.1:
I am unable to create a reproducible test case, and we can only get this error to occur at scale. Works in v2.3.3, the previous version we were using.
My take on this: if you look at the error line, this can only happen if
cb
was undefined, or if there is no entry incallbacks
formessage.ack
. The first case is not possible ascb
is alwaysonmessage
. So that leaves the second case. One idea is that a message somehow gets duplicated (maybe a bug in libuv?), which would guarantee that there be no entry incallbacks
as line 708 would delete the entry the first time the message was received.The text was updated successfully, but these errors were encountered: