-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stuck in 'connect' state with Cluster and Node 10.x #633
Comments
This is the most minimal repro I can get. Reproduced on Node 10.1.0 and 10.3.0 with Ioredis 3.2.2 with the following script, which establishes 3 cluster connections, burns CPU for a few seconds, then tries keys at intervals. It frequently stalls out under Node 10, but always succeeds under 9.11.1. I can't repro without a cluster.
|
It's slightly harder to get it to stall out with a single connection (dialing down nRedis to 1), but after a few runs of that on Node 10.3.0 and IORedis debug enabled, I get:
|
On a healthy run (this is with Node 10.3.0 as well, but just on a run where we don't see the intermittent failure), we see the cluster get into the ready state instead of stalling out in 'connect':
|
This is the most minimal simulation I have been able to generate for what we're seeing for the good and bad cases in our actual application in staging on Node 10 and Cluster 3.2.2. This may boil down to a Node bug or some kind of improper usage on our end, but it seems like understanding what's causing the cluster to get stuck in 'connect' as the cluster(info) command goes through and the redis goes to 'ready' is the beginning, and that's where I'm having some trouble. |
Just an update, I reproduced the issue with https://hub.docker.com/r/grokzen/redis-cluster/ and node 10.4.0 today, just copied over the script with
then from within the container did
which all succeeded as expected, then did
which broke on the first run with the expected pattern. That's with config changes for the Docker cluster, so redisTest.js is
|
Unsurprisingly (this was the most suspicious-looking change in the changelog for this), this traces back to nodejs/node@9b7a6914a7. If you build Node at the previous revision (nodejs/node@74553465e6), this test does not repro, at 9b7a6914a7f0bd754e78b42b48c75851cfd6b3c4 it does. It looks like there is some hubbub around 'close' events in the subsequent Node 10 changelogs, but concerning http, e.g. nodejs/node#20941. |
So you're suspecting that this is really a node issue and not an ioredis issue? Would probably to be good to log an issue there as well. Try to get some of those folks to re-review that commit. This issue is going to prevent folks using ioredis to upgrade to Node 10. Though perhaps there was some breaking change that ioredis needs to handle. |
Yes, I'm suspecting that Node isn't emitting 'close' under certain circumstances, and that that is a bug with that commit, but it'll probably take me a little while to boil it down to something that takes ioredis out of it. Currently we can work around this for the purposes of this test by handling 'end' rather than close
... but that seems very wrong, the 'close' handling seems more correct to me, and it looks as though that event should always be emitted. |
Ah, ok, if the socket is paused there may not be a 'close', that's why it's semver-major: |
... but at the point when we get the EOF (nodejs/node@9b7a6914a7#diff-e6ef024c3775d787c38487a6309e491dL645), the connection isPaused() returns false, so it's not that. The issue is that we get into onReadableStreamEnd with the socket's this.writable set to true, so the socket's maybeDestroy doesn't call socket.destroy and the socket never emits closed. It's a little hard to tell if that's intentional or not, from the PR and checkin, but it's looking more like 'not'. |
There is more on what exactly it looks like is happening with the socket at nodejs/node#19241 (comment) -- the long and short of it appears to be that if we hit the connect timeout and call .end before the socket is connected, the socket gets in a state that causes it to never call .destroy and thus never emit |
This will be fixed when this ships on Node 10: nodejs/node#21290. I'll leave this open until then in case anyone else sees it. |
Great investigation work! |
@ccs018 Thanks! The fix is out with Node 10.5.0, so I'll close this. |
I'm seeing ioredis 3.2.2 intermittently getting indefinitely stuck in the 'connect' state when run with Redis.Cluster under all versions of Node 10 up to the latest (10.3.0), and not under previous major versions (e.g. 9.11.1). I'm working on a minimal repro to illustrate this, but right now it looks as though the Node 10 stream is not emitting a 'close' event to trigger
eventHandler.closeHandler(_this)
under Node 10.I'm not yet sure why this happens, but I thought I'd give a heads-up as I look at it in case anyone else is seeing the same.
The text was updated successfully, but these errors were encountered: