ioredis does not detect dead connections #139

hcburger · 2015-08-27T16:58:36Z

I am using redis as a cache in a production environment. My redis-server is running on a different VM than my client running ioredis. My problem is that ioredis does not properly handle connections that are "lost" (by which I mean connections that are dead, but have not been properly closed). When this happens, the call-backs of all my .get, .set, etc... calls are only reached after about 11 minutes.

For example:

var Redis = require('ioredis');
var redis = new Redis();

redis.set('foo', 'bar');
// connection dies here.
redis.get('foo', function (err, result) {
  // Will only get here after ~11 minutes.
});

As you can imagine, this is unacceptable behavior for my cache as all the code accessing the cache will get stuck and all requests to my server will timeout.

The 'close', 'end' and 'error' events of ioredis are not called, because the tcp connection is not properly closed (no "FIN" tcp packets are sent). This is a common occurrence in a production environment and occurs in the following situations, among others:

the VM running redis-server is shut down
the NAT running between the client and the redis-server (e.g. a firewall or docker's native NATing) eliminates the connection because it has been inactive for some time.

This does not happen if you run the client and redis-server on the same machine and just kill (either SIGINT or SIGKILL) the redis-server, as the OS will perform some clean-up and properly close the tcp connections between the redis-server about to be turned off and all connected clients (the ioredis object receives a 'close' event).

Coming back to the first situation where the connection is dead because of e.g. a failure of the VM running redis-server. All ioredis get() requests hang. After about 11 minutes, the OS detects that the connection looks dead, and kills it (seems to be related to the TCP keepalive parameters). At that point, the 'close' event will indeed be fired, and the callbacks of the .get requests will be reached.

What I would like to be able to do is to set a parameter that tells ioredis to give up on the redis.get after a certain number of milliseconds and return an error if not successful. E.g.

var redis = new Redis({requestTimeout: 100});
redis.get('foo', function (err, result) {
  // Will get here after at most 100ms, even if the connection died.
});

This would allow me to quickly rely on a fall-back solution (avoid accessing the cache). However, as far as I can tell, this is currently not possible with ioredis.

luin · 2015-08-27T17:39:07Z

Haven't come across this issue before. I'm going to look into it the weekend. Currently you can set the timeout of a request by:

redis.get('foo').timeout(100).then(function() {}).catch(function() {});

Possibly related issue #61.

hcburger · 2015-08-28T09:48:24Z

I just used exactly what you suggested, and it works wonderfully. It's a really good solution to my problem.

May I suggest adding this to your Quick Start?

luin · 2015-08-28T09:53:16Z

redis.get('foo') returns a bluebird Promise. The timeout method is provided by bluebird, so I don't think we need to document it in ioredis's README. However I'd like to find a way to make timeout works with both promise and node-style callback.

AdriVanHoudt · 2015-08-28T13:09:32Z

having the same issue here where it just does nothing when it can't connect to the redis and I use callbacks so the timeout won't work for me 😞

luin · 2015-09-02T03:43:49Z

There are three parameters to control how the OS would detect a dead connection: tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes(http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html). Connections are considered as dead after there are tcp_keepalive_intvl * tcp_keepalive_probes seconds(defaults to 75s * 9 = 11min) not receiving the ACK response from the peer.

Obviously the default configuration are not reasonable for a Redis tcp connection, however, node.js only supports set the tcp_keepalive_time parameter(nodejs/node-v0.x-archive#4109).

I will add an option to ioredis to let users enable keepalive and specify tcp_keepalive_time. For the other two parameters, you can either set them at the system level, or just implement heartbeat on the client side(e.g. send PING command to the Redis every 2s, and close the client by redis.disconnect() when there's no response for a specified interval).

LeDominik · 2015-09-07T10:10:06Z

I'm using the promise-style and sadly it doesn't run into the catch() part if the connection is not possible (I simple quit redis). Working with bluebird's timeout helps, but this doesn't seem to be the right strategy as I cannot report the real reason...

luin · 2015-09-07T10:15:58Z

@LeDominik It may be because redis server didn't close the TCP connection before closing so the only way client can notice the dead connection would be the keepalive mechanism. Do the server and client run in the same machine? And how do you quit the redis?

LeDominik · 2015-09-07T10:19:16Z

@luin Well I did it the easy way. Same machine, OS X, switch tab, CTRL+C 😄

luin · 2015-09-07T10:22:01Z

@LeDominik It should because the client is keeping reconnecting to the Redis. Check out https://github.com/luin/ioredis#auto-reconnect

LeDominik · 2015-09-07T10:24:19Z

@luin so I basically need to modifiy the retry-policy to ensure that it gives up in a reasonable time like 1sec 😄

luin · 2015-09-07T10:25:41Z

@LeDominik Yup

LeDominik · 2015-09-07T11:34:01Z

Awesome, thanks @luin I went for 500msec

  retryStrategy: function (times) {
    if(times < 2) {
      console.warn('Redis: Connection retry # %d ', times);
      return 500;
    }
    return "";
  }

But this got me thinking: Basically it should keep retrying in the background to recover gracefully when redis / network is back up. Now once the connection is closed it's done and I can basically restart my little express app. It would be great to have a real per-operation timeout that would go into catch(err), tell me the real reason (no connection) and allow me to return a HTTP 500. Then the consumer of the API can choose whether to retry...

luin · 2015-09-07T12:10:31Z

Well, that makes sense. I'd like to add an option for this.

luin · 2015-09-09T08:53:05Z

Closing this issue since the keepAlive option is added in 1.8.0. Refer to #61 for any further updates about per-operation timeout.

luin closed this as completed Sep 9, 2015

xpiwo mentioned this issue Jan 31, 2016

how to detect dead subscribe connections? #240

Closed

sangelov mentioned this issue Jan 23, 2020

ETIMEDOUT error in Redis Sentinel, when Master loses network connectivity #1021

Closed

ZhangDianPeng mentioned this issue Jun 29, 2022

一个关于 redis 长链接的问题 #1615

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ioredis does not detect dead connections #139

ioredis does not detect dead connections #139

hcburger commented Aug 27, 2015

luin commented Aug 27, 2015

hcburger commented Aug 28, 2015

luin commented Aug 28, 2015

AdriVanHoudt commented Aug 28, 2015

luin commented Sep 2, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

luin commented Sep 9, 2015

ioredis does not detect dead connections #139

ioredis does not detect dead connections #139

Comments

hcburger commented Aug 27, 2015

luin commented Aug 27, 2015

hcburger commented Aug 28, 2015

luin commented Aug 28, 2015

AdriVanHoudt commented Aug 28, 2015

luin commented Sep 2, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

LeDominik commented Sep 7, 2015

luin commented Sep 7, 2015

luin commented Sep 9, 2015