Client is raising Redis::Cluster::NodeMightBeDown when Redis Cluster is observed to be healthy #368

Closed
slai11 opened this issue Aug 8, 2024 · 8 comments

@slai11
Contributor

slai11 commented Aug 8, 2024

Issue

A small but non-trivial percentage of Redis::Cluster::NodeMightBeDown errors is seen on some of my Sidekiq jobs. I understand that this error is raised in the find_node method after 3 retries, where @node.reload! is called on each retry in an attempt to repair the @topology.clients and @slots hashes.

Setup

For context, the affected Redis server is a 3-shard Redis Cluster. Looking at the observability metrics, we are fairly confident that the cluster state was healthy during the incident window (~2 hours). If the cluster state had been unhealthy, the impact would have been much more severe.

We also configure REDIS_CLIENT_MAX_STARTUP_SAMPLE=1.
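
For illustration, here is a minimal Ruby sketch of what that setting implies (the node list and the default value are made up; this is not the gem's internals): with a sample size of 1, a single node's CLUSTER NODES reply decides the client's entire view of the topology.

# Hypothetical sketch: the client samples this many nodes at startup to build
# its topology view. With a sample of 1, one node's reply decides the slot map.
node_configs = [
  { host: '10.0.0.1', port: 6379 },
  { host: '10.0.0.2', port: 6379 },
  { host: '10.0.0.3', port: 6379 }
]
max_sample = Integer(ENV.fetch('REDIS_CLIENT_MAX_STARTUP_SAMPLE', 1))
puts "Topology will be built from: #{node_configs.sample(max_sample).inspect}"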

Part of the stack trace:

 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:201:in `rescue in find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:198:in `find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:159:in `assign_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:60:in `send_command'
 redis-cluster-client (0.8.2) lib/redis_client/cluster.rb:35:in `call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `block in call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:104:in `handle_errors'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `call_v'
 redis (5.2.0) lib/redis.rb:152:in `block in send_command'
 redis (5.2.0) lib/redis.rb:151:in `synchronize'
 redis (5.2.0) lib/redis.rb:151:in `send_command'
 redis (5.2.0) lib/redis/commands/strings.rb:95:in `set'

I'm running the following gem versions:

  • redis-cluster-client on v0.8.2
  • redis-client on v0.22.2
  • redis on v5.2.0

Investigation details

We observed an increase in incoming new TCP connections on 1 of the 3 VMs running a master redis-server process. This would match the 3 @node.reload! retries, each of which opens a new connection to call CLUSTER NODES and closes it afterwards.
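
For reference, a rough sketch of what one such topology refresh looks like from the client side (the address is a placeholder and this is not the gem's reload code): each refresh opens a connection, issues CLUSTER NODES, and closes it, which shows up server-side as a new incoming TCP connection.

require 'redis-client'

# Placeholder node address; this only mimics a single short-lived refresh.
node = RedisClient.config(host: '10.0.0.1', port: 6379).new_client
begin
  reply = node.call('CLUSTER', 'NODES')
  puts "#{reply.lines.count} nodes known to this instance"
ensure
  node.close
end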

I've ruled out server-side network issues, since redis-client would raise a ConnectionError in that case. I verified this while attempting to reproduce the problem locally with a Redis Cluster setup configured with very low maxclients and tcp-backlog values; I ended up with RedisClient::Cluster::InitialSetupError when trying to reload the nodes.

I've been unable to reproduce this behaviour locally (I will update when I do). The client suggests that the server is down, but the server seems fine. Could there be a form of server response that leads to an incomplete client/slot map?

To the maintainers: in your experience developing this client library, is this behaviour (a small percentage of NodeMightBeDown against a seemingly healthy cluster) something that could have happened?

Linking issue for reference: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3715

@supercaracal supercaracal self-assigned this Aug 8, 2024
@supercaracal
Member

supercaracal commented Aug 9, 2024

Could there be a form of server response that could lead to an incomplete client/slot map?

I think the setting REDIS_CLIENT_MAX_STARTUP_SAMPLE=1 is a trade-off between server load and the reliability of the cluster state information. When a node temporarily has a stale state, our client may end up in a mess because of that node. That is why I configured the environment variable to a large value in our test cases that deal with unhealthy states.

def wait_for_cluster_to_be_ready(wait_attempts:)
  loop do
    break if wait_attempts <= 0 || @client.call('PING') == 'PONG'
  rescue ::RedisClient::Cluster::NodeMightBeDown
    # ignore
  ensure
    wait_attempts -= 1
    sleep WAIT_SEC
  end
end

To the maintainers, in your experience developing this client library, was this (small percentage of NodeMightBeDown with a seemingly healthy cluster) behaviour something that could have happened?

I have experienced flaky CI runs when nodes returned inconsistent responses to the CLUSTER NODES command, but it occurred almost exclusively during start-up phases.

Is it possible that something was happening on your cluster bus even though the state was healthy? Or there may be a bug in our client; I'll look into it later. Unfortunately, since I use redis gem v4 at work, I have no experience using the redis-cluster-client gem with a long-running cluster in a production environment.

@slai11
Contributor Author

slai11 commented Aug 12, 2024

We found the redis cluster service IP is the same as one of the pod IPs (lead node). When the request was sent to this pod, it was actually sent to the service, which might redirect to any of the pods.

Thanks for sharing the links. I found the first one during the initial stages of my investigation too, but it is not entirely relevant, as we deploy the Redis cluster on VMs with fixed DNS names (no chance of the service/pod IP confusion).

Was there possibility that something was being happened in your cluster bus even if the state was healthy?

It could be possible. For context, the cluster's cluster-node-timeout is set to 5000ms. The actual production Redis Cluster has 3 shards, each containing 1 primary and 2 secondaries, totalling 9 VMs and redis-server processes (apologies for the discrepancy above; I wanted to keep the issue description simpler). We observed that for 8 of the 9 redis-server processes, the rate of pings received and pongs sent dropped by ~12%, or 1/9. More details in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3714#note_2040420223 if you are curious.

Here is what I suspect could have happened for us:

  • During RedisClient::Cluster initialisation, there was a 1/3 chance of sending CLUSTER NODES to the affected master redis-server process, which returned a complete, non-error response containing an unknown number of fail? flags. Those rows get ignored when the CLUSTER NODES response is parsed (see the sketch below).
  • The resulting @topology.clients map is incomplete since the node info list is incomplete, and @slots would be incomplete as well.
  • Subsequent recovery through retrying was not successful, since a sample of 1 always selects the same instance to fetch an updated CLUSTER NODES state from.

Note: unfortunately we lost the metrics and logs for the affected VM, so some inference is involved here.
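
A rough sketch of the suspected parsing outcome (the CLUSTER NODES output below is fabricated and abbreviated; this is not the gem's actual parser):

# If rows carrying the fail? flag are dropped while parsing CLUSTER NODES, the
# node list, and therefore the derived slot map, is missing a shard even though
# the server-side cluster state is reported as healthy.
raw_reply = <<~NODES
  07c3 10.0.0.1:6379@16379 master - 0 0 1 connected 0-5460
  a9d1 10.0.0.2:6379@16379 master,fail? - 0 0 2 connected 5461-10922
  b2e4 10.0.0.3:6379@16379 master - 0 0 3 connected 10923-16383
NODES

healthy = raw_reply.lines.reject { |row| row.split[2].split(',').include?('fail?') }
puts healthy.size # => 2, so slots 5461-10922 have no owner in the client's view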


In any case, I think the lesson here is to sample more master nodes and be mindful of the trade-off since users could see a spike in Redis command traffic during deployments.

@KJTsanaktsidis
Contributor

def find_node(node_key, retry_count: 1)
  @node.find_by(node_key)
rescue ::RedisClient::Cluster::Node::ReloadNeeded
  raise ::RedisClient::Cluster::NodeMightBeDown if retry_count <= 0

  retry_count -= 1
  @node.reload!
  retry
end

I think this is the wrong place for the retry to happen; it actually needs to happen one level up, in assign_node:

def assign_node(command)
  node_key = find_node_key(command)
  find_node(node_key)
end

That way, after refreshing the cluster topology with @node.reload! (which asks the randomly-selected primaries for CLUSTER NODES and CLUSTER SLOTS), we re-run the slot -> node assignment so the client doesn't keep trying to send commands to the dead node.
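
A hedged sketch of what that could look like (assuming find_node were changed to simply re-raise ReloadNeeded instead of retrying internally; this is not the actual patch):

def assign_node(command, retry_count: 3)
  node_key = find_node_key(command) # re-resolve slot -> node on every attempt
  find_node(node_key)
rescue ::RedisClient::Cluster::Node::ReloadNeeded
  raise ::RedisClient::Cluster::NodeMightBeDown if retry_count <= 0

  retry_count -= 1
  @node.reload! # refresh topology so the next find_node_key can pick a live node
  retry
end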

@supercaracal
Member

Thank you for the report. That's definitely right; I think it's a bug. I'll fix it later.

@supercaracal
Member

@slai11 I've fixed the behavior related to this issue. One change mitigates the frequency of queries with the CLUSTER NODES command, and another improves recoverability from a cluster-down state.

@slai11
Contributor Author

slai11 commented Sep 27, 2024

@supercaracal thank you for the improvements! I think this issue can be closed for now (I'm not sure what your workflow is for that).

It is hard to reproduce the events of the incident in isolation to validate the fixes, but I'll update this issue if we encounter it again 👍

@supercaracal
Member

supercaracal commented Sep 27, 2024

Feel free to reopen this issue if it happens again.
