Client is raising Redis::Cluster::NodeMightBeDown when Redis Cluster is observed to be healthy #368

Closed
slai11 opened this issue Aug 8, 2024 · 8 comments

@slai11
Contributor

slai11 commented Aug 8, 2024

Issue

A small but non-trivial percentage of Redis::Cluster::NodeMightBeDown errors is seen on some of my Sidekiq jobs. I understand that this error is raised in the find_node method after 3 retries, where @node.reload! is called on each retry in an attempt to repair the @topology.clients and @slots hashes.

Setup

For context, the affected Redis server is a 3-shard Redis Cluster. Looking at the observability metrics, we are fairly confident that the cluster state was healthy during the incident window (~2 hours). If the cluster state had been unhealthy, the impact would have been much more severe.

We also configure REDIS_CLIENT_MAX_STARTUP_SAMPLE=1.
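
For illustration, here is a minimal Ruby sketch of what that setting implies (the node list and the default value are made up; this is not the gem's internals): with a sample size of 1, a single node's CLUSTER NODES reply decides the client's entire view of the topology.

# Hypothetical sketch: the client samples this many nodes at startup to build
# its topology view. With a sample of 1, one node's reply decides the slot map.
node_configs = [
  { host: '10.0.0.1', port: 6379 },
  { host: '10.0.0.2', port: 6379 },
  { host: '10.0.0.3', port: 6379 }
]
max_sample = Integer(ENV.fetch('REDIS_CLIENT_MAX_STARTUP_SAMPLE', 1))
puts "Topology will be built from: #{node_configs.sample(max_sample).inspect}"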

Part of the stack trace:

 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:201:in `rescue in find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:198:in `find_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:159:in `assign_node'
 redis-cluster-client (0.8.2) lib/redis_client/cluster/router.rb:60:in `send_command'
 redis-cluster-client (0.8.2) lib/redis_client/cluster.rb:35:in `call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `block in call_v'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:104:in `handle_errors'
 redis-clustering (5.2.0) lib/redis/cluster/client.rb:85:in `call_v'
 redis (5.2.0) lib/redis.rb:152:in `block in send_command'
 redis (5.2.0) lib/redis.rb:151:in `synchronize'
 redis (5.2.0) lib/redis.rb:151:in `send_command'
 redis (5.2.0) lib/redis/commands/strings.rb:95:in `set'

I'm running the following gem versions:

  • redis-cluster-client on v0.8.2
  • redis-client on v0.22.2
  • redis on v5.2.0

Investigation details

We observed an increase in incoming new TCP connections on 1 of the 3 VMs running a master redis-server process. This would match the 3 @node.reload! retries, each of which opens a new connection to call CLUSTER NODES and closes it afterwards.
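
For reference, a rough sketch of what one such topology refresh looks like from the client side (the address is a placeholder and this is not the gem's reload code): each refresh opens a connection, issues CLUSTER NODES, and closes it, which shows up server-side as a new incoming TCP connection.

require 'redis-client'

# Placeholder node address; this only mimics a single short-lived refresh.
node = RedisClient.config(host: '10.0.0.1', port: 6379).new_client
begin
  reply = node.call('CLUSTER', 'NODES')
  puts "#{reply.lines.count} nodes known to this instance"
ensure
  node.close
end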

I've ruled out server-side network issues, since redis-client would raise a ConnectionError in that case. I verified this while attempting to reproduce the problem locally with a Redis Cluster setup configured with very low maxclients and tcp-backlog values; I ended up with RedisClient::Cluster::InitialSetupError when trying to reload the nodes.

I've been unable to reproduce this behaviour locally (I will update when I do). The client suggests that the server is down, but the server seems fine. Could there be a form of server response that leads to an incomplete client/slot map?

To the maintainers: in your experience developing this client library, is this behaviour (a small percentage of NodeMightBeDown against a seemingly healthy cluster) something that could have happened?

Linking issue for reference: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3715

@supercaracal supercaracal self-assigned this Aug 8, 2024
@supercaracal
Member

supercaracal commented Aug 9, 2024

Could there be a form of server response that could lead to an incomplete client/slot map?

I think the setting REDIS_CLIENT_MAX_STARTUP_SAMPLE=1 is a trade-off between server load and the reliability of the cluster state information. When a node temporarily has a stale state, our client may end up in a mess because of that node. That is why I configured the environment variable to a large value in our test cases that deal with unhealthy states.

def wait_for_cluster_to_be_ready(wait_attempts:)
  loop do
    break if wait_attempts <= 0 || @client.call('PING') == 'PONG'
  rescue ::RedisClient::Cluster::NodeMightBeDown
    # ignore
  ensure
    wait_attempts -= 1
    sleep WAIT_SEC
  end
end

To the maintainers, in your experience developing this client library, was this (small percentage of NodeMightBeDown with a seemingly healthy cluster) behaviour something that could have happened?

I have experienced flaky CI runs when nodes returned inconsistent responses to the CLUSTER NODES command, but it occurred almost exclusively during start-up phases.

Is it possible that something was happening on your cluster bus even though the state was healthy? Or there may be a bug in our client; I'll look into it later. Unfortunately, since I use redis gem v4 at work, I have no experience using the redis-cluster-client gem with a long-running cluster in a production environment.

@slai11
Contributor Author

slai11 commented Aug 12, 2024

We found the redis cluster service IP is the same as one of the pod IPs (lead node). When the request was sent to this pod, it was actually sent to the service, which might redirect to any of the pods.

Thanks for sharing the links. I found the first one during the initial stages of my investigation too, but it is not entirely relevant, as we deploy the Redis cluster on VMs with fixed DNS names (no chance of the service/pod IP confusion).

Was there possibility that something was being happened in your cluster bus even if the state was healthy?

It could be possible. For context, the cluster's cluster-node-timeout is set to 5000ms. The actual production Redis Cluster has 3 shards, each containing 1 primary and 2 secondaries, totalling 9 VMs and redis-server processes (apologies for the discrepancy above; I wanted to keep the issue description simpler). We observed that for 8 of the 9 redis-server processes, the rate of pings received and pongs sent dropped by ~12%, or 1/9. More details in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3714#note_2040420223 if you are curious.

Here is what I suspect could have happened for us:

  • During RedisClient::Cluster initialisation, there was a 1/3 chance of sending CLUSTER NODES to the affected master redis-server process, which returned a complete, non-error response containing an unknown number of fail? flags. Those rows get ignored when the CLUSTER NODES response is parsed (see the sketch below).
  • The resulting @topology.clients map is incomplete since the node info list is incomplete, and @slots would be incomplete as well.
  • Subsequent recovery through retrying was not successful, since a sample of 1 always selects the same instance to fetch an updated CLUSTER NODES state from.

Note: unfortunately we lost the metrics and logs for the affected VM, so some inference is involved here.
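
A rough sketch of the suspected parsing outcome (the CLUSTER NODES output below is fabricated and abbreviated; this is not the gem's actual parser):

# If rows carrying the fail? flag are dropped while parsing CLUSTER NODES, the
# node list, and therefore the derived slot map, is missing a shard even though
# the server-side cluster state is reported as healthy.
raw_reply = <<~NODES
  07c3 10.0.0.1:6379@16379 master - 0 0 1 connected 0-5460
  a9d1 10.0.0.2:6379@16379 master,fail? - 0 0 2 connected 5461-10922
  b2e4 10.0.0.3:6379@16379 master - 0 0 3 connected 10923-16383
NODES

healthy = raw_reply.lines.reject { |row| row.split[2].split(',').include?('fail?') }
puts healthy.size # => 2, so slots 5461-10922 have no owner in the client's view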


In any case, I think the lesson here is to sample more master nodes and be mindful of the trade-off since users could see a spike in Redis command traffic during deployments.

@KJTsanaktsidis
Contributor

def find_node(node_key, retry_count: 1)
  @node.find_by(node_key)
rescue ::RedisClient::Cluster::Node::ReloadNeeded
  raise ::RedisClient::Cluster::NodeMightBeDown if retry_count <= 0

  retry_count -= 1
  @node.reload!
  retry
end

I think this is the wrong place for the retry to happen; it actually needs to happen one level up, in assign_node:

def assign_node(command)
  node_key = find_node_key(command)
  find_node(node_key)
end

That way, after refreshing the cluster topology with @node.reload! (which asks the randomly-selected primaries for CLUSTER NODES and CLUSTER SLOTS), we re-run the slot -> node assignment so the client doesn't keep trying to send commands to the dead node.
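
A hedged sketch of what that could look like (assuming find_node were changed to simply re-raise ReloadNeeded instead of retrying internally; this is not the actual patch):

def assign_node(command, retry_count: 3)
  node_key = find_node_key(command) # re-resolve slot -> node on every attempt
  find_node(node_key)
rescue ::RedisClient::Cluster::Node::ReloadNeeded
  raise ::RedisClient::Cluster::NodeMightBeDown if retry_count <= 0

  retry_count -= 1
  @node.reload! # refresh topology so the next find_node_key can pick a live node
  retry
end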

@supercaracal
Member

Thank you for the report. That's definitely right; I think it's a bug. I'll fix it later.

@supercaracal
Member

@slai11 I've fixed the behavior related to this issue. One change mitigates the frequency of queries with the CLUSTER NODES command, and another improves recoverability from a cluster-down state.

@slai11
Contributor Author

slai11 commented Sep 27, 2024

@supercaracal thank you for the improvements! I think this issue can be closed for now (I'm not sure what your workflow is for that).

It is hard to reproduce the events of the incident in isolation to validate the fixes, but I'll update this issue if we encounter it again 👍

@supercaracal
Member

supercaracal commented Sep 27, 2024

Feel free to reopen this issue if it happens again.
