
Seeing lot of moved errors when Routebylatency is enabled in ClusterClient #3023

Open
srikar-jilugu opened this issue Jun 10, 2024 · 22 comments


srikar-jilugu (Author) commented Jun 10, 2024

When we benchmarked our ElastiCache cluster (cluster mode enabled) with the RouteByLatency option enabled on go-redis v9.5.1, we saw an increase in average response time for our Redis operations (GET and pipelined commands). While debugging this, we added logging and found a large number of MOVED errors; the resulting retries increased overall latency.

moved, ask, addr = isMovedError(lastErr)

Debugging further, we observed that the slotClosestNode func returns a random node across all shards when every node is marked as failing.

return c.nodes.Random()

In our case this situation (all nodes failing) occurs frequently, which causes frequent MOVED errors.
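The behaviour we observed can be sketched with a self-contained model (the types and names below are illustrative stand-ins, not go-redis internals): when every node in the slot's shard is marked failing, the selector falls back to a random node from the whole cluster, which may belong to another shard and therefore answers with MOVED.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Illustrative stand-in for a cluster node.
type node struct {
	addr    string
	failing bool
	latency int // e.g. microseconds
}

// closestNode mimics latency-based selection: pick the lowest-latency
// healthy node for the slot; if all slot nodes are failing, fall back to
// a random node from the WHOLE cluster. That fallback is what produces
// the MOVED errors described above.
func closestNode(slotNodes, allNodes []*node) *node {
	var best *node
	for _, n := range slotNodes {
		if n.failing {
			continue
		}
		if best == nil || n.latency < best.latency {
			best = n
		}
	}
	if best != nil {
		return best
	}
	return allNodes[rand.Intn(len(allNodes))] // may belong to another shard
}

func main() {
	shardA := []*node{{"a1", true, 100}, {"a2", false, 90}, {"a3", false, 150}}
	fmt.Println(closestNode(shardA, shardA).addr) // a2: lowest-latency healthy node
}
```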

Expected Behavior

Response time should not increase when RouteByLatency is enabled; ideally it should decrease. MOVED errors should be rare once the client's cluster state is up to date.

Current Behavior

An increase in MOVED errors, and hence an increase in the number of GET commands hitting the engine (with the same traffic), in engine CPU utilisation across all nodes, and in overall latency.

Possible Solution

When all nodes are marked as failing, choosing a random node within the shard associated with the slot (even though those nodes are marked as failing) could solve this problem; this is what is already done when RouteRandomly is enabled.

Steps to Reproduce

  1. An ElastiCache cluster (we are using engine 7.1.0) with multiple shards and replicas for each (we used 2 shards with 3 nodes each)
  2. go-redis v9.5.1 with RouteByLatency enabled, throughput around 10-20k rpm with GET and pipelined GET
  3. Multiple ECS tasks (we are using 10) spread across multiple availability zones

Possible Implementation

We changed the slotClosestNode func to implement the fix described above; it did reduce the MOVED errors (and hence response time) when we benchmarked again.

This is the fix we made in our fork.

	for _, n := range nodes {
		if n.Failing() {
			continue
		}
		if node == nil || n.Latency() < node.Latency() {
			node = n
		}
	}

	if node != nil {
		return node, nil
	}

	// If all nodes are failing, return a random node from the nodes
	// corresponding to the slot.
	randomNodes := rand.Perm(len(nodes))
	return nodes[randomNodes[0]], nil

@srikar-jilugu (Author)

@vmihailenco, @monkey92t Can you please look into this issue?

monkey92t (Collaborator) commented Jun 13, 2024

I believe using nodes that are marked as failed is not a good idea. In most cases, such nodes are unusable, and attempting to use them will result in a predictable error.

@monkey92t (Collaborator)

In cluster mode, during a node failover, such situations may occur: the hash slots originally handled by Node A are transferred to Node B. However, the latest cluster state is not retrieved in real time and has some delay.

While redirecting commands to the new node might bring certain issues, such as network connections, it still ensures the execution of commands. If we use a crashed node, it might not yield any effective results, and returning an error directly might be better.

@monkey92t (Collaborator)

From the perspective of the Redis server, it expects many clients to perform similar operations; otherwise, the MOVED redirection would be meaningless. Redis does not guarantee a permanent relationship between nodes and hash slots: when nodes are added or removed, the mapping between hash slots and nodes may change, and Redis uses the MOVED response to inform the client that the hash-slot mapping within the server has changed. In normal cluster mode such situations rarely occur because, in the vast majority of cases, the nodes in a Redis cluster are stable, or replicas are used to avoid single-node failures.
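For context, a MOVED redirection arrives as a plain error string of the form `MOVED <slot> <host:port>`. A minimal parser sketch (go-redis does this internally in isMovedError; this is an illustration, not the library's code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseMoved extracts the slot and target address from a MOVED error
// string such as "MOVED 3999 127.0.0.1:6381".
func parseMoved(errMsg string) (slot, addr string, ok bool) {
	parts := strings.Fields(errMsg)
	if len(parts) != 3 || parts[0] != "MOVED" {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	slot, addr, ok := parseMoved("MOVED 3999 127.0.0.1:6381")
	fmt.Println(slot, addr, ok) // 3999 127.0.0.1:6381 true
}
```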

@srikar-jilugu (Author)

@monkey92t thanks for replying. We were seeing a lot of "all nodes failing" cases while benchmarking; this was due to a bug in v9.5.1 where nodes got marked as failed even on redis: nil errors (which is fixed in the latest release).

But in the situation where we got a lot of MOVED errors, there was no shard rebalancing (or node failover) at all; the only root cause was slotClosestNode picking nodes from other shards when all nodes are marked as failing. This leads the client to reload its state (even though it hasn't changed), come back to a node in the current shard (an extra hop), and execute queries on a node that wasn't picked in the first attempt because it was failing.

State reloading can be avoided if we choose a random node within the current shard itself; the process function would still avoid the node on the first retry if it is failing (in our case the retry succeeds because the failure marking is temporary).

On the other hand, slotRandomNode (when RouteRandomly is enabled) already chooses a random node from the same slot even when all nodes are failing.

	randomNodes := rand.Perm(len(nodes))
	for _, idx := range randomNodes {
		if node := nodes[idx]; !node.Failing() {
			return node, nil
		}
	}
	return nodes[randomNodes[0]], nil

@srikar-jilugu (Author)

Also, is it possible to add a config option to ignore node failure markings (i.e. still consider failing nodes when choosing the lowest-latency node)?

@monkey92t (Collaborator)

OK, I am trying to understand your point. During your testing, there were no changes to the Redis cluster structure, but the node responsible for a certain range of hash slots experienced a failure (or network issue). As a result, the node corresponding to those hash slots became inaccessible. When go-redis encounters an error while accessing that node, it marks the node as faulty and stops accessing it. However, since all nodes responsible for the hash slots in the cluster have failed, the commands are randomly sent to any available node. Because the cluster structure hasn't changed, the commands are then redirected (MOVE) back to the original faulty node, and go-redis attempts to send the commands to the faulty node again.
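The cycle described above can be sketched as a toy simulation (all names here are hypothetical): the command bounces between some other node and a MOVED redirect back to the still-failing slot owner until the owner recovers or the retry budget runs out.

```go
package main

import "fmt"

// redirectLoop models the cycle: while the slot owner is marked failing,
// the client sends the command to some other node; that node replies
// MOVED back to the owner, the client reloads state, and retries.
// It returns the number of MOVED round-trips incurred.
func redirectLoop(maxRetries int, ownerFailing func(attempt int) bool) int {
	moved := 0
	for attempt := 0; attempt < maxRetries; attempt++ {
		if !ownerFailing(attempt) {
			return moved // command finally lands on the owner
		}
		moved++ // wrong node replied MOVED; reload state and retry
	}
	return moved
}

func main() {
	// Owner recovers on the 3rd attempt: two MOVED round-trips first.
	n := redirectLoop(8, func(a int) bool { return a < 2 })
	fmt.Println(n) // 2
}
```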

@monkey92t (Collaborator)

Your solution might be effective, but it will bring about greater side effects. If we force the use of a node that has already failed, it becomes pointless. The reason we mark a node as faulty is to avoid using it until it recovers. If the Redis cluster is performing a normal failover, with a new node taking over the failed node, it will be difficult to discover the new node information without using MOVE, which is the purpose of Redis-server's MOVE response to the client. Additionally, when the node experiences a network failure, accessing it again will worsen an already unhealthy network state.

Go-redis does not know what is happening with Redis-server, whether it is a simple failure or an ongoing failover, so it is difficult to make certain changes. Perhaps we should carefully consider a better solution.

@monkey92t (Collaborator)

@srikar-jilugu You can try setting only the ReadOnly parameter; it will select nodes only within the nodes responsible for the hash slot and will not randomly select other nodes.

func (c *clusterState) slotSlaveNode(slot int) (*clusterNode, error) {
	nodes := c.slotNodes(slot)
	switch len(nodes) {
	case 0:
		return c.nodes.Random()
	case 1:
		return nodes[0], nil
	case 2:
		if slave := nodes[1]; !slave.Failing() {
			return slave, nil
		}
		return nodes[0], nil
	default:
		var slave *clusterNode
		for i := 0; i < 10; i++ {
			n := rand.Intn(len(nodes)-1) + 1
			slave = nodes[n]
			if !slave.Failing() {
				return slave, nil
			}
		}

		// All slaves are loading - use master.
		return nodes[0], nil
	}
}

In the above code, only when no corresponding node is found for the hash slot (regardless of whether the nodes are healthy), will a random node be selected. In all other cases, the node corresponding to the hash slot will be returned.
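As a usage sketch, opting in to ReadOnly alone (rather than RouteByLatency) would look roughly like this; the address is a placeholder and this is a configuration fragment, not a tested setup:

```go
package main

import (
	"github.com/redis/go-redis/v9"
)

func main() {
	// ReadOnly alone enables reads from replicas; reads for a slot are
	// served from that slot's own shard (see slotSlaveNode above) rather
	// than falling back to a random node across the whole cluster.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"cluster-endpoint:6379"}, // placeholder address
		ReadOnly: true,
		// RouteByLatency: true, // would re-enable closest-node selection
	})
	defer rdb.Close()
}
```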

@monkey92t (Collaborator)

But doing so deviates from your expected setup; it won't look for nodes that are closer to itself, it will only randomly select a node.

@srikar-jilugu (Author)

@monkey92t we are opting in for RouteByLatency so that requests are served from the closest nodes, which usually sit in the same AZ as our containers, reducing latency and network cost as much as possible. If we look at the code for slotRandomNode (i.e. when RouteRandomly is opted in): when all the nodes are failing, it selects a random node within the corresponding hash slot itself. We want the same behaviour for slotClosestNode.

srikar-jilugu (Author) commented Jun 14, 2024

OK, I am trying to understand your point. During your testing, there were no changes to the Redis cluster structure, but the node responsible for a certain range of hash slots experienced a failure (or network issue)

We weren't experiencing any node failure; we had a lot of cache misses when benchmarking with v9.5.1, which resulted in nodes being marked as failed:

func (c *ClusterClient) pipelineReadCmds(
	ctx context.Context,
	node *clusterNode,
	rd *proto.Reader,
	cmds []Cmder,
	failedCmds *cmdsMap,
) error {
	for i, cmd := range cmds {
		err := cmd.readReply(rd)
		cmd.SetErr(err)

		if err == nil {
			continue
		}

		if c.checkMovedErr(ctx, cmd, err, failedCmds) {
			continue
		}
		// here we are getting redis.Nil errors because of cache misses (fixed in v9.5.3)
		if c.opt.ReadOnly {
			node.MarkAsFailing()
		}

srikar-jilugu (Author) commented Jun 17, 2024

The reason we mark a node as faulty is to avoid using it until it recovers.

@monkey92t I agree that using a bad node can cause issues, but the library marks a node as failed whenever there is a bad connection, which can be caused by even a single intermittent context timeout (deadline exceeded error), leading to frequent failure markings. (Possibly the logic for when a node should be marked as failed needs to change?)
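The failure marking itself is essentially a timestamp with an expiry window. A hedged sketch of why one intermittent timeout sidelines a node (the 15-second window and the names below are assumptions for illustration, not confirmed go-redis constants):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// failingWindow is an assumed duration; one transient error hides the
// node from latency-based selection for this entire window.
const failingWindow = 15 * time.Second

type clusterNode struct {
	failedAt atomic.Int64 // unix nanos of last failure; 0 = healthy
}

// MarkAsFailing records the current time as the last failure.
func (n *clusterNode) MarkAsFailing() {
	n.failedAt.Store(time.Now().UnixNano())
}

// Failing reports whether the last failure is still within the window.
func (n *clusterNode) Failing() bool {
	t := n.failedAt.Load()
	return t != 0 && time.Since(time.Unix(0, t)) < failingWindow
}

func main() {
	var n clusterNode
	fmt.Println(n.Failing()) // false
	n.MarkAsFailing()        // e.g. a single context deadline exceeded
	fmt.Println(n.Failing()) // true for the rest of the window
}
```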

Also Is it possible to add a config to ignore failure of nodes? (i.e consider them while choosing lowest latency node)

Can you let me know if this can be done — a configurable option for RouteByLatency (or ReadOnly) so that we can ignore temporary failure markings like these?

srikar-jilugu (Author) commented Jun 17, 2024

However, since all nodes responsible for the hash slots in the cluster have failed, the commands are randomly sent to any available node.

We don't do this in any other node selection function: in slotRandomNode, slotSlaveNode, and slotMasterNode we choose either the master or a random node within the same slot when all nodes are failing.

@monkey92t (Collaborator)

I've recently been reflecting on the request flow in redis-cluster. It seems that we shouldn't randomly select another node to execute commands when all the nodes corresponding to a hash slot are down.

Similar to the ReadOnly parameter, when the nodes are offline, we still choose the node corresponding to the hash slot, rather than randomly selecting another node.

@monkey92t (Collaborator)

@vmihailenco Do you understand why we choose a random node?

@srikar-jilugu (Author)

@monkey92t @vmihailenco Is there a reason why RouteByLatency and RouteRandomly include the master node during selection? Would it be possible to select among the replicas only? (I think slotSlaveNode already covers this for random selection, but would the same be possible for latency-based selection?)

Use Case: We want to have RouteByLatency for our write heavy clients too, but the read requests should only be served by slaves

@srikar-jilugu (Author)

@monkey92t Any update on this issue? If we agree that random selection should not be done outside the hash slot, I will raise a PR for it.

@monkey92t (Collaborator)

@vmihailenco @ofekshenawa view?

@srikar-jilugu (Author)

@monkey92t @vmihailenco @ofekshenawa Are there any hindrances regarding this approach? Would love to get your feedback.

@monkey92t (Collaborator)

I am in favor of canceling the random selection of nodes because there seems to be no clear evidence that a migration has occurred in the Redis cluster; it is merely that the nodes are unavailable or there is a network failure. However, I don't understand why the initial implementation required randomly selecting from all nodes. We should use a default node, such as nodes[0], when all nodes responsible for the hash slot are down.


@srikar-jilugu (Author)

@monkey92t choosing the default node could overwhelm the master (nodes[0]) in case of temporary failures; instead, could we choose the shard's lowest-latency node when all nodes are failing? wdyt?
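The suggestion above — fall back to the shard's lowest-latency node rather than nodes[0] or a random cross-shard node — can be sketched as follows (types are illustrative, not go-redis internals):

```go
package main

import "fmt"

type node struct {
	addr    string
	failing bool
	latency int
}

// lowestLatency prefers healthy nodes, but when every node in the shard
// is failing it still returns the shard's lowest-latency node instead of
// the master (nodes[0]) or a random node from another shard.
func lowestLatency(nodes []*node) *node {
	var healthy, any *node
	for _, n := range nodes {
		if any == nil || n.latency < any.latency {
			any = n
		}
		if n.failing {
			continue
		}
		if healthy == nil || n.latency < healthy.latency {
			healthy = n
		}
	}
	if healthy != nil {
		return healthy
	}
	return any
}

func main() {
	shard := []*node{{"master", true, 200}, {"replica1", true, 80}, {"replica2", true, 120}}
	fmt.Println(lowestLatency(shard).addr) // replica1
}
```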
