
Introduce nodeFilter Predicate to filter Partitions #1942

Closed
jhmartin opened this issue Dec 22, 2021 · 11 comments
Labels: type: feature (A new feature)
Milestone: 6.1.6

Comments

@jhmartin commented Dec 22, 2021

Bug Report

Current Behavior

When Lettuce is connected to a 2-node Redis Cluster (1 shard, 1 replica), is configured for REPLICA_PREFERRED, and the replica stops responding at the TCP level (such as after an ungraceful hardware failure), Lettuce does not recover until the TCP retry counter for that connection is exhausted (~926 seconds).

Input Code (assumes the hostname `redis` resolves to all nodes in the group)

import io.lettuce.core.cluster.*;
import io.lettuce.core.cluster.api.sync.*;
import io.lettuce.core.cluster.api.*;
import io.lettuce.core.*;
import java.util.concurrent.TimeUnit;
import java.time.Duration;

public class RedisExample {
    public static void main(String[] args) {
        RedisURI redisUri = RedisURI.Builder.redis("redis").build();

        RedisClusterClient clusterClient = RedisClusterClient.create(redisUri);

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions
            .builder()
            .enablePeriodicRefresh(60, TimeUnit.SECONDS)
            .enableAllAdaptiveRefreshTriggers()
            .dynamicRefreshSources(true)
            .closeStaleConnections(true)
            .build();

        TimeoutOptions timeoutOptions = TimeoutOptions
            .builder()
            .timeoutCommands()
            .fixedTimeout(Duration.ofMillis(400))
            .build();

        SocketOptions socketOptions = SocketOptions
            .builder()
            .connectTimeout(500, TimeUnit.MILLISECONDS)
            .build();

        clusterClient.setOptions(ClusterClientOptions
            .builder()
            .autoReconnect(true)
            .socketOptions(socketOptions)
            .cancelCommandsOnReconnectFailure(true)
            .timeoutOptions(timeoutOptions)
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
            .topologyRefreshOptions(topologyRefreshOptions)
            .validateClusterNodeMembership(true)
            .suspendReconnectOnProtocolFailure(true)
            .build());

        StatefulRedisClusterConnection<String, String> connection = clusterClient.connect();
        RedisAdvancedClusterCommands<String, String> syncCommands = connection.sync();
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

        // Write a known value, then poll it back once per second.
        String expected = "bar";
        syncCommands.set("foo", expected);

        while (true) {
            try {
                Thread.sleep(1000);
                String value = syncCommands.get("foo");
                if (!expected.equals(value)) {
                    System.out.println("bad response:" + value + ":" + expected + ":");
                } else {
                    System.out.println("Good response: " + value);
                }
            } catch (Exception e) {
                System.out.println("Error response: " + e);
            }
        }
    }
}

Expected behavior/code

When the timeout is reached and a dynamic topology refresh is triggered, connections to the node in "fail?" state should be considered stale and closed / abandoned.

Environment

  • Lettuce version(s): 6.1.5.RELEASE
  • Redis server v=6.2.6 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=3f28004270edf9dc
  • OpenJDK Runtime Environment (build 1.8.0_312-b07)
  • Amazon Linux 2, 5.10.75-79.358.amzn2.x86_64, t3a.small

Possible Solution

A workaround, undesirable because it applies globally to the client host rather than per connection, is to shorten the TCP retry counter:

echo 5 >/proc/sys/net/ipv4/tcp_retries2

At first glance it looks like adding a filter for failed/eventual_fail nodes at https://github.com/lettuce-io/lettuce-core/blob/cda3be6b9477da790365ad098c6e39c8687f5002/src/main/java/io/lettuce/core/cluster/topology/DefaultClusterTopologyRefresh.java#L292-L296 would cap the duration of the failure scenario at the periodic-topology-refresh interval.

Additional context

tcpdump clearly shows the client receiving a topology refresh, but the client does not recover until the existing TCP connection is torn down.

17:58:54.396570 IP 10.0.0.41.6379 > 10.0.0.218.46894: Flags [P.], seq 150:428, ack 117, win 490, options [nop,nop,TS val 3343612451 ecr 121987982], length 278: RESP "=270" "txt:e03b0b3b56ca33dc759fb6a122a903c7ac47d8f7 10.0.0.41:6379@16379 myself,master - 0 0 0 connected 0-16383" "215c649d39c0182c82aec8fc7e533cd57c052b9a 10.0.0.101:6379@16379 slave,fail? e03b0b3b56ca33dc759fb6a122a903c7ac47d8f7 1640109483742 1640109482738 0 connected"

Failure of the replica node was simulated by dropping all Redis packets on the replica:

$ for x in INPUT OUTPUT; do for y in 6379 16379; do iptables -I $x -p tcp --dport $y -j DROP; done; done

Redis.conf contains:

port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
maxmemory 1gb

TCP keepalives do not help here because the connection is not idle.

@mp911de (Collaborator) commented Jan 4, 2022

You can override RedisClusterClient.determinePartitions(…) to post-process which nodes are available in the topology view. The client obtains topology information from all cluster nodes and uses the topology view that makes the most sense. Since each Redis server can have a different perspective on the other nodes, there is no single source of truth; the client instead needs a consensus to select the most appropriate view. And the Redis servers' view may differ from the client's perspective.
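
For illustration, a minimal sketch of that subclassing approach (assuming a protected RedisClusterClient constructor taking ClientResources and RedisURIs, and the determinePartitions(Partitions, Map<RedisURI, Partitions>) signature; verify both against the Lettuce version in use):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.Partitions;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
import io.lettuce.core.resource.ClientResources;
import java.util.Map;

// Illustrative subclass (not part of Lettuce) that drops nodes flagged FAIL or
// EVENTUAL_FAIL from the consolidated topology view before it is applied.
class FailFilteringClusterClient extends RedisClusterClient {

    FailFilteringClusterClient(ClientResources resources, Iterable<RedisURI> uris) {
        super(resources, uris); // assumes this protected constructor is available
    }

    @Override
    protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {

        Partitions consensus = super.determinePartitions(current, topologyViews);

        // Rebuild the Partitions without nodes the cluster already marks as (possibly) failed.
        Partitions filtered = new Partitions();
        for (RedisClusterNode node : consensus) {
            if (!node.is(RedisClusterNode.NodeFlag.FAIL)
                    && !node.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)) {
                filtered.addPartition(node);
            }
        }
        filtered.updateCache();
        return filtered;
    }
}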

I'm not sure whether TCP keep-alive is helping in this case.

@jhmartin (Author) commented Jan 4, 2022

@mp911de Yeah I don't think TCP keep-alive helps either, but I wanted to call out that I had looked at it as a possibility.

@jhmartin (Author) commented Jan 4, 2022

I tested reworking https://github.com/lettuce-io/lettuce-core/blob/4110f2820766c4967951639aa2b6bdd9d50466be/src/main/java/io/lettuce/core/cluster/RedisClusterClient.java#L1025-L1032 to filter out FAIL and EVENTUAL_FAIL nodes and achieved the same recovery after the periodic-topology-refresh interval.

@mp911de (Collaborator) commented Jan 5, 2022

I wonder whether it generally makes sense to introduce a node filter (Predicate<RedisClusterNode>) that is used to filter Partitions. It's easier to provide a config value than to subclass the client.

@jhmartin (Author) commented Jan 5, 2022

I like the sound of that over a subclass; the behavior will look more baked-in. I did try calling removeIf against the Partitions collection with a predicate and it threw an unimplemented-operation error, so I guess that would have to be added as well.

@mp911de mp911de changed the title Cluster failover stalls until system TCP retry exhausted Introduce nodeFilter Predicate to filter Partitions Jan 7, 2022
@mp911de mp911de added this to the 6.1.6 milestone Jan 7, 2022
@mp911de mp911de closed this as completed Jan 7, 2022
@mp911de (Collaborator) commented Jan 7, 2022

That's in place now.

@jhmartin (Author) commented Jan 7, 2022

Adding

.nodeFilter(it -> !(it.is(RedisClusterNode.NodeFlag.FAIL) || it.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)))

gained me the desired behavior.
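
For completeness, a self-contained sketch of that configuration (the class name NodeFilterExample is hypothetical; the topology refresh, timeout, and socket options from the reproduction code above are omitted for brevity):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

public class NodeFilterExample {

    public static void main(String[] args) {

        RedisClusterClient clusterClient = RedisClusterClient.create(RedisURI.create("redis://redis"));

        // Exclude nodes flagged FAIL or EVENTUAL_FAIL from the Partitions the client
        // routes commands to, so a replica the cluster already considers failed is
        // dropped at the next topology refresh instead of after TCP retry exhaustion.
        clusterClient.setOptions(ClusterClientOptions
            .builder()
            .nodeFilter(it -> !(it.is(RedisClusterNode.NodeFlag.FAIL)
                    || it.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)))
            .build());
    }
}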

@srgsanky commented

Is this the default behavior in the latest Lettuce without explicitly specifying the node filter?

If the answer is no, what is the reason for not making it the default behavior?

@mp911de (Collaborator) commented Jul 12, 2023

You can specify a Predicate now. This new functionality allows additional filtering so that we do not break existing applications.

@srgsanky commented

Thanks for the explanation!

Can this be made the default in the next major version? It seems like a good default for most use cases. Folks run into this problem so frequently that there is even guidance on setting the predicate in AWS's documentation: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/BestPractices.Clients-lettuce.html#:~:text=Set%20nodeFilter,retrying%20is%20exhausted.

@mp911de (Collaborator) commented Jul 13, 2023

Thanks @srgsanky for the background. Can you file a new ticket to flip the default?
