Cross cluster failover (#3310)
* Created Cross-Cluster Failover Capability within UnifiedJedis

- Created new CircuitBreakerCommandExecutor to leverage new retry and circuit breaker capability for failover
- Created new MultiClusterJedisClientConfig to encapsulate resilience4j configurations
- Created new MultiClusterPooledConnectionProvider to encapsulate multi-cluster management and operational capabilities
- Created new JedisValidationException
- Added new constructor to UnifiedJedis
- Added resilience4j to pom.xml

* Thread Safety Updates

- Changed CircuitBreakerCommandExecutor to be more thread safe by passing cluster by reference instead of multiple lookups
- Exposed MultiClusterPooledConnectionProvider.Cluster as public so it can be accessed and passed by reference within CircuitBreakerCommandExecutor
- Made some javadocs updates for easier readability
- Removed debug from happy path so it has parity with other executors. It would likely be too busy on the logs in a production system anyway

* Updated logging for clarity and consistency

- Moved log responsibility into provider for consistency
- Added logging for manual failback/failover with consistent wording to the automated failover
- Provided a better log for when the prioritized list is exhausted

* Changes to synchronization logic for activeMultiClusterIndex mutations

- Added more safety for orchestration within mutation operations on the activeMultiClusterIndex to avoid edge cases. In practice this is unlikely to come up, but it is better to be extra careful to avoid a deadlock or inaccurate transitions

* Changed resilience4j dependencies to optional

* Fix to avoid Nullpointer in the event that all connections are unavailable

- Moved increment below a validation so subsequent calls to lookup the cluster connection do not throw a nullpointer exception
- Replaced custom connection close logic with try-with-resources statements

* Handled graceful failure for scenario in which failover is no longer possible

- Added logic to fallback method that handles subsequent calls after all failover attempts have been exhausted and only a manual failback can resume operations
- Added new flag to indicate that all attempts to failover have been exhausted
- Changed comments to clarify that an endpoint can belong to a cluster but also a database so it is more OSS friendly
- Added logic to the manual failback method to allow an existing cluster to reattempt to connect to its current cluster/database in case it's the only option that became available

* Updated exception message for clarity

* Added Cluster Failover Post Processor

- Users can now configure their custom logic to persist the activeMultiClusterIndex or custom logging after a successful cluster failover via a functional interface

* Changed ClusterFailoverPostProcessor parameter from index to Circuitbreaker name

* Add failover docs

* Apply suggestions from code review

Co-authored-by: Kyle Banker <banker@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Kyle Banker <banker@users.noreply.github.com>

* Added Unit Tests for MultiClusterPooledConnectionProvider

* Address UnifiedJedis regression

* Fix for MultiClusterPooledConnectionProviderTest unit test

* Added data cleanup to MultiClusterPooledConnectionProviderTest unit test

* Updated MultiClusterPooledConnectionProvider to force a JedisConnectionException

* Simplify the README's failover docs

* Quick Fix on MultiClusterPooledConnectionProviderTest

* Changed exception message format for MultiClusterPooledConnectionProvider validateTargetConnection

* Forward to GitHub Discussions

* Changed name of the class MultiClusterJedisClientConfig => MultiClusterClientConfig

* Address class renamings in doc

* Removed remaining traces of jedis nomenclature from MultiClusterJedisClientConfig

* Changed ClusterClientConfig to ClusterConfig

* Address inner class rename in doc

---------

Co-authored-by: Allen Terleto <allen@redislabs.com>
Co-authored-by: Kyle Banker <banker@users.noreply.github.com>
Co-authored-by: M Sazzadul Hoque <7600764+sazzad16@users.noreply.github.com>
4 people authored May 26, 2023
1 parent 125ee24 commit 967cceb
Showing 9 changed files with 1,171 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -108,6 +108,15 @@ Now you can use the `JedisCluster` instance and send commands like you would wit
```java
jedis.sadd("planets", "Mars");
```

## Failover

Jedis supports retry and failover for your Redis deployments. This is useful when:

1. You have more than one Redis deployment. This might include two independent Redis servers or two or more Redis databases replicated across multiple [active-active Redis Enterprise](https://docs.redis.com/latest/rs/databases/active-active/) clusters.
2. You want your application to connect to one deployment at a time and to fail over to the next available deployment if the first deployment becomes unavailable.

For the complete failover configuration options and examples, see the [Jedis failover docs](docs/failover.md).

## Documentation

The [Jedis wiki](http://github.com/redis/jedis/wiki) contains several useful articles for using Jedis.
225 changes: 225 additions & 0 deletions docs/failover.md
@@ -0,0 +1,225 @@
# Failover with Jedis

Jedis supports failover for your Redis deployments. This is useful when:
1. You have more than one Redis deployment. This might include two independent Redis servers or two or more Redis databases replicated across multiple [active-active Redis Enterprise](https://docs.redis.com/latest/rs/databases/active-active/) clusters.
2. You want your application to connect to and use one deployment at a time.
3. You want your application to fail over to the next available deployment if the current deployment becomes unavailable.

Jedis will fail over to a subsequent Redis deployment after reaching a configurable failure threshold.
This failure threshold is implemented using a [circuit breaker pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).

You can also configure Jedis to retry failed calls to Redis.
Once the maximum number of retries has been exhausted, the circuit breaker will record a failure.
When the circuit breaker reaches its failure threshold, a failover will be triggered on the subsequent operation.

The remainder of this guide describes:

* A basic failover configuration
* Supported retry and circuit breaker settings
* Failback and the cluster selection API

We recommend that you read this guide carefully and understand the configuration settings before enabling Jedis failover
in production.

## Basic usage

To configure Jedis for failover, you specify an ordered list of Redis databases.
By default, Jedis will connect to the first Redis database in the list. If the first database becomes unavailable,
Jedis will attempt to connect to the next database in the list, and so on.

Suppose you run two Redis deployments.
We'll call them `redis-east` and `redis-west`.
You want your application to first connect to `redis-east`.
If `redis-east` becomes unavailable, you want your application to connect to `redis-west`.

Let's look at one way of configuring Jedis for this scenario.

First, create an array of `ClusterConfig` objects, one for each Redis database.

```java
JedisClientConfig config = DefaultJedisClientConfig.builder().user("cache").password("secret").build();

ClusterConfig[] clientConfigs = new ClusterConfig[2];
clientConfigs[0] = new ClusterConfig(new HostAndPort("redis-east.example.com", 14000), config);
clientConfigs[1] = new ClusterConfig(new HostAndPort("redis-west.example.com", 14000), config);
```

The configuration above represents your two Redis deployments: `redis-east` and `redis-west`.
You'll use this array of configuration objects to create a connection provider that supports failover.

Use the `MultiClusterClientConfig` builder to set your preferred retry and failover configuration, passing in the client configs you just created.
Then build a `MultiClusterPooledConnectionProvider`.

```java
MultiClusterClientConfig.Builder builder = new MultiClusterClientConfig.Builder(clientConfigs);
builder.circuitBreakerSlidingWindowSize(10);
builder.circuitBreakerSlidingWindowMinCalls(1);
builder.circuitBreakerFailureRateThreshold(50.0f);

MultiClusterPooledConnectionProvider provider = new MultiClusterPooledConnectionProvider(builder.build());
```

Internally, the connection provider uses a [highly configurable circuit breaker and retry implementation](https://resilience4j.readme.io/docs/circuitbreaker) to determine when to fail over.
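In the configuration here, we've set a sliding window size of 10 and a failure rate threshold of 50%.
This means that a failover will be triggered once at least half of the calls recorded in the sliding window (at most the 10 most recent calls) have failed.
Because the minimum number of calls is set to 1 in this example, the failure rate is evaluated as soon as the first call has been recorded.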
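
To make the arithmetic concrete, here is a minimal sketch of a count-based sliding window. It is a toy model written for this guide, not resilience4j's implementation; the class name `SlidingWindowSketch` and its method are hypothetical. It assumes the threshold is reached when the failure rate is greater than or equal to the configured percentage, and that no rate is computed until the minimum number of calls has been recorded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy count-based sliding window; illustrative only, not resilience4j's implementation. */
final class SlidingWindowSketch {
    private final int windowSize;
    private final int minCalls;
    private final float thresholdPercent;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures;

    SlidingWindowSketch(int windowSize, int minCalls, float thresholdPercent) {
        this.windowSize = windowSize;
        this.minCalls = minCalls;
        this.thresholdPercent = thresholdPercent;
    }

    /** Records one call outcome and reports whether the failure threshold is now reached. */
    boolean record(boolean failed) {
        // When the window is full, the oldest outcome drops out before the new one is added.
        if (outcomes.size() == windowSize && outcomes.removeFirst()) {
            failures--;
        }
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() < minCalls) {
            return false; // too few recorded calls to compute a failure rate
        }
        float failureRate = 100.0f * failures / outcomes.size();
        return failureRate >= thresholdPercent;
    }
}
```

With a window of 10, a minimum of 1 call, and a 50% threshold, a single failed first call already yields a 100% failure rate in this model. If you want the breaker to wait for a fuller window before it can trip, consider raising the minimum number of calls.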

Once you've configured and created a `MultiClusterPooledConnectionProvider`, instantiate a `UnifiedJedis` instance for your application, passing in the provider you just created:

```java
UnifiedJedis jedis = new UnifiedJedis(provider);
```

You can now use this `UnifiedJedis` instance, and the connection management and failover will be handled transparently.

## Configuration options

Under the hood, Jedis' failover support relies on [resilience4j](https://resilience4j.readme.io/docs/getting-started),
a fault-tolerance library that implements [retry](https://resilience4j.readme.io/docs/retry) and [circuit breakers](https://resilience4j.readme.io/docs/circuitbreaker).

Once you configure Jedis for failover using the `MultiClusterPooledConnectionProvider`, each call to Redis is decorated with a resilience4j retry and circuit breaker.

By default, any call that throws a `JedisConnectionException` will be attempted up to 3 times (the initial call plus up to 2 retries).
If the call continues to fail after the maximum number of retry attempts, then the circuit breaker will record a failure.

The circuit breaker maintains a record of failures in a sliding window data structure.
If the failure rate reaches a configured threshold (e.g., when 50% of the last 10 calls have failed),
then the circuit breaker's state transitions from `CLOSED` to `OPEN`.
When this occurs, Jedis will attempt to connect to the next Redis database in its client configuration list.

The supported retry and circuit breaker settings, and their default values, are described below.
You can configure any of these settings using the `MultiClusterClientConfig.Builder` builder.
Refer to the basic usage section above for an example of this.

### Retry configuration

Jedis uses the following retry settings:

| Setting | Default value | Description |
|----------------------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Max retry attempts | 3 | Maximum number of retry attempts (including the initial call) |
| Retry wait duration | 500 ms | Number of milliseconds to wait between retry attempts |
| Wait duration backoff multiplier | 2 | Exponential backoff factor multiplied against wait duration between retries. For example, with a wait duration of 1 second and a multiplier of 2, the retries would occur after 1s, 2s, 4s, 8s, 16s, and so on. |
| Retry included exception list | `JedisConnectionException` | A list of `Throwable` classes that count as failures and should be retried. |
| Retry ignored exception list | Empty list | A list of `Throwable` classes to explicitly ignore for the purposes of retry. |

To disable retry, set `maxRetryAttempts` to 1.
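
As a rough illustration of how these settings combine, the sketch below computes the waits between attempts from the three numeric values in the table. It is plain Java with a hypothetical class name (`BackoffScheduleSketch`) and does not call Jedis or resilience4j; it simply assumes that the wait starts at the configured duration and is multiplied by the backoff factor after each failed attempt.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative calculation only; names are hypothetical and no library is involved. */
final class BackoffScheduleSketch {

    /** Waits (in ms) between consecutive attempts; maxAttempts includes the initial call. */
    static List<Long> waitsBetweenAttempts(int maxAttempts, long initialWaitMillis, int multiplier) {
        List<Long> waits = new ArrayList<>();
        long wait = initialWaitMillis;
        // N attempts are separated by N - 1 waits.
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(wait);
            wait *= multiplier;
        }
        return waits;
    }
}
```

Under these assumptions, the defaults (3 attempts, 500 ms, multiplier 2) give two waits, 500 ms and then 1000 ms, so a call that keeps failing adds at least 1.5 seconds of waiting before the circuit breaker records the failure. A maximum of 1 attempt gives an empty schedule, which matches the note on disabling retry.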

### Circuit breaker configuration

Jedis uses the following circuit breaker settings:

| Setting | Default value | Description |
|-----------------------------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sliding window type | `COUNT_BASED` | The type of sliding window used to record the outcome of calls. Options are `COUNT_BASED` and `TIME_BASED`. |
| Sliding window size | 100 | The size of the sliding window. Units depend on sliding window type. When `COUNT_BASED`, the size represents number of calls. When `TIME_BASED`, the size represents seconds. |
| Sliding window min calls | 100 | Minimum number of calls required (per sliding window period) before the CircuitBreaker will start calculating the error rate or slow call rate. |
| Failure rate threshold | `50.0f` | Percentage of calls within the sliding window that must fail before the circuit breaker transitions to the `OPEN` state. |
| Slow call duration threshold | 60000 ms | Duration threshold above which calls are classified as slow and added to the sliding window. |
| Slow call rate threshold | `100.0f` | Percentage of calls within the sliding window that exceed the slow call duration threshold before circuit breaker transitions to the `OPEN` state. |
| Circuit breaker included exception list | `JedisConnectionException` | A list of `Throwable` classes that count as failures and add to the failure rate. |
| Circuit breaker ignored exception list | Empty list | A list of `Throwable` classes to explicitly ignore for failure rate calculations. |

### Failover callbacks

In the event that Jedis fails over, you may wish to take some action. This might include logging a warning, recording
a metric, or externally persisting the cluster connection state, to name just a few examples. For this reason,
`MultiClusterPooledConnectionProvider` lets you register a custom callback that will be called whenever Jedis
fails over to a new cluster.

To use this feature, you'll need to design a class that implements `java.util.function.Consumer`.
This class must implement the `accept` method, as you can see below.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.function.Consumer;

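public class FailoverReporter implements Consumer<String> {

    // One logger per class rather than one per callback invocation.
    private static final Logger logger = LoggerFactory.getLogger(FailoverReporter.class);

    @Override
    public void accept(String clusterName) {
        // Invoked after a failover with the name identifying the new cluster.
        logger.warn("Jedis failover to cluster: {}", clusterName);
    }
}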
```

You can then pass an instance of this class to your `MultiClusterPooledConnectionProvider`.

```java
FailoverReporter reporter = new FailoverReporter();
provider.setClusterFailoverPostProcessor(reporter);
```

The provider will call your `accept` method whenever a failover occurs.
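
A callback can do more than log. As one possible variation, the sketch below records the most recent cluster name in a local file so that it survives a restart. The class name `ActiveClusterRecorder`, the `lastRecorded` helper, and the choice of a plain file are assumptions made for illustration; any durable store would serve the same purpose.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

/** Hypothetical callback that records the most recent failover target in a local file. */
final class ActiveClusterRecorder implements Consumer<String> {

    private final Path stateFile; // location is an assumption; use whatever store suits you

    ActiveClusterRecorder(Path stateFile) {
        this.stateFile = stateFile;
    }

    @Override
    public void accept(String clusterName) {
        try {
            // Overwrites the file so that it always holds the latest cluster name.
            Files.write(stateFile, clusterName.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrown unchecked for brevity; a production callback might log instead.
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the last recorded name, or null if nothing has been recorded yet. */
    String lastRecorded() {
        try {
            if (!Files.exists(stateFile)) {
                return null;
            }
            return new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would register it the same way as any other callback, by passing an instance to `provider.setClusterFailoverPostProcessor(...)`.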

## Failing back

We believe that failback should not be automatic.
If Jedis fails over to a new cluster, Jedis will _not_ automatically fail back to the cluster that it was previously connected to.
This design prevents a scenario in which Jedis fails back to a cluster that may not be entirely healthy yet.

That said, we do provide an API that you can use to implement automated failback when this is appropriate for your application.

## Failback scenario

When a failover is triggered, Jedis will attempt to connect to the next Redis server in the list of server configurations
you provide at setup.

For example, recall the `redis-east` and `redis-west` deployments from the basic usage example above.
Jedis will attempt to connect to `redis-east` first.
If `redis-east` becomes unavailable (and the circuit breaker transitions), then Jedis will attempt to use `redis-west`.

Now suppose that `redis-east` eventually comes back online.
You will likely want to fail your application back to `redis-east`.
However, Jedis will not fail back to `redis-east` automatically.

In this case, we recommend that you first ensure that your `redis-east` deployment is healthy before you fail back your application.

## Failback behavior and cluster selection API

Once you've determined that it's safe to fail back to a previously-unavailable cluster,
you need to decide how to trigger the failback. There are two ways to accomplish this:

1. Use the cluster selection API
2. Restart your application

### Fail back using the cluster selection API

`MultiClusterPooledConnectionProvider` exposes a method that you can use to manually select which cluster Jedis should use.
To select a different cluster to use, pass the cluster's numeric index to `setActiveMultiClusterIndex()`.

The cluster's index is a 1-based index derived from its position in the client configuration.
For example, suppose you configure Jedis with the following client configs:

```java
ClusterConfig[] clientConfigs = new ClusterConfig[2];
clientConfigs[0] = new ClusterConfig(new HostAndPort("redis-east.example.com", 14000), config);
clientConfigs[1] = new ClusterConfig(new HostAndPort("redis-west.example.com", 14000), config);
```

In this case, `redis-east` will have an index of `1`, and `redis-west` will have an index of `2`.
To select and fail back to `redis-east`, you would call the function like so:

```java
provider.setActiveMultiClusterIndex(1);
```

This method is thread-safe.

If you decide to implement manual failback, you will need a way for external systems to trigger this method in your
application. For example, if your application exposes a REST API, you might consider creating a REST endpoint
to call `setActiveMultiClusterIndex` and fail back the application.
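
Whatever transport you choose, the handler should validate the requested index before switching clusters. The sketch below shows one way to do that in plain Java. `FailbackTrigger` is a hypothetical helper, and the `IntConsumer` stands in for a method reference such as `provider::setActiveMultiClusterIndex`, so the example does not depend on Jedis or on a particular web framework.

```java
import java.util.function.IntConsumer;

/** Sketch of the checks a failback endpoint might run before switching clusters. */
final class FailbackTrigger {

    private final int clusterCount;          // number of entries in the client configuration
    private final IntConsumer selectCluster; // stand-in for provider::setActiveMultiClusterIndex

    FailbackTrigger(int clusterCount, IntConsumer selectCluster) {
        this.clusterCount = clusterCount;
        this.selectCluster = selectCluster;
    }

    /** Applies a 1-based index taken from a request parameter; returns false if it is rejected. */
    boolean handle(String rawIndex) {
        if (rawIndex == null) {
            return false;
        }
        int index;
        try {
            index = Integer.parseInt(rawIndex.trim());
        } catch (NumberFormatException e) {
            return false; // not a number
        }
        if (index < 1 || index > clusterCount) {
            return false; // outside the configured range
        }
        selectCluster.accept(index);
        return true;
    }
}
```

A REST handler could then map `true` to a success response and `false` to a client error. Since this call changes which deployment your application talks to, you will probably want to restrict who can reach the endpoint.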

### Fail back by restarting the application

When your application starts, Jedis will attempt to connect to each cluster in the order that the clusters appear
in your client configuration. It's important to understand this, especially in the case where Jedis has failed over.
If Jedis has failed over to a new cluster, then restarting the application may result in an inadvertent failback.
This can happen only if a failed cluster comes back online and the application subsequently restarts.

If you need to avoid this scenario, consider using a failover callback, as described above, to externally record
the name of the cluster that your application was most recently connected to. You can then check this state on startup
to ensure that your application only connects to the most recently used cluster. For assistance with this technique,
[start a discussion](https://github.com/redis/jedis/discussions/new?category=q-a).
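
As a starting point, the startup check might look something like the sketch below. It assumes you have persisted the most recent cluster name somewhere and that you keep your own list of names in the same order as your client configuration; how the name passed to the callback corresponds to your configuration entries is something you will need to confirm for your setup. `StartupClusterSelector` is a hypothetical helper, not part of Jedis.

```java
import java.util.List;

/** Hypothetical startup helper; the name-to-index mapping is an assumption. */
final class StartupClusterSelector {

    /**
     * Returns the 1-based index of lastActiveName within the configured order,
     * or 1 (the first cluster) when no usable name was persisted.
     */
    static int resolveIndex(List<String> namesInConfigOrder, String lastActiveName) {
        if (lastActiveName == null) {
            return 1; // nothing persisted: keep the default order
        }
        int position = namesInConfigOrder.indexOf(lastActiveName); // -1 when the name is unknown
        return position < 0 ? 1 : position + 1;
    }
}
```

Once the provider has been built, you could pass the result to `provider.setActiveMultiClusterIndex(...)` before your application starts serving traffic.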
19 changes: 19 additions & 0 deletions pom.xml
@@ -48,6 +48,7 @@
<github.global.server>github</github.global.server>
<slf4j.version>1.7.36</slf4j.version>
<jedis.module.name>redis.clients.jedis</jedis.module.name>
<resilience4j.version>1.7.1</resilience4j.version>
</properties>

<dependencies>
@@ -115,6 +116,24 @@
<version>2.14.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-all</artifactId>
<version>${resilience4j.version}</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-circuitbreaker</artifactId>
<version>${resilience4j.version}</version>
<optional>true</optional>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-retry</artifactId>
<version>${resilience4j.version}</version>
<optional>true</optional>
</dependency>
</dependencies>

<distributionManagement>