Skip to content

Commit

Permalink
Retry with backoff on cluster connection failures (#2358)
Browse files Browse the repository at this point in the history
* Split JedisClusterCommand into multiple methods

No behavior changes, just a refactoring.

Changes:
* Replaces recursion with a for loop
* Extract redirection handling into its own method
* Extract connection-failed handling into its own method

Note that `tryWithRandomNode` is gone, it was never `true` so it and its
code didn't survive the refactoring.

* Drop redundant null check

* Bump JDK version to 1.8

Inspired by #1334 where this went real easy :).

Would have made #2355 shorter.

Free public updates for JDK 7 ended in 2015:
<https://en.wikipedia.org/wiki/Java_version_history>

For JDK 8, free public support is available from non-Orace vendors until
at least 2026 according to the same table.

And JDK 8 is what Jedis is being tested on anyway:
<https://github.com/redis/jedis/blob/ac0969315655180c09b8139c16bded09c068d498/.circleci/config.yml#L67-L74>

* Replace ConnectionGetters with lambdas

* Retrigger CI

* Add backoff to Redis connections

* Add unit tests for backoff logic

* Add retries logging

* Always use the user requested timeout

* Remedy review feedback

* Consider connection exceptions and disregard random nodes

* consider connection exceptions and disregard random nodes

* reset redirection

* Revert "Consider connection exceptions and disregard random nodes"

This reverts commit 67a062a.

Lots of tests in JedisClusterCommandTests started failing, need to be
fixed before trying again.

* Add another backoff test case

1. We try to contact master => JedisConnectionException
2. We try to contact replica => It refers us to master, hasn't failed over yet
3. We try to contact master => JedisConnectionException
4. We try to contact replica => Success, because it has now failed over

* consider connection exceptions and disregard random nodes

* reset redirection

* Fix test failure

* Apply suggestions from code review

Co-authored-by: Jens Green Olander <jensgreen@users.noreply.github.com>

* update documentation

* Improve a comment

* Update src/main/java/redis/clients/jedis/DefaultJedisSocketFactory.java

* Add change from another branch

Source (all of these point to the same place):
* walles/retries-split
* 4f80d73
* #2355

* Move JedisClusterCommandTest out of commands package

* Use JedisClusterOperationException

* Reduce sleep time, especially when few attempts left

* Update src/main/java/redis/clients/jedis/JedisClusterCommand.java

* merge fix

* merge fix

* Use maxAttempts

* format import

* Re-add missing codes due to merge

* avoid NPE while zero max attempts

* Remove zero attempts test

* More cluster constructors and customizability

* Use maxTotalRetriesDuration everywhere

* more missing maxTotalRetriesDuration after merge

Co-authored-by: M Sazzadul Hoque <7600764+sazzad16@users.noreply.github.com>
Co-authored-by: Jens Green Olander <jgreen@spotify.com>
Co-authored-by: Jens Green Olander <jensgreen@users.noreply.github.com>
  • Loading branch information
4 people committed Mar 31, 2021
1 parent 71dac36 commit 270bb71
Show file tree
Hide file tree
Showing 9 changed files with 1,055 additions and 526 deletions.
6 changes: 6 additions & 0 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,12 @@
<version>2.3.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<version>3.7.7</version>
<scope>test</scope>
</dependency>
</dependencies>

<distributionManagement>
Expand Down
5 changes: 5 additions & 0 deletions src/main/java/redis/clients/jedis/BinaryJedis.java
Original file line number Diff line number Diff line change
Expand Up @@ -292,6 +292,11 @@ public BinaryJedis(final JedisSocketFactory jedisSocketFactory, final JedisClien
initializeFromClientConfig(clientConfig);
}

@Override
public String toString() {
return "BinaryJedis{" + client + '}';
}

public boolean isConnected() {
return client.isConnected();
}
Expand Down
557 changes: 298 additions & 259 deletions src/main/java/redis/clients/jedis/BinaryJedisCluster.java

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions src/main/java/redis/clients/jedis/Connection.java
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,11 @@ public Connection(final JedisSocketFactory jedisSocketFactory) {
this.soTimeout = jedisSocketFactory.getSoTimeout();
}

@Override
public String toString() {
return "Connection{" + socketFactory + "}";
}

public Socket getSocket() {
return socket;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -220,4 +220,9 @@ public HostAndPortMapper getHostAndPortMapper() {
public void setHostAndPortMapper(HostAndPortMapper hostAndPortMapper) {
this.hostAndPortMapper = hostAndPortMapper;
}

@Override
public String toString() {
return "DefaultJedisSocketFactory{" + hostAndPort.toString() + "}";
}
}
539 changes: 282 additions & 257 deletions src/main/java/redis/clients/jedis/JedisCluster.java

Large diffs are not rendered by default.

91 changes: 82 additions & 9 deletions src/main/java/redis/clients/jedis/JedisClusterCommand.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
package redis.clients.jedis;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

Expand All @@ -18,10 +21,22 @@ public abstract class JedisClusterCommand<T> {

private final JedisClusterConnectionHandler connectionHandler;
private final int maxAttempts;
private final Duration maxTotalRetriesDuration;

public JedisClusterCommand(JedisClusterConnectionHandler connectionHandler, int maxAttempts) {
this(connectionHandler, maxAttempts, Duration.ofMillis((long) BinaryJedisCluster.DEFAULT_TIMEOUT * maxAttempts));
}

/**
* @param connectionHandler
* @param maxAttempts
* @param maxTotalRetriesDuration No more attempts after we have been trying for this long.
*/
public JedisClusterCommand(JedisClusterConnectionHandler connectionHandler, int maxAttempts,
Duration maxTotalRetriesDuration) {
this.connectionHandler = connectionHandler;
this.maxAttempts = maxAttempts;
this.maxTotalRetriesDuration = maxTotalRetriesDuration;
}

public abstract T execute(Jedis connection);
Expand Down Expand Up @@ -85,7 +100,10 @@ public T runWithAnyNode() {
}

private T runWithRetries(final int slot) {
Instant deadline = Instant.now().plus(maxTotalRetriesDuration);

JedisRedirectionException redirect = null;
int consecutiveConnectionFailures = 0;
Exception lastException = null;
for (int attemptsLeft = this.maxAttempts; attemptsLeft > 0; attemptsLeft--) {
Jedis connection = null;
Expand All @@ -106,15 +124,21 @@ private T runWithRetries(final int slot) {
throw jnrcne;
} catch (JedisConnectionException jce) {
lastException = jce;
++consecutiveConnectionFailures;
LOG.debug("Failed connecting to Redis: {}", connection, jce);
// "- 1" because we just did one, but the attemptsLeft counter hasn't been decremented yet
handleConnectionProblem(attemptsLeft - 1);
boolean reset = handleConnectionProblem(attemptsLeft - 1, consecutiveConnectionFailures, deadline);
if (reset) {
consecutiveConnectionFailures = 0;
redirect = null;
}
} catch (JedisRedirectionException jre) {
// avoid updating lastException if it is a connection exception
if (lastException == null || lastException instanceof JedisRedirectionException) {
lastException = jre;
}
LOG.debug("Redirected by server to {}", jre.getTargetNode());
consecutiveConnectionFailures = 0;
redirect = jre;
// if MOVED redirection occurred,
if (jre instanceof JedisMovedDataException) {
Expand All @@ -124,6 +148,9 @@ private T runWithRetries(final int slot) {
} finally {
releaseConnection(connection);
}
if (Instant.now().isAfter(deadline)) {
throw new JedisClusterOperationException("Cluster retry deadline exceeded.");
}
}

JedisClusterMaxAttemptsException maxAttemptsException
Expand All @@ -132,14 +159,60 @@ private T runWithRetries(final int slot) {
throw maxAttemptsException;
}

private void handleConnectionProblem(int attemptsLeft) {
if (attemptsLeft <= 1) {
// We need this because if node is not reachable anymore - we need to finally initiate slots
// renewing, or we can stuck with cluster state without one node in opposite case.
// But now if maxAttempts = [1 or 2] we will do it too often.
// TODO make tracking of successful/unsuccessful operations for node - do renewing only
// if there were no successful responses from this node last few seconds
this.connectionHandler.renewSlotCache();
/**
* Related values should be reset if <code>TRUE</code> is returned.
*
* @param attemptsLeft
* @param consecutiveConnectionFailures
* @param doneDeadline
* @return true - if some actions are taken
* <br /> false - if no actions are taken
*/
private boolean handleConnectionProblem(int attemptsLeft, int consecutiveConnectionFailures, Instant doneDeadline) {
if (this.maxAttempts < 3) {
// Since we only renew the slots cache after two consecutive connection
// failures (see consecutiveConnectionFailures above), we need to special
// case the situation where we max out after two or fewer attempts.
// Otherwise, on two or fewer max attempts, the slots cache would never be
// renewed.
if (attemptsLeft == 0) {
this.connectionHandler.renewSlotCache();
return true;
}
return false;
}

if (consecutiveConnectionFailures < 2) {
return false;
}

sleep(getBackoffSleepMillis(attemptsLeft, doneDeadline));
//We need this because if node is not reachable anymore - we need to finally initiate slots
//renewing, or we can stuck with cluster state without one node in opposite case.
//TODO make tracking of successful/unsuccessful operations for node - do renewing only
//if there were no successful responses from this node last few seconds
this.connectionHandler.renewSlotCache();
return true;
}

private static long getBackoffSleepMillis(int attemptsLeft, Instant deadline) {
if (attemptsLeft <= 0) {
return 0;
}

long millisLeft = Duration.between(Instant.now(), deadline).toMillis();
if (millisLeft < 0) {
throw new JedisClusterOperationException("Cluster retry deadline exceeded.");
}

return millisLeft / (attemptsLeft * (attemptsLeft + 1));
}

protected void sleep(long sleepMillis) {
try {
TimeUnit.MILLISECONDS.sleep(sleepMillis);
} catch (InterruptedException e) {
throw new JedisClusterOperationException(e);
}
}

Expand Down
Loading

0 comments on commit 270bb71

Please sign in to comment.