Currently, for non-incremental repairs with one segment per node, Reaper is not resilient to changes in token ranges: any repair schedule created before a range change fails consistently afterwards, resulting in continuous retries. In addition, these failing repairs seem to take consistent priority over repairs for other keyspaces, blocking progress completely.
ERROR
2023-06-07T14:10:55.595Z: ERROR [2023-06-07 14:10:54,386] [****:1cfdb910-df93-11ed-9996-a1697ff8c53a] i.c.j.ClusterFacade - [tokenRangeToEndpoint] no replicas found for token range io.cassandrareaper.core.Segment@41265603
2023-06-07T14:10:55.595Z: ERROR [2023-06-07 14:10:54,387] [****:1cfdb910-df93-11ed-9996-a1697ff8c53a] i.c.s.RepairRunner - RepairRun FAILURE, scheduling retry
2023-06-07T14:10:55.595Z: java.lang.IllegalArgumentException: no hosts provided to connectAny
2023-06-07T14:10:55.595Z: at com.google.common.base.Preconditions.checkArgument(Preconditions.java:135)
2023-06-07T14:10:55.596Z: at io.cassandrareaper.jmx.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:135)
2023-06-07T14:10:55.596Z: at io.cassandrareaper.jmx.ClusterFacade.connectImpl(ClusterFacade.java:885)
2023-06-07T14:10:55.596Z: at io.cassandrareaper.jmx.ClusterFacade.connect(ClusterFacade.java:869)
2023-06-07T14:10:55.596Z: at io.cassandrareaper.service.RepairRunner.startNextSegment(RepairRunner.java:472)
2023-06-07T14:10:55.596Z: at io.cassandrareaper.service.RepairRunner.run(RepairRunner.java:235)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2023-06-07T14:10:55.596Z: at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
2023-06-07T14:10:55.596Z: at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
2023-06-07T14:10:55.596Z: at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
2023-06-07T14:10:55.596Z: at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2023-06-07T14:10:55.596Z: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2023-06-07T14:10:55.596Z: at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
2023-06-07T14:10:55.596Z: at java.base/java.lang.Thread.run(Thread.java:829)
2023-06-07T14:10:00.009Z: ERROR [2023-06-07 14:09:59,816] [****:a8424cd0-df92-11ed-9996-a1697ff8c53a] i.c.j.ClusterFacade - [tokenRangeToEndpoint] no replicas found for token range io.cassandrareaper.core.Segment@60b78a64
2023-06-07T14:10:00.009Z: ERROR [2023-06-07 14:09:59,816] [****:a8424cd0-df92-11ed-9996-a1697ff8c53a] i.c.s.RepairRunner - RepairRun FAILURE, scheduling retry
2023-06-07T14:10:00.009Z: java.lang.IllegalArgumentException: no hosts provided to connectAny
2023-06-07T14:10:00.009Z: at com.google.common.base.Preconditions.checkArgument(Preconditions.java:135)
2023-06-07T14:10:00.009Z: at io.cassandrareaper.jmx.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:135)
2023-06-07T14:10:00.009Z: at io.cassandrareaper.jmx.ClusterFacade.connectImpl(ClusterFacade.java:885)
2023-06-07T14:10:00.009Z: at io.cassandrareaper.jmx.ClusterFacade.connect(ClusterFacade.java:869)
2023-06-07T14:10:00.009Z: at io.cassandrareaper.service.RepairRunner.startNextSegment(RepairRunner.java:472)
2023-06-07T14:10:00.009Z: at io.cassandrareaper.service.RepairRunner.run(RepairRunner.java:235)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2023-06-07T14:10:00.009Z: at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
2023-06-07T14:10:00.009Z: at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
2023-06-07T14:10:00.009Z: at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
2023-06-07T14:10:00.009Z: at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2023-06-07T14:10:00.009Z: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2023-06-07T14:10:00.062Z: at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
2023-06-07T14:10:00.062Z: at java.base/java.lang.Thread.run(Thread.java:829)
Code flow for the error:
RepairRunner
private void startNextSegment() throws ReaperException, InterruptedException {
  boolean scheduleRetry = true;
  // We want to know whether a repair was started,
  // so that a rescheduling of this runner will happen.
  boolean repairStarted = false;
  // We have an empty slot, so let's start new segment runner if possible.
  // When in sidecar mode, filter on ranges that the local node is a replica for only.
  LOG.info("Attempting to run new segment...");
  List<RepairSegment> nextRepairSegments
      = context.config.isInSidecarMode()
          ? ((IDistributedStorage) context.storage)
              .getNextFreeSegmentsForRanges(
                  repairRunId, localEndpointRanges)
          : context.storage.getNextFreeSegments(
              repairRunId);
  Optional<RepairSegment> nextRepairSegment = Optional.empty();
  Collection<String> potentialReplicas = new HashSet<>();
  for (RepairSegment segment : nextRepairSegments) {
    Map<String, String> potentialReplicaMap = this.repairRunService.getDCsByNodeForRepairSegment(
        cluster, segment.getTokenRange(), repairUnit.getKeyspaceName(), repairUnit);
    potentialReplicas = repairUnit.getIncrementalRepair()
        ? Collections.singletonList(segment.getCoordinatorHost())
        : potentialReplicaMap.keySet();
    JmxProxy coordinator = clusterFacade.connect(cluster, potentialReplicas);
    if (nodesReadyForNewRepair(coordinator, segment, potentialReplicaMap, repairRunId)) {
      nextRepairSegment = Optional.of(segment);
      break;
    }
  }
Since incremental repair is not in use, Reaper computes potentialReplicas as the keys of the map returned by getDCsByNodeForRepairSegment(), which depends both on the current ring state and on data stored when the schedule was created (segment.getTokenRange()).
RepairRunService
Map<String, String> getDCsByNodeForRepairSegment(
    Cluster cluster,
    Segment segment,
    String keyspace,
    RepairUnit repairUnit) throws ReaperException {
  final int maxAttempts = 2;
  for (int attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      JmxProxy jmxConnection = clusterFacade.connect(cluster);
      // when hosts are coming up or going down, this method can throw an UndeclaredThrowableException
      Collection<String> nodes = clusterFacade.tokenRangeToEndpoint(cluster, keyspace, segment);
      Map<String, String> dcByNode = Maps.newHashMap();
      nodes.forEach(node -> dcByNode.put(node, EndpointSnitchInfoProxy.create(jmxConnection).getDataCenter(node)));
      if (repairUnit.getDatacenters().isEmpty()) {
        return dcByNode;
      } else {
        return dcByNode.entrySet().stream()
            .filter(entry -> repairUnit.getDatacenters().contains(entry.getValue()))
            .collect(Collectors.toMap(entry -> entry.getKey(), entry -> entry.getValue()));
      }
    }
The mapping is computed partly from the output of tokenRangeToEndpoint(), which looks for a current token range that completely encloses the segment under repair and returns that range's replicas.
ClusterFacade
public List<String> tokenRangeToEndpoint(Cluster cluster, String keyspace, Segment segment) {
  Set<Map.Entry<List<String>, List<String>>> entries;
  try {
    entries = getRangeToEndpointMap(cluster, keyspace).entrySet();
  } catch (ReaperException e) {
    LOG.error("[tokenRangeToEndpoint] no replicas found for token range {}", segment, e);
    return Lists.newArrayList();
  }
  for (Map.Entry<List<String>, List<String>> entry : entries) {
    BigInteger rangeStart = new BigInteger(entry.getKey().get(0));
    BigInteger rangeEnd = new BigInteger(entry.getKey().get(1));
    if (new RingRange(rangeStart, rangeEnd).encloses(segment.getTokenRanges().get(0))) {
      return entry.getValue();
    }
  }
  LOG.error("[tokenRangeToEndpoint] no replicas found for token range {}", segment);
  LOG.debug("[tokenRangeToEndpoint] checked token ranges were {}", entries);
  return Lists.newArrayList();
}
With one segment per node, each Segment corresponds to a single token range. If any additive or lateral range movement occurs, then for at least some stored segments no single current token range will completely enclose the segment's range. tokenRangeToEndpoint() then returns an empty list, which causes getDCsByNodeForRepairSegment() to return an empty map, which ultimately results in an empty collection of potential coordinators being passed to connectAny().
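To make the failure mode concrete, here is a minimal, self-contained sketch of the enclosure lookup. It is not Reaper's code: RangeEnclosureSketch, Range and findEnclosing are hypothetical names, and the range model is simplified (no wrap-around handling, unlike a real ring range). It only illustrates why a segment stored before a range split can no longer be matched against the current ring.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Optional;

final class RangeEnclosureSketch {

  /** Hypothetical, simplified token range; wrap-around ranges are not modelled. */
  static final class Range {
    private final BigInteger start;
    private final BigInteger end;

    Range(BigInteger start, BigInteger end) {
      this.start = start;
      this.end = end;
    }

    static Range of(long start, long end) {
      return new Range(BigInteger.valueOf(start), BigInteger.valueOf(end));
    }

    /** True when {@code other} lies entirely within this range. */
    boolean encloses(Range other) {
      return start.compareTo(other.start) <= 0 && other.end.compareTo(end) <= 0;
    }
  }

  /** Returns the first current range that fully encloses the stored segment, if any. */
  static Optional<Range> findEnclosing(List<Range> currentRanges, Range segment) {
    return currentRanges.stream().filter(range -> range.encloses(segment)).findFirst();
  }

  public static void main(String[] args) {
    // Segment computed when the schedule was created: it matches one whole range.
    Range storedSegment = Range.of(0, 100);
    List<Range> ringAtScheduleTime = List.of(Range.of(0, 100), Range.of(100, 200));
    // Assumed topology change: the first range is split in two by a new node.
    List<Range> ringAfterSplit = List.of(Range.of(0, 50), Range.of(50, 100), Range.of(100, 200));

    System.out.println(findEnclosing(ringAtScheduleTime, storedSegment).isPresent());
    System.out.println(findEnclosing(ringAfterSplit, storedSegment).isPresent());
  }
}
```

After the split, neither half encloses the stored segment, so the lookup comes back empty on every retry, which matches the repeated "no replicas found" errors in the log.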
RepairRunner - JmxConnectionFactory
@VisibleForTesting
public final JmxProxy connectAny(Collection<Node> nodes) throws ReaperException {
  Preconditions.checkArgument(
      null != nodes && !nodes.isEmpty(), "no hosts provided to connectAny");
  List<Node> nodeList = new ArrayList<>(nodes);
  Collections.shuffle(nodeList);
resulting in the errors logged above
The cluster in use has no incremental repair. What is the right way to proceed here? Does it make sense to special-case use cases involving range movement, in the same way that incremental repair is special-cased?
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: REAP-33
What would be the direction to proceed? I see a code change in 3.2.1 that makes Reaper resilient to topology changes with incremental repair (updated code and a backend data update). Should I approach the above issue in the same way?
Hi @rathan1723, repairs cannot be resilient to topology changes, as the token ranges will be different from the ones that were used to compute the segments. What we did is make them resilient to IP address changes when the topology doesn't change.
Now what we'd need to do is make sure actual topology changes will fail the repair early on instead of trying over and over again to re-run the segments that are no longer valid.
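One possible shape for such an early failure, as a rough sketch only: check every stored segment against the current ring once, and abort the run if any segment is no longer enclosed by a current range. TopologyCheckSketch, staleSegments and failFastOnTopologyChange are hypothetical names, not Reaper APIs, and the range model is simplified (no wrap-around).

```java
import java.math.BigInteger;
import java.util.List;
import java.util.stream.Collectors;

final class TopologyCheckSketch {

  /** Hypothetical, simplified token range; wrap-around ranges are not modelled. */
  static final class Range {
    private final BigInteger start;
    private final BigInteger end;

    Range(BigInteger start, BigInteger end) {
      this.start = start;
      this.end = end;
    }

    static Range of(long start, long end) {
      return new Range(BigInteger.valueOf(start), BigInteger.valueOf(end));
    }

    boolean encloses(Range other) {
      return start.compareTo(other.start) <= 0 && other.end.compareTo(end) <= 0;
    }
  }

  /** Returns the stored segments that no current ring range fully encloses. */
  static List<Range> staleSegments(List<Range> currentRanges, List<Range> storedSegments) {
    return storedSegments.stream()
        .filter(segment -> currentRanges.stream().noneMatch(range -> range.encloses(segment)))
        .collect(Collectors.toList());
  }

  /** Hypothetical pre-flight check: fail the run once instead of retrying indefinitely. */
  static void failFastOnTopologyChange(List<Range> currentRanges, List<Range> storedSegments) {
    List<Range> stale = staleSegments(currentRanges, storedSegments);
    if (!stale.isEmpty()) {
      throw new IllegalStateException(
          stale.size() + " segment(s) no longer match the ring; aborting the repair run");
    }
  }

  public static void main(String[] args) {
    List<Range> ring = List.of(Range.of(0, 50), Range.of(50, 100));
    List<Range> segments = List.of(Range.of(0, 100), Range.of(60, 90));
    try {
      failFastOnTopologyChange(ring, segments);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In a real implementation the exception would presumably be replaced by moving the run to a terminal failed state, so that it stops blocking repairs for other keyspaces.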
Reaper Version in use v 3.2.0