Improve status reporting in UI #1262

Merged
merged 32 commits into from
Feb 20, 2023
7d3abf9
Migration to add secondary index on repair runs table.
Miles-Garnsey Jan 10, 2023
37d6eaf
RepairRunResource should prioritise RUNNING repairs when a limit is p…
Miles-Garnsey Jan 10, 2023
b06d357
Stub out the bones of queries required to use new secondary index.
Miles-Garnsey Jan 10, 2023
21e3e7a
Shift around order or RepairRun.RunState so it reflects our desired o…
Miles-Garnsey Jan 11, 2023
8af069e
Implement getRepairRunsForClusterPrioritiseRunning for MemoryStorage.
Miles-Garnsey Jan 11, 2023
12046fe
Additional import for CassandraStorage.
Miles-Garnsey Jan 11, 2023
e31d419
Checkstyle, rename migration.
Miles-Garnsey Jan 11, 2023
bced1f1
A more elegant way to iterate over the RunStates and query for them.
Miles-Garnsey Jan 11, 2023
e0d0bf9
Checkstyle, make sure MemoryStorage returns only the first <limit> ro…
Miles-Garnsey Jan 11, 2023
d61b8d6
Fix use of incorrectly named column in `getRepairRunForClusterWhereSt…
Miles-Garnsey Jan 12, 2023
def2799
Use index table `repair_run_by_cluster_v2` to obtain UUIDs for cluste…
Miles-Garnsey Jan 13, 2023
b5c0989
Avoid out of bounds list indices by ensuring that we always use min(l…
Miles-Garnsey Jan 13, 2023
2ee5be8
Integration test definition.
Miles-Garnsey Jan 13, 2023
094c190
Re-order the RunState enum according to clarified requirements.
Miles-Garnsey Jan 13, 2023
1bb531f
Checkstyle...
Miles-Garnsey Jan 16, 2023
5e126e6
Rework the storage layer queries so that they only use 2i for high pr…
Miles-Garnsey Jan 16, 2023
6c6028f
Fix acceptance test arity issue.
Miles-Garnsey Jan 16, 2023
330b0bb
Alex's suggestions.
Miles-Garnsey Jan 16, 2023
d2e65be
Fix deserialisation problem when getting RepairRun back from API.
Miles-Garnsey Jan 17, 2023
e8d061e
Fix test logic so that repair runs > 1 are aborted and repair run 1 i…
Miles-Garnsey Jan 17, 2023
23c06d1
Need to set repair run to RUNNING before pausing/aborting.
Miles-Garnsey Jan 18, 2023
4cf7a80
While loop not terminating in tests, let's try a for loop.
Miles-Garnsey Jan 18, 2023
e38cc83
Make number of repairs added and aborted configurable.
Miles-Garnsey Jan 19, 2023
86d3a5a
Update src/server/src/main/resources/db/cassandra/032_add_2i_status.cql
Miles-Garnsey Feb 14, 2023
3d9e2f0
Add additional test to check RepairRunStatus ordering against `/repai…
Miles-Garnsey Feb 15, 2023
aa8fb9c
New ordering function for Lists of RepairRuns, using `isTerminated()`…
Miles-Garnsey Feb 15, 2023
093633c
Checkstyle...
Miles-Garnsey Feb 15, 2023
e935f54
Move `SortByRunState()` into `RepairRunService`.
Miles-Garnsey Feb 17, 2023
106043f
Use getRepairRunsForCluster for the repair_runs/ endpoint.
Miles-Garnsey Feb 17, 2023
ccc1f3e
Checkstyle...
Miles-Garnsey Feb 17, 2023
b9320a4
Flip the ordering in RunState comparator.
Miles-Garnsey Feb 17, 2023
b23b40e
Simplify filtering logic for runstates in CassandraStorage.
Miles-Garnsey Feb 20, 2023
Original file line number Diff line number Diff line change
Expand Up @@ -180,12 +180,15 @@ public String toString() {
return String.format("%s[%s] for %s", getClass().getSimpleName(), id.toString(), clusterName);
}

+ // The values in this enum are declared in order of "interestingness",
+ // this is used to order RepairRuns in the UI so that e.g. RUNNING runs come first.
  public enum RunState {
- NOT_STARTED,
  RUNNING,
+ PAUSED,
+ NOT_STARTED,
+
  ERROR,
  DONE,
- PAUSED,
  ABORTED,
  DELETED;
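
The enum comment relies on the fact that Java enums compare by declaration order. A minimal sketch of that idea, assuming a standalone copy of the enum in the new order; the class name `RunStateOrderingSketch` and the helper `sortByInterest` are hypothetical and not part of this change:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RunStateOrderingSketch {

  // Hypothetical copy of the enum, in the order this change declares it.
  enum RunState { RUNNING, PAUSED, NOT_STARTED, ERROR, DONE, ABORTED, DELETED }

  // Enums implement Comparable using declaration order, so natural ordering
  // places RUNNING first and DELETED last. The input list is left untouched.
  static List<RunState> sortByInterest(List<RunState> states) {
    List<RunState> sorted = new ArrayList<>(states);
    sorted.sort(Comparator.naturalOrder());
    return sorted;
  }

  public static void main(String[] args) {
    List<RunState> input = Arrays.asList(
        RunState.DONE, RunState.NOT_STARTED, RunState.RUNNING, RunState.PAUSED);
    System.out.println(sortByInterest(input));
  }
}
```

Reordering the constants changes their `ordinal()` values, so this approach is only safe where states are stored by name rather than by ordinal.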

Expand Down Expand Up @@ -341,4 +344,5 @@ public RepairRun build(UUID id) {
return new RepairRun(this, id);
}
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,8 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static java.lang.Math.min;


@Path("/repair_run")
@Produces(MediaType.APPLICATION_JSON)
Expand Down Expand Up @@ -608,7 +610,9 @@ public Response getRepairRunsForCluster(
@QueryParam("limit") Optional<Integer> limit) {

LOG.debug("get repair run for cluster called with: cluster_name = {}", clusterName);
- final Collection<RepairRun> repairRuns = context.storage.getRepairRunsForCluster(clusterName, limit);
+ final Collection<RepairRun> repairRuns = context
+ .storage
+ .getRepairRunsForClusterPrioritiseRunning(clusterName, limit);
Contributor:
You need to do the same on line 665. That's the method that lists repair runs across all registered clusters.
It currently still uses the old getRepairRunsForCluster() method, which isn't optimized to prioritize running repairs.

Contributor Author:
The problem is that on line 665, it is adding the runs from ALL clusters into a list. I'll need to re-sort the runs so that the statuses are in the right order irrespective of cluster. Once you let me know the precise way you want the sorting done, I might just encapsulate it in a function so that we aren't duplicating that logic in multiple places.

Contributor:
Right, you'll need to merge lists from different clusters and re-sort to apply the limits.
FYI, repair ids are timeuuids, which makes it possible to apply the sorting on them directly using their time component.

Contributor Author:
This appears to be done and working now, although the cucumber test doesn't test against multiple clusters. I'm not sure if I should add a multi-cluster test to confirm that this endpoint does indeed work?

Contributor:
I think that would be fairly hard to achieve with the limited resources we have at our disposal.
A mocked test wouldn't help much I guess, so we're left with manual testing for multi cluster 🤷

final Collection<RepairRunStatus> repairRunViews = new ArrayList<>();
for (final RepairRun repairRun : repairRuns) {
repairRunViews.add(getRepairRunStatus(repairRun));
Expand Down Expand Up @@ -661,17 +665,18 @@ public Response listRepairRuns(
: context.storage.getClusters();

List<RepairRunStatus> runStatuses = Lists.newArrayList();
- for (final Cluster clstr : clusters) {
- Collection<RepairRun> runs = context.storage.getRepairRunsForCluster(clstr.getName(), limit);
-
- runStatuses.addAll(
- (List<RepairRunStatus>) getRunStatuses(runs, desiredStates)
- .stream()
- .filter((run) -> !keyspace.isPresent()
- || ((RepairRunStatus)run).getKeyspaceName().equals(keyspace.get()))
- .collect(Collectors.toList()));
- }
-
+ List<RepairRun> repairRuns = Lists.newArrayList();
+ clusters.forEach(clstr -> repairRuns.addAll(
+ context.storage.getRepairRunsForClusterPrioritiseRunning(clstr.getName(), limit))
+ );
+ RepairRunService.sortByRunState(repairRuns);
+ runStatuses.addAll(
+ (List<RepairRunStatus>) getRunStatuses(
+ repairRuns.subList(0, min(repairRuns.size(), limit.orElse(1000))), desiredStates)
+ .stream()
+ .filter((run) -> !keyspace.isPresent()
+ || ((RepairRunStatus)run).getKeyspaceName().equals(keyspace.get()))
+ .collect(Collectors.toList()));
return Response.ok().entity(runStatuses).build();
} catch (IllegalArgumentException e) {
return Response.serverError().entity("Failed find cluster " + cluster.get()).build();
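
The merge, re-sort, and trim flow discussed in the review thread can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this change: `Run` is a hypothetical stand-in carrying only a name and a terminated flag, and `mergeSortTrim` is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeAndTrimSketch {

  // Hypothetical stand-in for a repair run.
  static final class Run {
    final String name;
    final boolean terminated;

    Run(String name, boolean terminated) {
      this.name = name;
      this.terminated = terminated;
    }
  }

  // Merge the per-cluster lists, put non-terminated runs first, then apply the limit once.
  static List<String> mergeSortTrim(List<List<Run>> perCluster, int limit) {
    List<Run> merged = new ArrayList<>();
    for (List<Run> runs : perCluster) {
      merged.addAll(runs);
    }
    // Boolean ordering puts false before true; List.sort is stable, so ties keep merge order.
    merged.sort(Comparator.comparing((Run run) -> run.terminated));
    List<String> names = new ArrayList<>();
    for (Run run : merged.subList(0, Math.min(merged.size(), limit))) {
      names.add(run.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Run> clusterA = Arrays.asList(new Run("a-done", true), new Run("a-running", false));
    List<Run> clusterB = Arrays.asList(new Run("b-running", false));
    System.out.println(mergeSortTrim(Arrays.asList(clusterA, clusterB), 2));
  }
}
```

Applying the limit only after the merged list is sorted is what keeps an active run in one cluster from being pushed out by finished runs in another.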
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@

import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
Expand Down Expand Up @@ -85,6 +86,22 @@ public static RepairRunService create(AppContext context) {
return new RepairRunService(context, () -> ClusterFacade.create(context));
}

public static void sortByRunState(List<RepairRun> repairRunCollection) {
Comparator<RepairRun> comparator = new Comparator<RepairRun>() {
@Override
public int compare(RepairRun o1, RepairRun o2) {
if (!o1.getRunState().isTerminated() && o2.getRunState().isTerminated()) {
return -1; // o1 appears first.
} else if (o1.getRunState().isTerminated() && !o2.getRunState().isTerminated()) {
return 1; // o2 appears first.
} else { // Both RunStates have equal isFinished() values; compare on time instead.
return o1.getId().compareTo(o2.getId());
}
}
};
Collections.sort(repairRunCollection, comparator);
}
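
The tie-break above uses `UUID.compareTo`. In `java.util.UUID`, that method compares the two 64-bit halves as signed longs, and for version-1 (time-based) UUIDs the low timestamp bits occupy the most significant position, so the result does not necessarily follow creation time. A minimal sketch of ordering on the embedded timestamp instead, assuming all ids are version-1 UUIDs; `timeUuid` and `sortByTime` are hypothetical helpers, not part of this change:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

class TimeUuidOrderSketch {

  // Hypothetical helper: builds a version-1 UUID from a raw 60-bit timestamp.
  static UUID timeUuid(long timestamp) {
    long timeLow = timestamp & 0xFFFFFFFFL;
    long timeMid = (timestamp >>> 32) & 0xFFFFL;
    long timeHigh = (timestamp >>> 48) & 0x0FFFL;
    long msb = (timeLow << 32) | (timeMid << 16) | (1L << 12) | timeHigh;
    long lsb = 0x8000000000000001L; // variant bits "10", arbitrary clock sequence and node
    return new UUID(msb, lsb);
  }

  // Orders ids oldest first using UUID.timestamp(), which throws
  // UnsupportedOperationException for UUIDs that are not version 1.
  static List<UUID> sortByTime(List<UUID> ids) {
    List<UUID> sorted = new ArrayList<>(ids);
    sorted.sort(Comparator.comparingLong(UUID::timestamp));
    return sorted;
  }

  public static void main(String[] args) {
    UUID older = timeUuid(1L);
    UUID newer = timeUuid(1L << 32);
    // The older id sorts first by timestamp but is "greater" under compareTo.
    System.out.println(sortByTime(Arrays.asList(newer, older)).get(0).equals(older));
    System.out.println(older.compareTo(newer) > 0);
  }
}
```

Reversing the comparator with `.reversed()` would give newest-first ordering if that is what the UI expects.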

/**
* Creates a repair run but does not start it immediately.
*
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@
import java.io.IOException;
import java.math.BigInteger;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Date;
Expand Down Expand Up @@ -114,6 +115,8 @@
import systems.composable.dropwizard.cassandra.pooling.PoolingOptionsFactory;
import systems.composable.dropwizard.cassandra.retry.RetryPolicyFactory;

import static java.lang.Math.min;

public final class CassandraStorage implements IStorage, IDistributedStorage {

private static final int METRICS_PARTITIONING_TIME_MINS = 10;
Expand Down Expand Up @@ -157,6 +160,7 @@ public RepairUnit load(UUID repairUnitId) throws Exception {
private PreparedStatement insertRepairRunUnitIndexPrepStmt;
private PreparedStatement getRepairRunPrepStmt;
private PreparedStatement getRepairRunForClusterPrepStmt;
private PreparedStatement getRepairRunForClusterWhereStatusPrepStmt;
private PreparedStatement getRepairRunForUnitPrepStmt;
private PreparedStatement deleteRepairRunPrepStmt;
private PreparedStatement deleteRepairRunByClusterPrepStmt;
Expand Down Expand Up @@ -422,6 +426,8 @@ private void prepareStatements() {
.setConsistencyLevel(ConsistencyLevel.QUORUM);
getRepairRunForClusterPrepStmt = session.prepare(
"SELECT * FROM repair_run_by_cluster_v2 WHERE cluster_name = ? limit ?");
getRepairRunForClusterWhereStatusPrepStmt = session.prepare(
"SELECT id FROM repair_run_by_cluster_v2 WHERE cluster_name = ? AND repair_run_state = ? limit ?");
getRepairRunForUnitPrepStmt = session.prepare("SELECT * FROM repair_run_by_unit WHERE repair_unit_id = ?");
deleteRepairRunPrepStmt = session.prepare("DELETE FROM repair_run WHERE id = ?");
deleteRepairRunByClusterPrepStmt
Expand Down Expand Up @@ -957,6 +963,69 @@ public Collection<RepairRun> getRepairRunsForCluster(String clusterName, Optiona
return getRepairRunsAsync(repairRunFutures);
}

@Override
public List<RepairRun> getRepairRunsForClusterPrioritiseRunning(String clusterName, Optional<Integer> limit) {
List<ResultSetFuture> repairUuidFuturesByState = Lists.<ResultSetFuture>newArrayList();
// We've set up the RunState enum so that values are declared in order of "interestingness",
// we iterate over the table via the secondary index according to that ordering.
for (String state:Arrays.asList("RUNNING", "PAUSED", "NOT_STARTED")) {
repairUuidFuturesByState.add(
// repairUUIDFutures will be a List of resultSetFutures, each of which contains a ResultSet of
// UUIDs for one status.
session
.executeAsync(getRepairRunForClusterWhereStatusPrepStmt
.bind(clusterName, state.toString(), limit.orElse(MAX_RETURNED_REPAIR_RUNS)
)
)
);
}
ResultSetFuture repairUuidFuturesNoState = session
.executeAsync(getRepairRunForClusterPrepStmt
.bind(clusterName, limit.orElse(MAX_RETURNED_REPAIR_RUNS)
)
);

List<UUID> flattenedUuids = Lists.<UUID>newArrayList();
// Flatten the UUIDs from each status down into a single array.
for (ResultSetFuture idResSetFuture : repairUuidFuturesByState) {
idResSetFuture
.getUninterruptibly()
.forEach(
row -> flattenedUuids.add(row.getUUID("id"))
);
}
// Merge the two lists and trim.
repairUuidFuturesNoState.getUninterruptibly().forEach(row -> {
UUID uuid = row.getUUID("id");
if (!flattenedUuids.contains(uuid)) {
flattenedUuids.add(uuid);
}
}
);
flattenedUuids.subList(0, min(flattenedUuids.size(), limit.orElse(MAX_RETURNED_REPAIR_RUNS)));

// Run an async query on each UUID in the flattened list, against the main repair_run table with
// all columns required as an input to `buildRepairRunFromRow`.
List<ResultSetFuture> repairRunFutures = Lists.<ResultSetFuture>newArrayList();
flattenedUuids.forEach(uuid ->
repairRunFutures.add(
session
.executeAsync(getRepairRunPrepStmt.bind(uuid)
)
)
);
// Defuture the repair_run rows and build the strongly typed RepairRun objects from the contents.
return repairRunFutures
.stream()
.map(
row -> {
Row extractedRow = row.getUninterruptibly().one();
return buildRepairRunFromRow(extractedRow, extractedRow.getUUID("id"));
}
).collect(Collectors.toList());
}
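
The "merge the two lists and trim" step above can be sketched on its own. Two facts about the standard library matter here: `List.subList` returns a view and leaves the original list unchanged, so a trim only takes effect if the returned list is used, and a `LinkedHashSet` removes duplicates while keeping insertion order. The helper `mergeAndTrim` below is a hypothetical illustration, not the method in this change:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MergeIdsSketch {

  // Prioritised ids come first, the remaining ids follow, duplicates are dropped,
  // and the result is cut to at most `limit` entries.
  static <T> List<T> mergeAndTrim(List<T> prioritised, List<T> remaining, int limit) {
    Set<T> ordered = new LinkedHashSet<>(prioritised);
    ordered.addAll(remaining); // ids already present keep their earlier position
    List<T> merged = new ArrayList<>(ordered);
    // Copy the subList view so the caller gets an independent, trimmed list.
    return new ArrayList<>(merged.subList(0, Math.min(merged.size(), limit)));
  }

  public static void main(String[] args) {
    List<Integer> byState = Arrays.asList(3, 1);
    List<Integer> allRuns = Arrays.asList(1, 2, 3, 4);
    System.out.println(mergeAndTrim(byState, allRuns, 3));
  }
}
```

Set membership checks are constant time on average, whereas `List.contains` scans the list on every call.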


@Override
public Collection<RepairRun> getRepairRunsForUnit(UUID repairUnitId) {
List<ResultSetFuture> repairRunFutures = Lists.<ResultSetFuture>newArrayList();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,8 @@ public interface IStorage extends Managed {
/** return all the repair runs in a cluster, in reverse chronological order, with default limit is 1000 */
Collection<RepairRun> getRepairRunsForCluster(String clusterName, Optional<Integer> limit);

Collection<RepairRun> getRepairRunsForClusterPrioritiseRunning(String clusterName, Optional<Integer> limit);

Collection<RepairRun> getRepairRunsForUnit(UUID repairUnitId);

Collection<RepairRun> getRepairRunsWithState(RepairRun.RunState runState);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,12 +27,14 @@
import io.cassandrareaper.core.Snapshot;
import io.cassandrareaper.resources.view.RepairRunStatus;
import io.cassandrareaper.resources.view.RepairScheduleStatus;
import io.cassandrareaper.service.RepairRunService;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Optional;
Expand All @@ -53,6 +55,8 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static java.lang.Math.min;

/**
* Implements the StorageAPI using transient Java classes.
*/
Expand Down Expand Up @@ -193,6 +197,18 @@ public List<RepairRun> getRepairRunsForCluster(String clusterName, Optional<Inte
return foundRepairRuns;
}

@Override
public List<RepairRun> getRepairRunsForClusterPrioritiseRunning(String clusterName, Optional<Integer> limit) {
List<RepairRun> foundRepairRuns = repairRuns
.values()
.stream()
.filter(
row -> row.getClusterName().equals(clusterName.toLowerCase(Locale.ROOT))).collect(Collectors.toList()
);
RepairRunService.sortByRunState(foundRepairRuns);
return foundRepairRuns.subList(0, min(foundRepairRuns.size(), limit.orElse(1000)));
}

@Override
public Collection<RepairRun> getRepairRunsForUnit(UUID repairUnitId) {
List<RepairRun> foundRepairRuns = new ArrayList<>();
Expand Down
18 changes: 18 additions & 0 deletions src/server/src/main/resources/db/cassandra/032_add_2i_status.cql
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
--
-- Copyright 2021-2021 Datastax inc.
--
-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
-- Add a secondary index on `state` to the `repair_run` table.

CREATE INDEX IF NOT EXISTS state2i ON repair_run_by_cluster_v2 (repair_run_state);