Introducing partition-level histogram into adaptive tracker #1174

jsjtzyy · 2019-05-23T04:50:33Z

1. Introduced partition-level histograms that track the latency of requests against each partition separately.
2. Introduced OperationTrackerScope, which lets the user choose either the ColoWide or the PartitionLevel histogram in the adaptive tracker.
3. Made the reservoir size and decay factor configurable in Histogram.
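
The scope switch described in this PR can be pictured with a minimal, dependency-free sketch. It is not the code from the diff: `ScopedLatencyTracker`, `LatencyHistogram`, and the string partition ids are hypothetical stand-ins, and reservoir size and decay factor are not modeled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names other than OperationTrackerScope,
// ColoWide and PartitionLevel do not come from the PR.
class ScopedLatencyTracker {
  /** Mirrors the two scope values named in this PR. */
  enum OperationTrackerScope { ColoWide, PartitionLevel }

  /** Stand-in for a metrics histogram: it simply keeps every sample. */
  static class LatencyHistogram {
    private final List<Long> samples = new ArrayList<>();

    void update(long latencyMs) {
      samples.add(latencyMs);
    }

    int count() {
      return samples.size();
    }
  }

  private final OperationTrackerScope scope;
  // One shared histogram for the colo-wide scope.
  private final LatencyHistogram coloWide = new LatencyHistogram();
  // One histogram per partition, created the first time a partition is seen.
  private final Map<String, LatencyHistogram> byPartition = new HashMap<>();

  ScopedLatencyTracker(OperationTrackerScope scope) {
    this.scope = scope;
  }

  /** Returns the histogram that latencies for this partition are recorded in. */
  LatencyHistogram histogramFor(String partitionId) {
    if (scope == OperationTrackerScope.PartitionLevel) {
      return byPartition.computeIfAbsent(partitionId, p -> new LatencyHistogram());
    }
    return coloWide;
  }

  void record(String partitionId, long latencyMs) {
    histogramFor(partitionId).update(latencyMs);
  }
}
```

With `PartitionLevel`, a slow partition only skews its own histogram; with `ColoWide`, all partitions share one set of samples.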

jsjtzyy · 2019-05-23T04:50:57Z

Initial commit. I will keep adding Javadocs and tests.

codecov-io · 2019-05-23T05:05:50Z

Codecov Report

Merging #1174 into master will decrease coverage by 0.37%.
The diff coverage is 95.34%.

@@             Coverage Diff              @@
##             master    #1174      +/-   ##
============================================
- Coverage     70.06%   69.69%   -0.38%     
- Complexity     5378     5396      +18     
============================================
  Files           428      430       +2     
  Lines         32791    33015     +224     
  Branches       4136     4173      +37     
============================================
+ Hits          22975    23009      +34     
- Misses         8691     8866     +175     
- Partials       1125     1140      +15

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...com.github.ambry/router/OperationTrackerScope.java | `100% <100%> (ø)` | `1 <1> (?)` | |
| ...java/com.github.ambry.router/GetBlobOperation.java | `91.68% <100%> (ø)` | `39 <0> (ø)` | ⬇️ |
| ...ain/java/com.github.ambry/config/RouterConfig.java | `100% <100%> (ø)` | `1 <0> (ø)` | ⬇️ |
| ....github.ambry.router/NonBlockingRouterMetrics.java | `94.14% <100%> (+0.22%)` | `46 <3> (+5)` | ⬆️ |
| ...ain/java/com.github.ambry.router/GetOperation.java | `96.92% <100%> (-0.14%)` | `27 <0> (ø)` | |
| .../com.github.ambry.router/GetBlobInfoOperation.java | `85.31% <100%> (ø)` | `42 <0> (ø)` | ⬇️ |
| ....github.ambry.router/AdaptiveOperationTracker.java | `93.02% <90.9%> (-2.9%)` | `24 <16> (+17)` | |
| ...com.github.ambry.clustermap/HelixAdminFactory.java | `50% <0%> (-50%)` | `1% <0%> (-1%)` | |
| ...ava/com.github.ambry.cloud/CloudBackupManager.java | `43.95% <0%> (-39.83%)` | `4% <0%> (+1%)` | |
| ...m.github.ambry.replication/ReplicationManager.java | `51.06% <0%> (-39.57%)` | `3% <0%> (-1%)` | |
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 09f0d30...c33df65.

jsjtzyy · 2019-05-24T21:07:18Z

`./gradlew build && ./gradlew test` succeeded. @zzmao @cgtz, the PR is ready for review.

zzmao · 2019-05-28T23:56:51Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

@@ -37,15 +37,17 @@
 * perceived latencies.
 */
 class AdaptiveOperationTracker extends SimpleOperationTracker {
-  static final long MIN_DATA_POINTS_REQUIRED = 1000;
-
+  private final RouterConfig routerConfig;
  private final Time time;
  private final double quantile;
  private final Histogram localColoTracker;


Looks like it's time to rename `*Tracker` to `*Histogram`.

zzmao · 2019-05-28T23:57:29Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

  private final Time time;
  private final double quantile;
  private final Histogram localColoTracker;
  private final Histogram crossColoTracker;
  private final Counter pastDueCounter;
  private final OpTrackerIterator otIterator;
  private Iterator<ReplicaId> replicaIterator;
+  private Map<PartitionId, Histogram> localColoPartitionAndLatency;


Rename to `localColoPartitionToHistogram` or `localColoHistogramByPartition`?

jsjtzyy · 2019-05-30T16:49:10Z

Addressed @zzmao's comments. @cgtz, gentle reminder to review.

cgtz · 2019-06-01T01:00:17Z

ambry-api/src/main/java/com.github.ambry/config/RouterConfig.java

+    for (OperationTrackerScope scope : OperationTrackerScope.values()) {
+      validTrackerScopes.add(scope.toString());
+    }
+    routerOperationTrackerMetricScope = validTrackerScopes.contains(scopeStr) ? OperationTrackerScope.valueOf(scopeStr)


Why not just throw an exception if the operator sets an invalid config value? I think the current behavior might hide config typos.

You could then express this code as just `routerOperationTrackerMetricScope = OperationTrackerScope.valueOf(scopeStr)`.

jsjtzyy · 2019-06-04

Previously I thought we should allow the frontend to start up even when the scope is invalid (by falling back to the default scope). Taking your point into consideration, I think we should explicitly throw an exception to alert DEV/SRE that the config is invalid, rather than silently using a default scope that nobody realizes is in effect.
I will make the change.
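
For readers comparing the two behaviors discussed above, here is a small sketch. `StrictScopeParsing`, `parseLenient`, and the `fallback` parameter are hypothetical names for illustration; the actual default used by the original code is not shown in this thread. The strict variant relies only on the standard behavior of `Enum.valueOf`, which throws `IllegalArgumentException` when no constant has the given name.

```java
// Hypothetical illustration; only OperationTrackerScope and its two values
// come from the PR.
class StrictScopeParsing {
  enum OperationTrackerScope { ColoWide, PartitionLevel }

  /** Lenient: an unknown value is silently replaced by the given fallback. */
  static OperationTrackerScope parseLenient(String scopeStr, OperationTrackerScope fallback) {
    for (OperationTrackerScope scope : OperationTrackerScope.values()) {
      if (scope.toString().equals(scopeStr)) {
        return scope;
      }
    }
    return fallback;
  }

  /** Strict: an unknown value surfaces immediately as IllegalArgumentException. */
  static OperationTrackerScope parseStrict(String scopeStr) {
    return OperationTrackerScope.valueOf(scopeStr);
  }
}
```

With the strict variant, a typo such as `PartitonLevel` fails at startup instead of quietly selecting a scope the operator did not ask for.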

cgtz · 2019-06-01T01:01:01Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

  private final Counter pastDueCounter;
  private final OpTrackerIterator otIterator;
  private Iterator<ReplicaId> replicaIterator;
+  private Map<PartitionId, Histogram> localColoPartitionToHistogram;


Should these be final?

jsjtzyy · 2019-06-04

Unfortunately, it cannot be final here because `localColoPartitionToHistogram` is initialized on demand: it depends on the router config and may never be initialized if this is a datacenter-level tracker.
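
A rough sketch of the on-demand pattern described in this reply, under the assumption that the map is created on first use rather than in the constructor. `OnDemandHistogramMap`, `partitionMapOrNull`, and the `List<Long>` value type are hypothetical placeholders, not code from the diff.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a field that cannot be declared final
// because it is assigned after construction, and only for one scope.
class OnDemandHistogramMap {
  enum OperationTrackerScope { ColoWide, PartitionLevel }

  private final OperationTrackerScope scope;
  // Deliberately not final: stays null unless the partition-level scope is in use.
  private Map<String, List<Long>> localColoPartitionToHistogram;

  OnDemandHistogramMap(OperationTrackerScope scope) {
    this.scope = scope;
  }

  /** Creates the map on first use for PartitionLevel; returns null for ColoWide. */
  Map<String, List<Long>> partitionMapOrNull() {
    if (scope == OperationTrackerScope.PartitionLevel && localColoPartitionToHistogram == null) {
      localColoPartitionToHistogram = new HashMap<>();
    }
    return localColoPartitionToHistogram;
  }
}
```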

cgtz · 2019-06-03T19:42:39Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

+   * @param routerConfig the {@link RouterConfig} that specifies which scope the histogram is associated with.
+   * @return the {@link Histogram} associated with this replica.
+   */
+  Histogram getLatencyHistogram(ReplicaId replicaId, RouterConfig routerConfig) {


Why pass `routerConfig` into this method? Could we always use `this.routerConfig`?

cgtz · 2019-06-04T01:20:42Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

+   * @param isLocalColo {@code true} if local latency histogram should be returned. {@code false} otherwise.
+   * @return colo-wide latency histogram.
+   */
+  private Histogram getColoWideTracker(NonBlockingRouterMetrics routerMetrics, RouterOperation routerOperation,


Why is `routerMetrics` passed into these two methods?

jsjtzyy · 2019-06-04

Will add `routerMetrics` as a class member and remove it from the methods.

cgtz · 2019-06-04T01:26:49Z

ambry-api/src/main/java/com.github.ambry/router/OperationTrackerScope.java

+ * in a single Histogram)
+ */
+public enum OperationTrackerScope {
+  ColoWide, PartitionLevel


How do you feel about changing the names of these values to `Datacenter` and `Partition`? I feel that the "Level" and "Wide" in the names are not needed, and that "datacenter" better matches the terminology in the clustermap.

jsjtzyy · 2019-06-04

Excellent point. Will take the suggestion. (This also avoids the term "Colo", which may confuse people outside LinkedIn.)

jsjtzyy self-assigned this May 23, 2019

add tests and java docs

695ee40

jsjtzyy requested review from zzmao and cgtz May 24, 2019 01:28

add more tests

9d3a448

zzmao reviewed May 29, 2019

addressed some of Ze's comments

09fde8e

zzmao approved these changes May 30, 2019

cgtz approved these changes Jun 4, 2019

address Casey's comments

c33df65

cgtz merged commit af1b0d3 into linkedin:master Jun 4, 2019
