Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing partition-level histogram into adaptive tracker #1174

Merged
merged 5 commits into from
Jun 4, 2019

Conversation

jsjtzyy
Copy link
Contributor

@jsjtzyy jsjtzyy commented May 23, 2019

  1. Introduced partition-level histograms that keep track of latency of
    requests against each partition separately.
  2. Introduced OperationTrackerScope that allows user to choose either
    ColoWide or PartitionLevel histogram in adaptive tracker.
  3. Make reservoir size and decay factor configurable in Histogram.

1. Introduced partition-level histograms that keep track of latency of
requests against each partition separately.
2. Introduced OperationTrackerScope that allows user to choose either
ColoWide or PartitionLevel histogram in adaptive tracker.
3. Make reservoir size and decay factor configurable in Histogram.
@jsjtzyy jsjtzyy self-assigned this May 23, 2019
@jsjtzyy
Copy link
Contributor Author

jsjtzyy commented May 23, 2019

Initial commit. Keep adding java docs and tests.

@codecov-io
Copy link

codecov-io commented May 23, 2019

Codecov Report

Merging #1174 into master will decrease coverage by 0.37%.
The diff coverage is 95.34%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1174      +/-   ##
============================================
- Coverage     70.06%   69.69%   -0.38%     
- Complexity     5378     5396      +18     
============================================
  Files           428      430       +2     
  Lines         32791    33015     +224     
  Branches       4136     4173      +37     
============================================
+ Hits          22975    23009      +34     
- Misses         8691     8866     +175     
- Partials       1125     1140      +15
Impacted Files Coverage Δ Complexity Δ
...com.github.ambry/router/OperationTrackerScope.java 100% <100%> (ø) 1 <1> (?)
...java/com.github.ambry.router/GetBlobOperation.java 91.68% <100%> (ø) 39 <0> (ø) ⬇️
...ain/java/com.github.ambry/config/RouterConfig.java 100% <100%> (ø) 1 <0> (ø) ⬇️
....github.ambry.router/NonBlockingRouterMetrics.java 94.14% <100%> (+0.22%) 46 <3> (+5) ⬆️
...ain/java/com.github.ambry.router/GetOperation.java 96.92% <100%> (-0.14%) 27 <0> (ø)
.../com.github.ambry.router/GetBlobInfoOperation.java 85.31% <100%> (ø) 42 <0> (ø) ⬇️
....github.ambry.router/AdaptiveOperationTracker.java 93.02% <90.9%> (-2.9%) 24 <16> (+17)
...com.github.ambry.clustermap/HelixAdminFactory.java 50% <0%> (-50%) 1% <0%> (-1%)
...ava/com.github.ambry.cloud/CloudBackupManager.java 43.95% <0%> (-39.83%) 4% <0%> (+1%)
...m.github.ambry.replication/ReplicationManager.java 51.06% <0%> (-39.57%) 3% <0%> (-1%)
... and 30 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 09f0d30...c33df65. Read the comment docs.

@jsjtzyy jsjtzyy requested review from zzmao and cgtz May 24, 2019 01:28
@jsjtzyy
Copy link
Contributor Author

jsjtzyy commented May 24, 2019

./gradlew build && ./gradlew test succeeded. @zzmao @cgtz the PR is ready for review.

@@ -37,15 +37,17 @@
* perceived latencies.
*/
class AdaptiveOperationTracker extends SimpleOperationTracker {
static final long MIN_DATA_POINTS_REQUIRED = 1000;

private final RouterConfig routerConfig;
private final Time time;
private final double quantile;
private final Histogram localColoTracker;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's time to rename *Tracker to * Histogram .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private final Time time;
private final double quantile;
private final Histogram localColoTracker;
private final Histogram crossColoTracker;
private final Counter pastDueCounter;
private final OpTrackerIterator otIterator;
private Iterator<ReplicaId> replicaIterator;
private Map<PartitionId, Histogram> localColoPartitionAndLatency;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to localColoPartitionToHistogram? or localColoHistogramByPartition`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@jsjtzyy
Copy link
Contributor Author

jsjtzyy commented May 30, 2019

Addressed @zzmao 's comments. @cgtz gentle reminder to review.

for (OperationTrackerScope scope : OperationTrackerScope.values()) {
validTrackerScopes.add(scope.toString());
}
routerOperationTrackerMetricScope = validTrackerScopes.contains(scopeStr) ? OperationTrackerScope.valueOf(scopeStr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just throw an exception if the operator sets an invalid config value? I think the current behavior might hide config typos.

You could then express this code as just routerOperationTrackerMetricScope = OperationTrackerScope.valueOf(scopeStr)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously I thought we should allow frontend to start up even though the scope is invalid (by using default scope). Taking your point into consideration, I feel like we should explicitly throw exception to remind DEV/SRE the config is invalid as opposed to using default scope that we are not even aware of.
I will make the change.

private final Counter pastDueCounter;
private final OpTrackerIterator otIterator;
private Iterator<ReplicaId> replicaIterator;
private Map<PartitionId, Histogram> localColoPartitionToHistogram;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it cannot be final here because localColoPartitionToHistogram is initialized on demand (That is, it depends on router config and may not be initialized if this is Datacenter level tracker)

* @param routerConfig the {@link RouterConfig} that specifies which scope the histogram is associated with.
* @return the {@link Histogram} associated with this replica.
*/
Histogram getLatencyHistogram(ReplicaId replicaId, RouterConfig routerConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pass in routerConfig to this method? could we always use this.routerConfig?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

* @param isLocalColo {@code true} if local latency histogram should be returned. {@code false} otherwise.
* @return colo-wide latency histogram.
*/
private Histogram getColoWideTracker(NonBlockingRouterMetrics routerMetrics, RouterOperation routerOperation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is routerMetrics passed into these two methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add routerMetrics as class member and remove it from methods.

* in a single Histogram)
*/
public enum OperationTrackerScope {
ColoWide, PartitionLevel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about changing the name of these values to Datacenter and Partition. I feel that the level and wide in the names is not needed and that datacenter better matches the terminology in the clustermap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point. Will take the suggestion. (Thus, we can avoid the term Colo which may confuse some people outside LinkedIn.)

@cgtz cgtz merged commit af1b0d3 into linkedin:master Jun 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants