Introducing partition-level histogram into adaptive tracker #1174
1. Introduced partition-level histograms that track the latency of requests against each partition separately.
2. Introduced OperationTrackerScope, which lets the user choose either a ColoWide or a PartitionLevel histogram in the adaptive tracker.
3. Made the reservoir size and decay factor configurable in Histogram.
Initial commit. Javadocs and tests are still being added.
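The idea in the description above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `RecentLatencies` is a hypothetical stand-in that keeps the most recent samples in a fixed-size window (the PR uses a `Histogram` with a configurable reservoir size and decay factor instead), `ScopedLatencyTracker` is an invented name, and partitions are identified by plain strings rather than `PartitionId`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** The two scopes named in the PR description. */
enum OperationTrackerScope { ColoWide, PartitionLevel }

/** Hypothetical stand-in for a histogram: retains only the most recent `capacity` latencies. */
class RecentLatencies {
  private final int capacity;
  private final Deque<Long> values = new ArrayDeque<>();

  RecentLatencies(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  void update(long latencyMs) {
    if (values.size() == capacity) {
      values.removeFirst(); // evict the oldest sample
    }
    values.addLast(latencyMs);
  }

  int count() {
    return values.size();
  }

  /** Nearest-rank quantile over the retained samples, for q in (0, 1]. */
  long quantile(double q) {
    if (values.isEmpty()) {
      throw new IllegalStateException("no samples recorded");
    }
    long[] sorted = values.stream().mapToLong(Long::longValue).sorted().toArray();
    int rank = (int) Math.ceil(q * sorted.length);
    return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
  }
}

/** Hypothetical tracker that picks one shared sample set or one per partition. */
class ScopedLatencyTracker {
  private final OperationTrackerScope scope;
  private final int capacity;
  private final RecentLatencies coloWide;
  private final Map<String, RecentLatencies> byPartition = new HashMap<>();

  ScopedLatencyTracker(OperationTrackerScope scope, int capacity) {
    this.scope = scope;
    this.capacity = capacity;
    this.coloWide = new RecentLatencies(capacity);
  }

  /** Returns the sample set that requests against the given partition should update. */
  RecentLatencies sampleFor(String partitionId) {
    if (scope == OperationTrackerScope.ColoWide) {
      return coloWide;
    }
    // Create the per-partition sample set lazily, on first use.
    return byPartition.computeIfAbsent(partitionId, id -> new RecentLatencies(capacity));
  }
}
```

With the partition-level scope, a slow partition only raises its own quantile estimate; with the colo-wide scope, all partitions feed (and read from) the same estimate.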
Codecov Report
@@ Coverage Diff @@
## master #1174 +/- ##
============================================
- Coverage 70.06% 69.69% -0.38%
- Complexity 5378 5396 +18
============================================
Files 428 430 +2
Lines 32791 33015 +224
Branches 4136 4173 +37
============================================
+ Hits 22975 23009 +34
- Misses 8691 8866 +175
- Partials 1125 1140 +15
@@ -37,15 +37,17 @@
* perceived latencies.
*/
class AdaptiveOperationTracker extends SimpleOperationTracker {
static final long MIN_DATA_POINTS_REQUIRED = 1000;

private final RouterConfig routerConfig;
private final Time time;
private final double quantile;
private final Histogram localColoTracker;
Looks like it's time to rename `*Tracker` to `*Histogram`.
Done
private final Time time;
private final double quantile;
private final Histogram localColoTracker;
private final Histogram crossColoTracker;
private final Counter pastDueCounter;
private final OpTrackerIterator otIterator;
private Iterator<ReplicaId> replicaIterator;
private Map<PartitionId, Histogram> localColoPartitionAndLatency;
Rename to `localColoPartitionToHistogram`? Or `localColoHistogramByPartition`?
Changed.
for (OperationTrackerScope scope : OperationTrackerScope.values()) {
  validTrackerScopes.add(scope.toString());
}
routerOperationTrackerMetricScope = validTrackerScopes.contains(scopeStr) ? OperationTrackerScope.valueOf(scopeStr)
Why not just throw an exception if the operator sets an invalid config value? I think the current behavior might hide config typos.
You could then express this code as just `routerOperationTrackerMetricScope = OperationTrackerScope.valueOf(scopeStr)`.
Previously I thought we should allow the frontend to start up even if the scope is invalid (by falling back to the default scope). Taking your point into consideration, I agree that we should explicitly throw an exception to alert DEV/SRE that the config is invalid, rather than silently using a default scope they may not be aware of.
I will make the change.
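A minimal sketch of the fail-fast behavior discussed above. `Enum.valueOf` already throws `IllegalArgumentException` for an unknown name; the wrapper below only adds a more descriptive message. `ScopeParser` and `parse` are hypothetical names used for illustration and are not taken from the PR.

```java
import java.util.Arrays;

enum OperationTrackerScope { ColoWide, PartitionLevel }

/** Hypothetical helper; the PR may simply call OperationTrackerScope.valueOf directly. */
class ScopeParser {
  static OperationTrackerScope parse(String scopeStr) {
    try {
      // valueOf is case-sensitive and rejects any name that is not an enum constant.
      return OperationTrackerScope.valueOf(scopeStr);
    } catch (IllegalArgumentException e) {
      // Re-throw with the list of accepted values so a config typo is easy to spot.
      throw new IllegalArgumentException(
          "Invalid operation tracker scope '" + scopeStr + "'; expected one of "
              + Arrays.toString(OperationTrackerScope.values()), e);
    }
  }
}
```

Note that `valueOf` throws `NullPointerException` rather than `IllegalArgumentException` when given `null`, so a missing config value would need separate handling.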
private final Counter pastDueCounter;
private final OpTrackerIterator otIterator;
private Iterator<ReplicaId> replicaIterator;
private Map<PartitionId, Histogram> localColoPartitionToHistogram;
final for these?
Unfortunately, it cannot be `final` here because `localColoPartitionToHistogram` is initialized on demand (that is, it depends on the router config and may not be initialized if this is a Datacenter-level tracker).
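For reference, one possible alternative, sketched under the assumption that the scope is already known when the constructor runs (the PR may instead initialize the map later, in which case this does not apply): Java allows a `final` field to be assigned conditionally in the constructor as long as every path assigns it exactly once. `TrackerFields` is a hypothetical class, and the map's key and value types are simplified stand-ins for `PartitionId` and `Histogram`.

```java
import java.util.HashMap;
import java.util.Map;

enum OperationTrackerScope { ColoWide, PartitionLevel }

/** Hypothetical class showing conditional assignment of a final field. */
class TrackerFields {
  // Stand-in types: String for PartitionId, Long for Histogram.
  private final Map<String, Long> localColoPartitionToHistogram;

  TrackerFields(OperationTrackerScope scope) {
    // Assigned exactly once on every path, so the field can stay final.
    // It is left null when per-partition tracking is not in use.
    localColoPartitionToHistogram =
        scope == OperationTrackerScope.PartitionLevel ? new HashMap<>() : null;
  }

  boolean hasPartitionHistograms() {
    return localColoPartitionToHistogram != null;
  }
}
```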
* @param routerConfig the {@link RouterConfig} that specifies which scope the histogram is associated with.
* @return the {@link Histogram} associated with this replica.
*/
Histogram getLatencyHistogram(ReplicaId replicaId, RouterConfig routerConfig) {
Why pass `routerConfig` into this method? Could we always use `this.routerConfig`?
Removed.
* @param isLocalColo {@code true} if local latency histogram should be returned. {@code false} otherwise.
* @return colo-wide latency histogram.
*/
private Histogram getColoWideTracker(NonBlockingRouterMetrics routerMetrics, RouterOperation routerOperation,
Why is `routerMetrics` passed into these two methods?
Add `routerMetrics` as a class member and remove it from the methods.
* in a single Histogram)
*/
public enum OperationTrackerScope {
ColoWide, PartitionLevel
How do you feel about changing the names of these values to `Datacenter` and `Partition`? I feel that the "Level" and "Wide" in the names are not needed and that "datacenter" better matches the terminology in the clustermap.
Excellent point. Will take the suggestion. (Thus, we can avoid the term `Colo`, which may confuse some people outside LinkedIn.)