Exclude timed out response from adaptive tracker's histogram #1173

jsjtzyy · 2019-05-16T05:03:45Z

Introduce a router error code into the operation tracker's onResponse().
Update the histogram only when the error code is not OperationTimedOut.
This helps the adaptive tracker keep only valid latency data points in
the histogram, so that it issues the 2nd request in time if a server has
crashed or is down for deployment.
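
To illustrate the motivation, here is a minimal sketch (not Ambry's implementation) of how timed-out responses can distort a percentile-based threshold. It assumes the adaptive tracker sends a 2nd request once the first has been outstanding longer than some percentile of the recorded latencies; the class name, the 95th percentile, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class HistogramPollutionSketch {
  // Nearest-rank percentile over recorded latencies; a simple stand-in for a real histogram.
  static long percentile(List<Long> samples, int pct) {
    if (samples.isEmpty()) {
      throw new IllegalArgumentException("no samples recorded");
    }
    List<Long> sorted = new ArrayList<>(samples);
    Collections.sort(sorted);
    int rank = (pct * sorted.size() + 99) / 100; // integer ceiling of pct% of the sample count
    return sorted.get(Math.max(rank, 1) - 1);
  }

  public static void main(String[] args) {
    // Hypothetical latencies from healthy replicas: 11..30 ms.
    List<Long> valid = new ArrayList<>();
    for (long i = 1; i <= 20; i++) {
      valid.add(10 + i);
    }
    // The same data plus five requests that only "completed" by timing out at 2000 ms.
    List<Long> polluted = new ArrayList<>(valid);
    for (int i = 0; i < 5; i++) {
      polluted.add(2000L);
    }
    System.out.println(percentile(valid, 95));    // prints 29
    System.out.println(percentile(polluted, 95)); // prints 2000
  }
}
```

With the timed-out samples included, the threshold in this sketch jumps to the timeout value itself, so a 2nd request would effectively never be sent early.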

codecov-io · 2019-05-16T05:17:33Z

Codecov Report

Merging #1173 into master will decrease coverage by 0.03%.
The diff coverage is 87.09%.

@@             Coverage Diff             @@
##             master   #1173      +/-   ##
===========================================
- Coverage     70.03%     70%   -0.04%     
+ Complexity     5370    5368       -2     
===========================================
  Files           427     428       +1     
  Lines         32785   32791       +6     
  Branches       4133    4136       +3     
===========================================
- Hits          22961   22955       -6     
- Misses         8694    8698       +4     
- Partials       1130    1138       +8

Impacted Files	Coverage Δ	Complexity Δ
.../java/com.github.ambry.router/DeleteOperation.java	`92.19% <100%> (-1.42%)`	`48 <4> (ø)`
....github.ambry.router/TrackedRequestFinalState.java	`100% <100%> (ø)`	`1 <1> (?)`
...om.github.ambry.router/SimpleOperationTracker.java	`88.07% <100%> (ø)`	`31 <0> (ø)`	⬇️
....github.ambry.router/AdaptiveOperationTracker.java	`95.91% <100%> (+0.08%)`	`7 <3> (+1)`	⬆️
...ain/java/com.github.ambry.router/PutOperation.java	`90.86% <100%> (-0.18%)`	`107 <0> (-1)`
...va/com.github.ambry.router/TtlUpdateOperation.java	`87.67% <100%> (ø)`	`51 <4> (+1)`	⬆️
...n/java/com.github.ambry.network/NetworkClient.java	`95.65% <50%> (ø)`	`28 <0> (ø)`	⬇️
.../com.github.ambry.router/GetBlobInfoOperation.java	`85.31% <82.6%> (+0.56%)`	`42 <3> (+2)`	⬆️
...java/com.github.ambry.router/GetBlobOperation.java	`91.68% <86.95%> (+0.04%)`	`39 <1> (+1)`	⬆️
.../main/java/com.github.ambry.router/EncryptJob.java	`92.1% <0%> (-5.27%)`	`4% <0%> (-1%)`
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5694e12...bc9af85.

jsjtzyy · 2019-05-16T16:51:15Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

    long elapsedTime;
    if (unexpiredRequestSendTimes.containsKey(replicaId)) {
      elapsedTime = time.milliseconds() - unexpiredRequestSendTimes.remove(replicaId).getSecond();
    } else {
      elapsedTime = time.milliseconds() - expiredRequestSendTimes.remove(replicaId);
    }
-    getLatencyHistogram(replicaId).update(elapsedTime);
+    if (routerErrorCode != RouterErrorCode.OperationTimedOut) {


If there is any concern about routerErrorCode == null here, please let me know.
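
On the null question: in Java, comparing an enum reference with `!=` is a reference comparison and does not throw when the reference is null, so a null error code would simply be treated as "not timed out". A minimal sketch, using a hypothetical stand-in enum rather than Ambry's RouterErrorCode:

```java
class NullErrorCodeSketch {
  // Hypothetical stand-in for the router's error code enum; only the values needed here.
  enum ErrorCode { OperationTimedOut, UnexpectedInternalError }

  // Same shape as the condition in the diff: record latency unless the request timed out.
  static boolean shouldUpdateHistogram(ErrorCode errorCode) {
    return errorCode != ErrorCode.OperationTimedOut;
  }

  public static void main(String[] args) {
    System.out.println(shouldUpdateHistogram(null));                              // prints true
    System.out.println(shouldUpdateHistogram(ErrorCode.UnexpectedInternalError)); // prints true
    System.out.println(shouldUpdateHistogram(ErrorCode.OperationTimedOut));       // prints false
  }
}
```

By contrast, writing the check as `errorCode.equals(...)` would throw a NullPointerException for a null argument.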

cgtz · 2019-05-16T17:30:10Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

@@ -77,15 +77,17 @@
  }

  @Override
-  public void onResponse(ReplicaId replicaId, boolean isSuccessFul) {
-    super.onResponse(replicaId, isSuccessFul);
+  public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) {


How about just passing a boolean updateLatencyHistogram instead of the error code? This keeps the tracker agnostic to the meaning of different error codes and matches what was done with the isSuccessful variable.

Thanks for the suggestion. After reconsideration, it might be reasonable to keep the router error code in the method signature. Two humble thoughts: 1. I changed the onResponse signature in the OperationTracker interface, which is implemented by SimpleOperationTracker and AdaptiveOperationTracker. updateLatencyHistogram doesn't seem to apply to SimpleOperationTracker. 2. Passing the router error code might benefit us in the future if we want to change the behavior of both operation trackers based on the router error code.

isSuccessFul does not provide enough information, so you introduced RouterErrorCode.
But passing both isSuccessFul and RouterErrorCode is a little redundant.

So I think there are two options:

1. Just pass in RouterErrorCode and internally distinguish isSuccess and shouldUpdate.

2. Instead of passing in a boolean, pass in an int (enum) whose different values trigger different handling.

Looks like 1 is better.

Ha, you are already discussing this in the thread below.

Took @cgtz's suggestion and changed the method signature.

jsjtzyy · 2019-05-16T20:20:58Z

./gradlew build && ./gradlew test succeeded. @pnarayanan @zzmao reminder to review. Thanks!

pnarayanan · 2019-05-16T23:13:46Z

ambry-router/src/main/java/com.github.ambry.router/AdaptiveOperationTracker.java

@@ -77,15 +77,17 @@
  }

  @Override
-  public void onResponse(ReplicaId replicaId, boolean isSuccessFul) {
-    super.onResponse(replicaId, isSuccessFul);
+  public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) {


How about getting rid of the boolean in that case and determine success or failure purely based on the error code?

It is viable but a little aggressive for this PR. I see your point: instead of relying on the caller, the operation tracker should resolve the isSuccessful status by itself. Personally, I am a little conservative on this; maybe we can remove the boolean in a future PR. What do you think?

On second thought, you may not be able to determine success or failure based on the error code alone, as the interpretation might differ between operation types (for example, Blob_Deleted is an error for a Get, but perhaps not for a Delete operation). It would still be nice if the signature could be made more elegant, but I don't have a good suggestion at this point.
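
The point above can be sketched as follows: the same server error may count as a failure for one operation type and as a success for another, so a tracker cannot derive success from the error code alone. The enum values and the rule for deletes are hypothetical and only illustrate the comment ("perhaps not for a Delete operation"); they are not Ambry's actual logic.

```java
class ErrorInterpretationSketch {
  // Hypothetical operation types and error codes, named for illustration only.
  enum OperationType { GET, DELETE }
  enum ErrorCode { NoError, BlobDeleted, BlobNotFound }

  // The caller, which knows the operation type, decides what counts as success.
  static boolean isSuccess(OperationType type, ErrorCode code) {
    if (code == ErrorCode.NoError) {
      return true;
    }
    // Assumption: deleting an already-deleted blob is treated as success; a Get is not.
    return type == OperationType.DELETE && code == ErrorCode.BlobDeleted;
  }

  public static void main(String[] args) {
    System.out.println(isSuccess(OperationType.GET, ErrorCode.BlobDeleted));    // prints false
    System.out.println(isSuccess(OperationType.DELETE, ErrorCode.BlobDeleted)); // prints true
  }
}
```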

Perhaps we could create another enum that represents how the operation tracker should handle the response but abstracts away the actual error code.

e.g.: enum OperationTracker.Status {SUCCESS, FAILURE, TIMED_OUT }

sounds like a good idea. Let me make the changes as suggested.
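
A minimal sketch of this idea, assuming the three states named above. The tracker class, its fields, and the method signature are hypothetical and heavily simplified; in particular, counting a timed-out request as a failure is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class StatusTrackerSketch {
  // The three states proposed in the review comment.
  enum Status { SUCCESS, FAILURE, TIMED_OUT }

  private final List<Long> latenciesMs = new ArrayList<>(); // stand-in for a latency histogram
  private int successCount = 0;
  private int failureCount = 0;

  // The caller maps its operation-specific error code to a Status before calling this.
  void onResponse(Status status, long elapsedMs) {
    if (status == Status.SUCCESS) {
      successCount++;
    } else {
      failureCount++; // assumption: a timed-out request counts as a failure
    }
    // Only responses that actually arrived contribute a latency data point.
    if (status != Status.TIMED_OUT) {
      latenciesMs.add(elapsedMs);
    }
  }

  int getSuccessCount() { return successCount; }
  int getFailureCount() { return failureCount; }
  List<Long> getLatenciesMs() { return latenciesMs; }

  public static void main(String[] args) {
    StatusTrackerSketch tracker = new StatusTrackerSketch();
    tracker.onResponse(Status.SUCCESS, 12);
    tracker.onResponse(Status.FAILURE, 15);
    tracker.onResponse(Status.TIMED_OUT, 2000);
    System.out.println(tracker.getLatenciesMs()); // prints [12, 15]
  }
}
```

The tracker stays agnostic to individual error codes, while the caller keeps the operation-specific knowledge of what counts as a success.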

Instead of using OperationTracker.Status, I decided to use RequestResult here. It is more reasonable in the context of the onResponse method. RequestResult is also more general: it includes the case where a request is not successfully sent out, for example due to a connection checkout timeout.

RequestResult is a very common name. A more descriptive name would be good.

Also, it would be great to add some comments about the case @pnarayanan mentioned, as a reminder for the future.

changed the enum name to RouterRequestFinalState

zzmao · 2019-05-17T17:38:49Z

ambry-router/src/test/java/com.github.ambry.router/GetBlobInfoOperationTest.java

    NonBlockingRouter.currentOperationsCount.incrementAndGet();
    GetBlobInfoOperation op =
        new GetBlobInfoOperation(routerConfig, routerMetrics, mockClusterMap, responseHandler, blobId, options, null,
            routerCallback, kms, cryptoService, cryptoJobHandler, time, false);
    requestRegistrationCallback.requestListToFill = new ArrayList<>();
    op.poll(requestRegistrationCallback);
+    int count = 0;
    while (!op.isOperationComplete()) {
      time.sleep(routerConfig.routerRequestTimeoutMs + 1);


Reduce the default timeout.

zzmao

LGTM.

jsjtzyy requested review from zzmao and pnarayanan May 16, 2019 05:03

jsjtzyy self-assigned this May 16, 2019

clean code

36fb482

jsjtzyy requested a review from cgtz May 16, 2019 17:24

cgtz suggested changes May 16, 2019

added java doc for two methods

61b7714

pnarayanan reviewed May 16, 2019

pnarayanan approved these changes May 16, 2019

introduce RequsetResult to address the comments

8d5e984

zzmao reviewed May 17, 2019

address ze's comments

d88a815

zzmao approved these changes May 17, 2019

jsjtzyy added 3 commits May 17, 2019 11:29

maker timeout in test even shorter

c04c41c

rename the enum name as suggested

9cbafab

rename the enum to TrackedRequestFinalState to make it clear

bc9af85

cgtz approved these changes May 17, 2019

cgtz merged commit 09f0d30 into linkedin:master May 17, 2019
