Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude timed out response from adaptive tracker's histogram #1173

Merged
merged 8 commits into from
May 17, 2019

Conversation

jsjtzyy
Copy link
Contributor

@jsjtzyy jsjtzyy commented May 16, 2019

  1. Introduce router error code into operation tracker's onResponse().
  2. Update histogram when error code doesn't equal OperationTimedOut.
    This will help adaptive tracker keep valid latency data points in
    histogram and will issue 2nd request in time if server crashed or is
    down for deployment.

1. Introduce router error code into operation tracker's onResponse().
2. Update histogram when error code doesn't equal OperationTimedOut.
This will help adaptive tracker keep valid latency data points in
histogram and will issue 2nd request in time if server crashed or is
down for deployment.
@jsjtzyy jsjtzyy requested review from zzmao and pnarayanan May 16, 2019 05:03
@jsjtzyy jsjtzyy self-assigned this May 16, 2019
@codecov-io
Copy link

codecov-io commented May 16, 2019

Codecov Report

Merging #1173 into master will decrease coverage by 0.03%.
The diff coverage is 87.09%.

Impacted file tree graph

@@             Coverage Diff             @@
##             master   #1173      +/-   ##
===========================================
- Coverage     70.03%     70%   -0.04%     
+ Complexity     5370    5368       -2     
===========================================
  Files           427     428       +1     
  Lines         32785   32791       +6     
  Branches       4133    4136       +3     
===========================================
- Hits          22961   22955       -6     
- Misses         8694    8698       +4     
- Partials       1130    1138       +8
Impacted Files Coverage Δ Complexity Δ
.../java/com.github.ambry.router/DeleteOperation.java 92.19% <100%> (-1.42%) 48 <4> (ø)
....github.ambry.router/TrackedRequestFinalState.java 100% <100%> (ø) 1 <1> (?)
...om.github.ambry.router/SimpleOperationTracker.java 88.07% <100%> (ø) 31 <0> (ø) ⬇️
....github.ambry.router/AdaptiveOperationTracker.java 95.91% <100%> (+0.08%) 7 <3> (+1) ⬆️
...ain/java/com.github.ambry.router/PutOperation.java 90.86% <100%> (-0.18%) 107 <0> (-1)
...va/com.github.ambry.router/TtlUpdateOperation.java 87.67% <100%> (ø) 51 <4> (+1) ⬆️
...n/java/com.github.ambry.network/NetworkClient.java 95.65% <50%> (ø) 28 <0> (ø) ⬇️
.../com.github.ambry.router/GetBlobInfoOperation.java 85.31% <82.6%> (+0.56%) 42 <3> (+2) ⬆️
...java/com.github.ambry.router/GetBlobOperation.java 91.68% <86.95%> (+0.04%) 39 <1> (+1) ⬆️
.../main/java/com.github.ambry.router/EncryptJob.java 92.1% <0%> (-5.27%) 4% <0%> (-1%)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5694e12...bc9af85. Read the comment docs.

long elapsedTime;
if (unexpiredRequestSendTimes.containsKey(replicaId)) {
elapsedTime = time.milliseconds() - unexpiredRequestSendTimes.remove(replicaId).getSecond();
} else {
elapsedTime = time.milliseconds() - expiredRequestSendTimes.remove(replicaId);
}
getLatencyHistogram(replicaId).update(elapsedTime);
if (routerErrorCode != RouterErrorCode.OperationTimedOut) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any concern about routerErrorCode == null here, please let me know.

@jsjtzyy jsjtzyy requested a review from cgtz May 16, 2019 17:24
@@ -77,15 +77,17 @@
}

@Override
public void onResponse(ReplicaId replicaId, boolean isSuccessFul) {
super.onResponse(replicaId, isSuccessFul);
public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about just passing boolean updateLatencyHistogram instead of the error code? This will keep the tracker agnostic to the meaning of different error codes and will match what was done with the isSuccessful variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. After reconsideration, it might be reasonable to keep router error code in method signature. Two humble thoughts: 1. I changed the onResponse signature in OperationTracker interface which is implemented by SimpleOperationTracker and AdaptiveOperationTracker. It seems updateLatencyHistogram doesn't apply to SimpleOperationTracker. 2. Passing router error code might give us some benefits in the future if we want to change behavior of both operation trackers based on the router error code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSuccessFul is not able to provided enough information, so you introduce RouterErrorCode.
But with both isSuccessFul and RouterErrorCode, it's a little redundant.

So I think there are two options,

  1. Just pass in RouterErrorCode, internally distinguish isSuccess and shoudUpdate.
  2. Instead of passing in boolean, we can pass in a int(enum), different type do different things.

Looks like 1 is better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, you are discussing in the thread below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took @cgtz 's suggestion and changed the method signature.

@jsjtzyy
Copy link
Contributor Author

jsjtzyy commented May 16, 2019

./gradlew build && ./gradlew test succeeded. @pnarayanan @zzmao reminder to review. Thanks!

@@ -77,15 +77,17 @@
}

@Override
public void onResponse(ReplicaId replicaId, boolean isSuccessFul) {
super.onResponse(replicaId, isSuccessFul);
public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about getting rid of the boolean in that case and determine success or failure purely based on the error code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is viable but a little aggressive in this PR. I see your points here, instead of relying on the caller, the operation tracker should resolve the isSuccessful status by itself. Personally, I am a little conservative on this, maybe we can remove that in the future PR. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, you may not be able to determine success or failure based on the error code alone, as the interpretation might be different for different operation types (for example, Blob_Deleted is an error for a Get, but perhaps not for a Delete operation). It would still be nice if the signature can be made more elegant, but I dont' have a good suggestion at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could create another enum to represent how the operation tracker should handle the response, but abstracts out the actual error code.

e.g.: enum OperationTracker.Status {SUCCESS, FAILURE, TIMED_OUT }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like a good idea. Let me make the changes as suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead using of OperationTracker.Status, I decide to use the RequestResult here. It is more reasonable in context of onResponse method. Also RequestResult is more general, it includes the case where request is not successfully sent out due to connection checkout timeout etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RequestResult is a very common name. A more descriptive name would be good.

Also it would be great if add some comments about the case @pnarayanan mentioned, to remind us in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed the enum name to RouterRequestFinalState

NonBlockingRouter.currentOperationsCount.incrementAndGet();
GetBlobInfoOperation op =
new GetBlobInfoOperation(routerConfig, routerMetrics, mockClusterMap, responseHandler, blobId, options, null,
routerCallback, kms, cryptoService, cryptoJobHandler, time, false);
requestRegistrationCallback.requestListToFill = new ArrayList<>();
op.poll(requestRegistrationCallback);
int count = 0;
while (!op.isOperationComplete()) {
time.sleep(routerConfig.routerRequestTimeoutMs + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduce the default timeout.

Copy link
Contributor

@zzmao zzmao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@cgtz cgtz merged commit 09f0d30 into linkedin:master May 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants