-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exclude timed out response from adaptive tracker's histogram #1173
Conversation
1. Introduce router error code into operation tracker's onResponse(). 2. Update histogram when error code doesn't equal OperationTimedOut. This will help adaptive tracker keep valid latency data points in histogram and will issue 2nd request in time if server crashed or is down for deployment.
Codecov Report
@@ Coverage Diff @@
## master #1173 +/- ##
===========================================
- Coverage 70.03% 70% -0.04%
+ Complexity 5370 5368 -2
===========================================
Files 427 428 +1
Lines 32785 32791 +6
Branches 4133 4136 +3
===========================================
- Hits 22961 22955 -6
- Misses 8694 8698 +4
- Partials 1130 1138 +8
Continue to review full report at Codecov.
|
long elapsedTime; | ||
if (unexpiredRequestSendTimes.containsKey(replicaId)) { | ||
elapsedTime = time.milliseconds() - unexpiredRequestSendTimes.remove(replicaId).getSecond(); | ||
} else { | ||
elapsedTime = time.milliseconds() - expiredRequestSendTimes.remove(replicaId); | ||
} | ||
getLatencyHistogram(replicaId).update(elapsedTime); | ||
if (routerErrorCode != RouterErrorCode.OperationTimedOut) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any concern about routerErrorCode == null
here, please let me know.
@@ -77,15 +77,17 @@ | |||
} | |||
|
|||
@Override | |||
public void onResponse(ReplicaId replicaId, boolean isSuccessFul) { | |||
super.onResponse(replicaId, isSuccessFul); | |||
public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about just passing boolean updateLatencyHistogram
instead of the error code? This will keep the tracker agnostic to the meaning of different error codes and will match what was done with the isSuccessful
variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. After reconsideration, it might be reasonable to keep router error code in method signature. Two humble thoughts: 1. I changed the onResponse
signature in OperationTracker
interface which is implemented by SimpleOperationTracker
and AdaptiveOperationTracker
. It seems updateLatencyHistogram
doesn't apply to SimpleOperationTracker
. 2. Passing router error code might give us some benefits in the future if we want to change behavior of both operation trackers based on the router error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSuccessFul
is not able to provided enough information, so you introduce RouterErrorCode
.
But with both isSuccessFul
and RouterErrorCode
, it's a little redundant.
So I think there are two options,
- Just pass in RouterErrorCode, internally distinguish
isSuccess
andshoudUpdate
. - Instead of passing in boolean, we can pass in a int(enum), different type do different things.
Looks like 1 is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha, you are discussing in the thread below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took @cgtz 's suggestion and changed the method signature.
./gradlew build && ./gradlew test succeeded. @pnarayanan @zzmao reminder to review. Thanks! |
@@ -77,15 +77,17 @@ | |||
} | |||
|
|||
@Override | |||
public void onResponse(ReplicaId replicaId, boolean isSuccessFul) { | |||
super.onResponse(replicaId, isSuccessFul); | |||
public void onResponse(ReplicaId replicaId, boolean isSuccessFul, RouterErrorCode routerErrorCode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about getting rid of the boolean
in that case and determine success or failure purely based on the error code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is viable but a little aggressive in this PR. I see your points here, instead of relying on the caller, the operation tracker should resolve the isSuccessful
status by itself. Personally, I am a little conservative on this, maybe we can remove that in the future PR. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, you may not be able to determine success or failure based on the error code alone, as the interpretation might be different for different operation types (for example, Blob_Deleted
is an error for a Get, but perhaps not for a Delete operation). It would still be nice if the signature can be made more elegant, but I dont' have a good suggestion at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could create another enum to represent how the operation tracker should handle the response, but abstracts out the actual error code.
e.g.: enum OperationTracker.Status {SUCCESS, FAILURE, TIMED_OUT }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like a good idea. Let me make the changes as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead using of OperationTracker.Status
, I decide to use the RequestResult
here. It is more reasonable in context of onResponse
method. Also RequestResult
is more general, it includes the case where request is not successfully sent out due to connection checkout timeout etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RequestResult
is a very common name. A more descriptive name would be good.
Also it would be great if add some comments about the case @pnarayanan mentioned, to remind us in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the enum name to RouterRequestFinalState
NonBlockingRouter.currentOperationsCount.incrementAndGet(); | ||
GetBlobInfoOperation op = | ||
new GetBlobInfoOperation(routerConfig, routerMetrics, mockClusterMap, responseHandler, blobId, options, null, | ||
routerCallback, kms, cryptoService, cryptoJobHandler, time, false); | ||
requestRegistrationCallback.requestListToFill = new ArrayList<>(); | ||
op.poll(requestRegistrationCallback); | ||
int count = 0; | ||
while (!op.isOperationComplete()) { | ||
time.sleep(routerConfig.routerRequestTimeoutMs + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduce the default timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This will help adaptive tracker keep valid latency data points in
histogram and will issue 2nd request in time if server crashed or is
down for deployment.