Avoid "SendSystemError failed" in case of caller timeout #1814
LGTM, thanks for the PR @fangzhou37ub! I'll be happy to stamp after those changes are made.
Codecov Report
@@ Coverage Diff @@
## dev #1814 +/- ##
======================================
Coverage ? 85.87%
======================================
Files ? 227
Lines ? 11496
Branches ? 0
======================================
Hits ? 9872
Misses ? 1241
Partials ? 383
Continue to review full report at Codecov.
@@ -663,3 +663,38 @@ func TestGetSystemError(t *testing.T) {
})
}
}
Should we add an integration test where the client makes a call with a short timeout and the handler blocks until the timeout, and then verify that there are no unexpected log messages?
I tried to build an integration test inside yarpc (create an inbound server and send an outbound call to it), but the error could not be reproduced, because the yarpc outbound handles context cancellation very well.
In reality, the caller of our system is a Python-based service, which does not handle the outbound timeout the same way yarpc does. That makes it hard to build an integration test inside yarpc to verify this specific behavior.
However, the worst that could happen is that the log still gets emitted even when the caller has timed out. That's the same behavior we have today, so it would not get any worse.
Sorry for the delay here @fangzhou37ub, can we pass a test logger (e.g., http://go.uber.org/zap/zaptest/observer) to the server, and verify that no logs are emitted?
Without the change, the test should see logs. It shouldn't matter whether the caller's timeout expires as well; we should still see logs, right?
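As a rough illustration of the test shape being asked for, here is a minimal sketch that captures log output and checks that nothing is emitted once the caller's deadline has passed. It uses the standard library `log` package writing into a buffer as a stand-in for the zap observer, so that it runs without external modules. `finishCall` and `capturedLogs` are hypothetical names, not YARPC functions, and the guard on `ctx.Err()` is an assumption about how such a check could look, not the PR's actual code.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// finishCall is a hypothetical stand-in for the tail of an inbound handler.
// It logs a failed response write only while the caller is still waiting.
func finishCall(ctx context.Context, logger *log.Logger, writeErr error) {
	if writeErr == nil || ctx.Err() != nil {
		return // nothing failed, or the caller has already given up
	}
	logger.Printf("SendSystemError failed: %v", writeErr)
}

// capturedLogs runs finishCall with a failing write and returns everything
// the logger emitted. callerTimedOut selects an already-expired deadline.
func capturedLogs(callerTimedOut bool) string {
	timeout := time.Minute
	if callerTimedOut {
		timeout = -time.Second // deadline in the past: ctx.Err() is set at once
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var buf bytes.Buffer // plays the role of the observed log sink
	finishCall(ctx, log.New(&buf, "", 0), errors.New("connection closed"))
	return buf.String()
}

func main() {
	fmt.Printf("caller waiting:   %q\n", capturedLogs(false))
	fmt.Printf("caller timed out: %q\n", capturedLogs(true))
}
```

With a real zap observer, the buffer would be replaced by the observed logs and the assertion would be on their length, but the structure of the test stays the same.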
Elsewhere in YARPC, we use plain HTTP clients to call YARPC HTTP clients to
emulate bad client behaviors. This might be a useful approach.
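A minimal sketch of that idea, using only the standard library: a plain `http.Client` with a short timeout calls a handler that blocks, and the test reports what each side observed. The server here is an ordinary `net/http` test server, not a YARPC inbound, and `callWithShortTimeout` is a hypothetical helper name. It shows the shape of the test only and is not a reproduction of the TChannel issue.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// callWithShortTimeout starts a test server whose handler blocks until the
// request context is cancelled, then calls it with a plain http.Client that
// gives up after clientTimeout. It reports whether the client hit its timeout
// and whether the handler observed the cancellation.
func callWithShortTimeout(clientTimeout time.Duration) (clientTimedOut, serverSawCancel bool) {
	sawCancel := make(chan bool, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // the client went away
			sawCancel <- true
		case <-time.After(5 * time.Second): // safety net so the handler never hangs
			sawCancel <- false
		}
	}))
	defer srv.Close()

	client := &http.Client{Timeout: clientTimeout}
	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}

	var netErr net.Error
	clientTimedOut = errors.As(err, &netErr) && netErr.Timeout()
	return clientTimedOut, <-sawCancel
}

func main() {
	timedOut, sawCancel := callWithShortTimeout(50 * time.Millisecond)
	fmt.Println("client timed out:", timedOut)
	fmt.Println("server saw cancellation:", sawCancel)
}
```

A client that does not tear down the connection on timeout would need a different emulation, for example a raw `net.Conn` that simply stops reading.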
On Tue, Dec 3, 2019 at 12:26 PM, fangzhou37ub commented on this pull request, in transport/tchannel/handler_test.go:
If we use yarpc for both the outbound and the inbound in an integration test, the error
cannot be reproduced, because the yarpc outbound handles context
cancellation very well.
I tried to build the integration test that way, but the issue
could not be reproduced.
In reality, the caller of our system is a Python-based service, which
does not handle the outbound timeout the same way yarpc does. That makes it
hard to build an integration test inside yarpc to verify this specific
behavior.
However, the worst that could happen is that the log still gets emitted
even when the caller has timed out. That's the same behavior we have today,
so it would not get any worse.
Thanks @kriskowal for your comment.
From my understanding, if the HTTP client takes a Go context, then it's not helpful, because Go would cancel the context and handle the request properly.
Our caller uses tchannel-python to send requests. There, the context is not cancelled the way it is in Go, and that is when the issue appears. Feel free to let me know if I misunderstood your suggestion.
This fixes a minor logging annoyance. Some users have reported large amounts of `SendSystemError failed` and `responseWriter failed to close` logs. This logging was introduced in #1561 to help improve observability of TChannel system failures. However, if a client times out, we still log the failures; these logs are uninformative and unactionable to users, as they are expected. With this diff, we only log TChannel system failures when the client is still waiting for a response, i.e. has yet to time out. Ref T5305225#107015299, supersedes #1814
Closing in favour of #1933
…ut (#1933)
In the last step of handling an inbound request, the handler tries to close the response writer for the caller. However, if the caller has timed out, the close fails because the channel has already been closed. The handler then tries to write an error message to the same writer, which is certain to fail because, again, it is closed.
The resulting log messages are not meaningful to engineers and give them nothing to act on.
With this PR, the redundant error messages (failed to send system error, etc.) are no longer emitted when the caller has timed out.
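The failure mode described above can be illustrated with an in-memory connection from the standard library. This is only an analogy under stated assumptions: `net.Pipe` stands in for the TChannel connection, and `writeAfterPeerClose` is a hypothetical helper, not part of YARPC or TChannel. Once the caller's end is closed, any write from the handler's end returns an error, which is the kind of expected failure that used to be logged.

```go
package main

import (
	"fmt"
	"net"
)

// writeAfterPeerClose opens an in-memory connection, closes the caller's end
// to simulate a caller that has given up, and then attempts to write a
// response from the handler's end. It returns the error from that write.
func writeAfterPeerClose() error {
	caller, handler := net.Pipe()
	defer handler.Close()

	caller.Close() // the caller timed out and dropped its side

	_, err := handler.Write([]byte("response"))
	return err
}

func main() {
	if err := writeAfterPeerClose(); err != nil {
		fmt.Println("write failed as expected:", err)
	}
}
```

Because the write error is a direct consequence of the caller leaving, logging it tells the service owner nothing new, which is the motivation for suppressing it.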