Avoid "SendSystemError failed" in case of caller timeout #1814

fangzhou37ub · 2019-10-03T21:39:21Z

While finishing an inbound request, the handler tries to close the response writer for the caller. If the caller has already timed out, the close fails because the channel has been closed. The handler then tries to write an error message to the same writer, which fails for the same reason.
The resulting log message gives engineers nothing to act on.
This PR stops emitting those redundant error messages (failed to send system error, etc.) when the caller has timed out.
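The idea in the description can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the change made in this PR: `callerTimedOut`, `logSendFailure`, and `countLogs` are hypothetical names, and the sketch detects the timeout through the request context, which may differ from how the handler in `transport/tchannel/handler.go` detects it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callerTimedOut reports whether the caller's deadline has already passed.
func callerTimedOut(ctx context.Context) bool {
	return errors.Is(ctx.Err(), context.DeadlineExceeded)
}

// logSendFailure is a hypothetical helper. It records a failed attempt to
// send a system error only when the caller could still be waiting.
func logSendFailure(ctx context.Context, err error, logf func(string, ...interface{})) {
	if err == nil || callerTimedOut(ctx) {
		return // the caller is gone, so the failure is expected and not actionable
	}
	logf("SendSystemError failed: %v", err)
}

// countLogs simulates one failed send and returns how many log lines were
// emitted, with or without the caller having timed out first.
func countLogs(callerTimesOut bool) int {
	timeout := time.Minute
	if callerTimesOut {
		timeout = time.Millisecond
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if callerTimesOut {
		<-ctx.Done() // wait until the deadline has actually passed
	}

	n := 0
	logSendFailure(ctx, errors.New("connection closed"), func(string, ...interface{}) { n++ })
	return n
}

func main() {
	fmt.Println("logs after caller timeout:", countLogs(true))
	fmt.Println("logs while caller is waiting:", countLogs(false))
}
```

Keeping the check in one small helper means both failure paths (closing the writer and sending the system error) can share the same condition.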

CLAassistant · 2019-10-03T21:40:25Z

All committers have signed the CLA.

peats-bond

LGTM, thanks for the PR @fangzhou37ub! I'll be happy to stamp after those changes are made.

transport/tchannel/handler_test.go

codecov · 2019-10-11T22:43:40Z

Codecov Report

❗ No coverage uploaded for pull request base (dev@d875647).
The diff coverage is 100%.

@@          Coverage Diff           @@
##             dev    #1814   +/-   ##
======================================
  Coverage       ?   85.87%           
======================================
  Files          ?      227           
  Lines          ?    11496           
  Branches       ?        0           
======================================
  Hits           ?     9872           
  Misses         ?     1241           
  Partials       ?      383

Impacted Files	Coverage Δ
transport/tchannel/handler.go	`85.61% <100%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d875647...914031f.

transport/tchannel/handler_test.go

CHANGELOG.md

transport/tchannel/handler.go

prashantv · 2019-10-30T20:51:22Z

transport/tchannel/handler_test.go

@@ -663,3 +663,38 @@ func TestGetSystemError(t *testing.T) {
 		})
 	}
 }
+


Should we add an integration test where the client makes a call with a short timeout, the handler blocks until the timeout, and we verify that there are no unexpected log messages?

fangzhou37ub · 2019-12-03

I tried to build an integration test inside yarpc (create an inbound server and send an outbound call to it), but the error could not be reproduced, because the yarpc outbound handles context cancellation correctly.
In reality, the caller of our system is a Python-based service that does not handle outbound timeouts the way yarpc does. That makes it hard to build an integration test inside yarpc that verifies this specific behavior.

However, the worst that could happen is that the log still gets emitted even when the caller has timed out. That is the behavior we have today, so this change cannot make things worse.

prashantv · 2020-01-09

Sorry for the delay here @fangzhou37ub -- can we pass a test logger (e.g., http://go.uber.org/zap/zaptest/observer) to the server and verify that no logs are emitted?

Without the change, the test should see logs: even if the caller's timeout expires as well, we should still see logs, right?
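The shape of such a test can be sketched with the standard library alone. This is a rough illustration, not code from the repository: a `bytes.Buffer` behind a `log.Logger` stands in for the zap observer so the sketch stays self-contained, and `handle`, `observedLogs`, and the `filterTimeouts` switch are hypothetical names used to compare behavior with and without the change.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// handle is a hypothetical stand-in for an inbound handler. It blocks until
// its work finishes or the caller's deadline passes, then pretends that
// closing the response writer failed.
func handle(ctx context.Context, logger *log.Logger, work time.Duration, filterTimeouts bool) {
	select {
	case <-time.After(work):
	case <-ctx.Done():
	}
	closeErr := errors.New("response writer closed")
	if filterTimeouts && errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return // with the change: stay quiet once the caller has timed out
	}
	logger.Printf("responseWriter failed to close: %v", closeErr)
}

// observedLogs runs the handler against a caller with a short timeout and
// returns everything the test logger captured.
func observedLogs(filterTimeouts bool) string {
	var buf bytes.Buffer
	logger := log.New(&buf, "", 0)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	handle(ctx, logger, time.Second, filterTimeouts)
	return buf.String()
}

func main() {
	fmt.Printf("without the change: %q\n", observedLogs(false))
	fmt.Printf("with the change:    %q\n", observedLogs(true))
}
```

Running the same scenario with the filter switched off gives the baseline asked for above: the log must appear without the change, so its absence with the change is meaningful.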

kriskowal · 2019-12-03T20:30:45Z

Elsewhere in YARPC, we use plain HTTP clients to call YARPC HTTP clients to emulate bad client behaviors. This might be a useful approach.
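A minimal sketch of that approach, using only `net/http` and `net/http/httptest`: a plain client with a short timeout calls a handler that blocks, and the handler reports whether it saw the request context being cancelled. The function name `emulateImpatientClient` and the durations are assumptions for illustration; this covers HTTP only and says nothing about how a tchannel-python caller behaves.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// emulateImpatientClient starts a test server whose handler blocks, calls it
// with a plain http.Client that gives up early, and reports whether the
// handler observed the cancellation along with the client's error.
func emulateImpatientClient() (bool, error) {
	sawCancel := make(chan bool, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // the client went away
			sawCancel <- true
		case <-time.After(2 * time.Second): // safety net so the handler always returns
			sawCancel <- false
		}
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 50 * time.Millisecond}
	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}

	select {
	case saw := <-sawCancel:
		return saw, err
	case <-time.After(3 * time.Second): // the handler never ran
		return false, err
	}
}

func main() {
	saw, err := emulateImpatientClient()
	fmt.Println("client error:", err)
	fmt.Println("handler saw cancellation:", saw)
}
```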

fangzhou37ub · 2019-12-03T21:21:33Z

Thanks @kriskowal for your comment.

> Elsewhere in YARPC, we use plain HTTP clients to call YARPC HTTP clients to emulate bad client behaviors.

From my understanding, if the HTTP client takes a Go context, this does not help, because Go cancels the context and the request is handled properly.
E.g., many of the HTTP clients in the repo eventually reach this:

func sendRequestAndValidateResp(t testing.TB, out transport.UnaryOutbound, opts api.RequestOpts) {
	f := func(i int) bool {
		resp, cancel, err := sendRequest(out, opts.GiveRequest, opts.GiveTimeout)
		defer cancel()
...

// sendRequest is defined here:
func sendRequest(out transport.UnaryOutbound, request *transport.Request, timeout time.Duration) (*transport.Response, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	resp, err := out.Call(ctx, request)
...

Our caller uses tchannel-python to send requests. There, the context is not cancelled the way it is in Go, and that is when the issue appears.

Feel free to let me know if I misunderstood your suggestion.
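The point about Go callers can be illustrated with a small sketch. It assumes a hypothetical `slowCall` in place of `out.Call` and mirrors only the shape of `sendRequest` quoted above: once the deadline passes, the call returns `context.DeadlineExceeded` instead of leaving the request hanging.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for out.Call. It returns when the
// simulated server replies or when the context is done, whichever is first.
func slowCall(ctx context.Context, serverLatency time.Duration) error {
	select {
	case <-time.After(serverLatency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout follows the same pattern as sendRequest: derive a context
// with a deadline and hand it to the outbound call.
func callWithTimeout(timeout, serverLatency time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return slowCall(ctx, serverLatency)
}

// timedOut reports whether a short deadline against a slow server ends with
// context.DeadlineExceeded.
func timedOut() bool {
	err := callWithTimeout(10*time.Millisecond, time.Second)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("deadline exceeded:", timedOut())
}
```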

This fixes a minor logging annoyance. Some users have reported large amounts of `SendSystemError failed` and `responseWriter failed to close` logs. This logging was introduced in #1561 to help improve observability of TChannel system failures. However, if a client times out, we still log the failures; these logs are uninformative and unactionable for users because the failures are expected. With this diff, we only log TChannel system failures when the client is still waiting for a response, i.e., has yet to time out. Ref T5305225#107015299, supersedes #1814

peats-bond · 2020-05-18T21:06:11Z

closing in favour of #1933

AllenLuUber requested review from kriskowal and peats-bond October 7, 2019 22:36

peats-bond reviewed Oct 9, 2019

peats-bond added the CI label Oct 9, 2019

UberOpenSourceBot removed the CI label Oct 9, 2019

Avoid "SendSystemError failed" and "responseWriter failed to close"

35bbe7f

fangzhou37ub force-pushed the dev branch from 8382050 to 35bbe7f Compare October 12, 2019 04:46

peats-bond reviewed Oct 16, 2019

fangzhou37ub added 2 commits October 17, 2019 11:20

address comments

09ef2ca

merge master

d875647

prashantv reviewed Oct 30, 2019

fangzhou37ub added 2 commits December 3, 2019 10:26

Merge upstream

9a78bd2

address feedback: use GetSystemErrorCode instead of getSystemError

914031f

peats-bond mentioned this pull request May 18, 2020

tchannel: Only log system error failures when clients haven't timed out #1933

Merged

1 task

peats-bond closed this May 18, 2020

Review thread participants: prashantv (Oct 30, 2019), fangzhou37ub (Dec 3, 2019), prashantv (Jan 9, 2020)
