Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid "SendSystemError failed" in case of caller timeout #1814

Closed
wants to merge 5 commits into from

Conversation

fangzhou37ub
Copy link

@fangzhou37ub fangzhou37ub commented Oct 3, 2019

In the last step of handling inbound request, the handler tried to close response writer for caller. However, if the caller has timed out, the close action would fail since the channel has been closed. In addition, the handler would try to write error message to the writer, which would certainly fail because again, it's closed.
The message is clueless and not very meaningful for engineers.
This PR would not send the redundant error messages (failed to send system error etc.) in case of caller timeout.

Screen Shot 2019-10-03 at 2 38 07 PM

@CLAassistant
Copy link

CLAassistant commented Oct 3, 2019

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@peats-bond peats-bond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, that's for the PR @fangzhou37ub! I'll be happy to stamp after those changes are made.

transport/tchannel/handler_test.go Outdated Show resolved Hide resolved
transport/tchannel/handler_test.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 11, 2019

Codecov Report

❗ No coverage uploaded for pull request base (dev@d875647). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff           @@
##             dev    #1814   +/-   ##
======================================
  Coverage       ?   85.87%           
======================================
  Files          ?      227           
  Lines          ?    11496           
  Branches       ?        0           
======================================
  Hits           ?     9872           
  Misses         ?     1241           
  Partials       ?      383
Impacted Files Coverage Δ
transport/tchannel/handler.go 85.61% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d875647...914031f. Read the comment docs.

transport/tchannel/handler_test.go Outdated Show resolved Hide resolved
transport/tchannel/handler_test.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
transport/tchannel/handler.go Show resolved Hide resolved
transport/tchannel/handler.go Outdated Show resolved Hide resolved
@@ -663,3 +663,38 @@ func TestGetSystemError(t *testing.T) {
})
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add an integration test where the client makes a call with a short timeout, and the handler blocks until timeout, and we verify that there's no unexpected log messages?

Copy link
Author

@fangzhou37ub fangzhou37ub Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to build integration test inside yarpc (create an inbound server, and send outbound call to that), the error could not be reproduced. Because the outbound yarpc would handle the context cancelling very well.
In reality, the caller of our system is a python based service, where it does not handle the outbound timeout the same way like yarpc. That makes it hard to build integration test inside yarpc to verify this specific behavior.

However, the worst thing could happen is that the log still gets emitted even when the caller timed out. That's the same behavior we currently have. So it would not become worse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here @fangzhou37ub -- can we pass a test logger (e.g., http://go.uber.org/zap/zaptest/observer) to the server, and verify that there's no logs emitted?

Without the change, the test should have logs, it doesn't matter if the caller timeout expires as well, we should still see logs?

@kriskowal
Copy link
Contributor

kriskowal commented Dec 3, 2019 via email

@fangzhou37ub
Copy link
Author

fangzhou37ub commented Dec 3, 2019

Thanks @kriskowal for your comment.

Elsewhere in YARPC, we use plain HTTP clients to call YARPC HTTP clients to
emulate bad client behaviors.

From my understanding. If the HTTP client takes golang context, then it's not helpful because the Golang would cancel the context and handle the request properly.
E.g. many of the http client in the repo would finally reach this:

func sendRequestAndValidateResp(t testing.TB, out transport.UnaryOutbound, opts api.RequestOpts) {
	f := func(i int) bool {
		resp, cancel, err := sendRequest(out, opts.GiveRequest, opts.GiveTimeout)
		defer cancel()
...

// sendRequest is defined here:
func sendRequest(out transport.UnaryOutbound, request *transport.Request, timeout time.Duration) (*transport.Response, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	resp, err := out.Call(ctx, request)
...

Our caller use tchannel-python to send requests. For that, the context would not be cancelled like golang, then the issue appears.

Feel free to let me know if I misunderstand your suggestion.

peats-bond pushed a commit that referenced this pull request May 18, 2020
This fixes a minor logging annoyance. Some users have reported large
amounts of `SendSystemError failed` and `responseWriter failed to
close` logs. This was introduced in #1561 to help improve
observability of TChannel system failures.

However, if a client times out, we'll still log the failures; these
logs are are uninformative/unactionable to users as they are expected.

With this diff, we only log TChannel system failures when the client
is still waiting for a response, ie has yet to time out.

Ref T5305225#107015299, supersedes #1814
peats-bond pushed a commit that referenced this pull request May 18, 2020
This fixes a minor logging annoyance. Some users have reported large
amounts of `SendSystemError failed` and `responseWriter failed to
close` logs. This was introduced in #1561 to help improve
observability of TChannel system failures.

However, if a client times out, we'll still log the failures; these
logs are are uninformative/unactionable to users as they are expected.

With this diff, we only log TChannel system failures when the client
is still waiting for a response, ie has yet to time out.

Ref T5305225#107015299, supersedes #1814
@peats-bond
Copy link
Contributor

closing in favour of #1933

@peats-bond peats-bond closed this May 18, 2020
peats-bond pushed a commit that referenced this pull request May 19, 2020
This fixes a minor logging annoyance. Some users have reported large
amounts of `SendSystemError failed` and `responseWriter failed to
close` logs. This was introduced in #1561 to help improve
observability of TChannel system failures.

However, if a client times out, we'll still log the failures; these
logs are are uninformative/unactionable to users as they are expected.

With this diff, we only log TChannel system failures when the client
is still waiting for a response, ie has yet to time out.

Ref T5305225#107015299, supersedes #1814
peats-bond pushed a commit that referenced this pull request May 19, 2020
…ut (#1933)

This fixes a minor logging annoyance. Some users have reported large
amounts of `SendSystemError failed` and `responseWriter failed to
close` logs. This was introduced in #1561 to help improve
observability of TChannel system failures.

However, if a client times out, we'll still log the failures; these
logs are are uninformative/unactionable to users as they are expected.

With this diff, we only log TChannel system failures when the client
is still waiting for a response, ie has yet to time out.

Ref T5305225#107015299, supersedes #1814
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants