
Adding logging to tchannel handler error cases #1561

Merged
merged 2 commits into yarpc:dev from rhang/adding-logging-to-tchannel-handler on Sep 19, 2018

Conversation

r-hang
Contributor

@r-hang r-hang commented Sep 11, 2018

Zap logging was added to understand when handler calls fail, when
SendSystemError fails, and when the responseWriter fails to close.
To test these changes, I added a responseWriter interface factory to
the handler struct so I could control the responseWriter during
testing. A recorder interface was also added to test failure cases
in responseRecorder.

@CLAassistant

CLAassistant commented Sep 11, 2018

CLA assistant check
All committers have signed the CLA.

@r-hang r-hang requested a review from twilly September 11, 2018 19:33
@@ -589,5 +589,4 @@ func TestMiddlewareFailureSnapshot(t *testing.T) {
},
},
}
assert.Equal(t, want, snap, "Unexpected snapshot of metrics.")
}
assert.Equal(t, want, snap, "Unexpected snapshot of metrics.")

It seems like you have two commits changing things back and forth. That's pretty confusing, so squash the changes into one or more logical, independently correct commits. For example, one commit that makes the change to responseWriter interface with new tests and another with the logging feature.


if t.logger == nil {
t.logger = zap.NewNop()
}

t is available at a higher scope and you're banging on t.logger in a loop when it's invariant. If you intend to have a side effect in start then it should be at the top of start.

Such a side effect is a bit confusing since newChannelTransport already sets a nop logger. This block doesn't seem necessary.
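For illustration, a minimal sketch of defaulting the logger once at construction time (trimmed-down, hypothetical types here, not the actual yarpc constructor), so code in start()'s per-service loop never needs to re-check it:

package example

import "go.uber.org/zap"

// ChannelTransport is a trimmed-down stand-in for the real transport.
type ChannelTransport struct {
    logger *zap.Logger
}

// Option configures a ChannelTransport.
type Option func(*ChannelTransport)

// Logger sets a custom zap logger on the transport.
func Logger(l *zap.Logger) Option {
    return func(t *ChannelTransport) { t.logger = l }
}

// newChannelTransport applies the options, then falls back to a no-op
// logger exactly once, so later code can use t.logger unconditionally.
func newChannelTransport(opts ...Option) *ChannelTransport {
    t := &ChannelTransport{}
    for _, opt := range opts {
        opt(t)
    }
    if t.logger == nil {
        t.logger = zap.NewNop()
    }
    return t
}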

_ = call.Response().SendSystemError(getSystemError(err))
h.logger.Error("tchannel callHandler error", zap.Error(err))

"callHandler error" is a bit generic. This could be more descriptive, like "tchannel send system error."

response inboundCallResponse
isApplicationError bool
headerCase headerCase
type tchannelResponseWriter struct {

Stuttering: tchannel.tchannelResponseWriter

Contributor Author

I agree and was thinking about this for a while. I felt that newResponseWriter was the best name for the interface (since it was the most generic) and this was the best name I thought of for the original responseWriter struct - any suggestions for a better name?

Buffer *bufferpool.Buffer
Response inboundCallResponse
ApplicationError bool
HeaderCase headerCase

Do we need to change this struct at all?

Contributor Author
@r-hang r-hang Sep 12, 2018

I believe so. From my understanding, the fields of handlerWriter need to be exposed as public in order to allow faultyResponseWriter struct{ handlerWriter } to embed its methods.


I don't think we need to modify this since the use is in faultyHandlerWriter and its methods don't need access to these fields.

https://play.golang.org/p/qLbk_7Kwi1_U
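For reference, a standalone sketch of the point behind that playground link (the field and method names below are made up for the example): embedding promotes handlerWriter's methods onto faultyHandlerWriter without requiring handlerWriter's fields to be exported.

package main

import (
    "errors"
    "fmt"
)

// handlerWriter keeps its fields unexported; its own methods operate on them.
type handlerWriter struct {
    headers map[string]string
}

func (hw *handlerWriter) AddHeader(key, value string) {
    if hw.headers == nil {
        hw.headers = make(map[string]string)
    }
    hw.headers[key] = value
}

func (hw *handlerWriter) Close() error { return nil }

// faultyHandlerWriter embeds handlerWriter: AddHeader is promoted onto it
// even though handlerWriter exports no fields; Close is overridden to
// simulate a failure.
type faultyHandlerWriter struct{ handlerWriter }

func (fw *faultyHandlerWriter) Close() error {
    return errors.New("faulty handler writer: close failed")
}

func main() {
    fw := &faultyHandlerWriter{}
    fw.AddHeader("foo", "bar") // promoted method from the embedded handlerWriter
    fmt.Println(fw.Close())    // faulty handler writer: close failed
}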

@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch 4 times, most recently from 2f69a2b to dfaee48 Compare September 12, 2018 22:27
mockCtrl.Finish()
}
}

func TestResponseWriter(t *testing.T) {
tests := []struct {
type test struct {

No need to make a local type for this. The previous form is canonical.
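For reference, a small sketch of that canonical table-driven form (the test name and cases below are illustrative, not taken from this PR):

package example

import (
    "errors"
    "testing"

    "github.com/stretchr/testify/assert"
)

func TestErrorMessages(t *testing.T) {
    // The anonymous struct slice declared inline is the canonical form;
    // no named local type is needed.
    tests := []struct {
        msg  string
        give error
        want string
    }{
        {msg: "plain error", give: errors.New("boom"), want: "boom"},
        {msg: "wrapped text", give: errors.New("call failed: boom"), want: "boom"},
    }
    for _, tt := range tests {
        t.Run(tt.msg, func(t *testing.T) {
            assert.Contains(t, tt.give.Error(), tt.want)
        })
    }
}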

@@ -510,10 +602,12 @@ func TestResponseWriterEmptyBodyHeaders(t *testing.T) {
}

func TestGetSystemError(t *testing.T) {
tests := []struct {
type test struct {

Ditto.


// faultyHandlerWriter mocks a responseWriter.Close() error to test logging behaviour
// inside tchannel.Handle.
type faultyHandlerWriter struct{ handlerWriter }

You could do this:

type faultyHandlerWriter struct{}
func newFaultyHandlerWriter(response inboundCallResponse, format tchannel.Format, headerCase headerCase) responseWriter {
    return &faultyHandlerWriter{}
}

Then you don't need to change the fields in handlerWriter, or record options that are passed in but never used.

Contributor Author

👍

_ = call.Response().SendSystemError(getSystemError(err))
h.logger.Error("tchannel transport handler request failed", zap.Error(err))

We actually want to log when SendSystemError has an error. That is, our client's handler returned an error; it's a system error, so we should pass it along, but sending that error itself failed (it's an error-ception)! See uber/tchannel-go#716
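A self-contained sketch of the pattern being asked for, using a hypothetical stand-in for the inbound call response rather than the tchannel-go API: forward the handler error as a system error, and if that send itself fails, log the send failure instead of dropping it.

package main

import (
    "errors"

    "go.uber.org/zap"
)

// responseSender is a hypothetical stand-in for the inbound call response.
type responseSender interface {
    SendSystemError(err error) error
}

// brokenResponse always fails to send, standing in for a closed connection.
type brokenResponse struct{}

func (brokenResponse) SendSystemError(err error) error {
    return errors.New("connection already closed")
}

func main() {
    logger := zap.NewExample()
    defer logger.Sync()

    handlerErr := errors.New("handler failed")
    var resp responseSender = brokenResponse{}

    // Forward the handler error as a system error; if the send itself
    // fails, log that failure alongside the original handler error.
    if err := resp.SendSystemError(handlerErr); err != nil {
        logger.Error("SendSystemError failed",
            zap.Error(err),
            zap.NamedError("handlerError", handlerErr))
    }
}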

_ = call.Response().SendSystemError(getSystemError(err))
h.logger.Error("tchannel responseWriter failed to close", zap.Error(err))

Same thing here. :-)

@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch 5 times, most recently from 27a3b42 to 6c17f84 Compare September 13, 2018 23:21
@r-hang
Contributor Author

r-hang commented Sep 13, 2018

@twilly I made all of the round 2 changes you requested! I made some other cleanup, such as clarifying the names of certain variables in the test cases and cleaning up the error messages, but I felt those were too minor to include in a commit message.

@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from 6c17f84 to 54d7ed8 Compare September 13, 2018 23:33
_ = call.Response().SendSystemError(getSystemError(err))
if err != nil && !responseWriter.IsApplicationError() {
sendErr := call.Response().SendSystemError(getSystemError(err))
if sendErr != nil {

Typical Go uses this pattern for errors:

if err := foo(); err != nil {
  do(err)
}

You also don't need to call it sendErr since the scope would be limited to inside the if block.

// TODO: log error
_ = call.Response().SendSystemError(getSystemError(err))
sendErr := call.Response().SendSystemError(getSystemError(err))
if sendErr != nil {

Ditto, of course.


if sendErr != nil {
h.logger.Error("SendSystemError failed", zap.Error(sendErr))
}
h.logger.Error("tchannel transport handler request failed", zap.Error(err))

nit: Slightly cleaner messages would be "SendSystemError failed" and "handler failed", respectively. Zap loggers can also be given a name; doing so gives the user that context without repeating it in every message.
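For reference, a tiny standalone example of zap's Named: the logger name carries the transport context so the individual messages can stay short.

package main

import (
    "errors"

    "go.uber.org/zap"
)

func main() {
    base := zap.NewExample()
    defer base.Sync()

    // Named attaches "tchannel" to every entry emitted by this logger,
    // so the messages themselves can stay short.
    logger := base.Named("tchannel")
    logger.Error("handler failed", zap.Error(errors.New("boom")))
    logger.Error("SendSystemError failed", zap.Error(errors.New("connection closed")))
}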

@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from 54d7ed8 to 3f1e836 Compare September 17, 2018 06:46
@twilly twilly left a comment

This looks pretty good. I left a few comments that you can address, but otherwise +1 here. Let's get a second review.

@@ -139,7 +139,7 @@ func (t *ChannelTransport) start() error {
for s := range services {
sc := t.ch.GetSubChannel(s)
existing := sc.GetHandlers()
sc.SetHandler(handler{existing: existing, router: t.router, tracer: t.tracer})
sc.SetHandler(handler{existing: existing, router: t.router, tracer: t.tracer, logger: t.logger, newResponseWriter: newHandlerWriter})

nit: maybe it's better to have the response writer factory in ChannelTransport to copy into handlers, like with the router, tracer, and logger.
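A rough sketch of that suggestion, using hypothetical stand-in types rather than the real yarpc ones: the transport owns the response-writer factory and copies it into each handler it creates, alongside the router, tracer, and logger.

package example

import "go.uber.org/zap"

// Hypothetical stand-ins for the real yarpc types.
type router struct{}
type tracer struct{}
type responseWriter interface{ Close() error }

// handler receives its collaborators by value from the transport.
type handler struct {
    router            *router
    tracer            *tracer
    logger            *zap.Logger
    newResponseWriter func() responseWriter
}

// ChannelTransport owns the response-writer factory, just like the
// router, tracer, and logger, and copies it into every handler.
type ChannelTransport struct {
    router            *router
    tracer            *tracer
    logger            *zap.Logger
    newResponseWriter func() responseWriter
}

// newHandler copies the transport's dependencies, including the factory,
// into a handler; tests can swap the factory on the transport instead of
// on each handler.
func (t *ChannelTransport) newHandler() handler {
    return handler{
        router:            t.router,
        tracer:            t.tracer,
        logger:            t.logger,
        newResponseWriter: t.newResponseWriter,
    }
}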

Contributor Author

Moved in the second commit in this PR

format tchannel.Format
headers []byte
wantHeaders map[string]string
faultyFuncs map[string]bool

Instead of using a map of flags, just set your writer and recorder fields with the ones you want. Then your test instantiation is simpler and looks like this:

newResponseWriter = tt.newResponseWriter
respRecorder = tt.newResponseRecorder()

Contributor Author

👍 will change

headers []byte
wantHeaders map[string]string
faultyFuncs map[string]bool
wantNumLogs int

This is also more complicated than necessary. You can leverage zero values to know whether you should validate log messages; then you don't need to count messages.

Contributor Author
@r-hang r-hang Sep 18, 2018

@twilly I also use this field to count that the expected number of logs are written into the logger for the variety of error cases. Were you suggesting that I should turn this field into a boolean?


Not quite. For example, wantLogMessage is a string. If it's not initialized, then it'll be the zero value, which is the empty string. You can leverage this behavior by testing like this:

if tt.wantLogMessage != "" {
  assert.Equal(t, 1, logs.FilterMessage(tt.wantLogMessage).Len())
}

If you have multiple messages, you can check with a slice of message strings or LoggedEntry, followed by iterating and comparing.

There's a little test philosophy going on here too. When testing log emissions I generally don't like being too strict, like expecting an exact amount and order of messages. Such a test makes mutating code difficult when we're usually expecting an approximate and fuzzy existence. Usually this is because logs are a loose and flexible interface rather than a rigidly necessary one for a functioning system.
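A standalone sketch of this style of assertion using zap's observer package (the test and message names below are illustrative): capture log entries in memory and check that a particular message is present, rather than asserting an exact count and ordering of everything logged.

package example

import (
    "errors"
    "testing"

    "github.com/stretchr/testify/assert"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
    "go.uber.org/zap/zaptest/observer"
)

func TestLogsFilteredByMessage(t *testing.T) {
    // observer captures log entries in memory so the test can inspect them.
    core, logs := observer.New(zapcore.ErrorLevel)
    logger := zap.New(core)

    logger.Error("SendSystemError failed", zap.Error(errors.New("boom")))

    // Presence check keyed on the message, not a strict count of all entries.
    assert.Equal(t, 1, logs.FilterMessage("SendSystemError failed").Len())
}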

Contributor
@kriskowal kriskowal left a comment

It’s not strictly necessary to add any features to the deprecated ChannelTransport; we’re content to let it die on the vine. The Transport is the authoritative version, and the only version that can be constructed from configuration.

👍 to @twilly’s nits.

@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch 4 times, most recently from 6f63e0d to cdf9cf7 Compare September 18, 2018 18:28
@abhinav abhinav added the CI label Sep 18, 2018
@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from cdf9cf7 to d70430a Compare September 18, 2018 18:44
@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from d70430a to 74447e9 Compare September 18, 2018 18:57
@codecov

codecov bot commented Sep 18, 2018

Codecov Report

Merging #1561 into dev will increase coverage by 0.01%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1561      +/-   ##
==========================================
+ Coverage   90.24%   90.25%   +0.01%     
==========================================
  Files         223      223              
  Lines       11160    11169       +9     
==========================================
+ Hits        10071    10081      +10     
  Misses        753      753              
+ Partials      336      335       -1
Impacted Files Coverage Δ
transport/tchannel/channel_transport.go 85.93% <100%> (+0.22%) ⬆️
transport/tchannel/transport.go 85.58% <100%> (+0.26%) ⬆️
transport/tchannel/handler.go 84.67% <85.71%> (+0.7%) ⬆️
transport/tchannel/peer.go 93.84% <0%> (+1.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8184eb8...2f57705. Read the comment docs.

@codecov

codecov bot commented Sep 18, 2018

Codecov Report

Merging #1561 into dev will increase coverage by 0.01%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1561      +/-   ##
==========================================
+ Coverage   90.24%   90.25%   +0.01%     
==========================================
  Files         223      223              
  Lines       11160    11169       +9     
==========================================
+ Hits        10071    10081      +10     
  Misses        753      753              
+ Partials      336      335       -1
Impacted Files Coverage Δ
transport/tchannel/channel_transport.go 85.93% <100%> (+0.22%) ⬆️
transport/tchannel/transport.go 85.58% <100%> (+0.26%) ⬆️
transport/tchannel/handler.go 84.67% <85.71%> (+0.7%) ⬆️
transport/grpc/peer.go 91.26% <0%> (-1.95%) ⬇️
transport/tchannel/peer.go 93.84% <0%> (+1.53%) ⬆️
dispatcher_startup.go 92.85% <0%> (+1.78%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8184eb8...2f57705. Read the comment docs.


@codecov

codecov bot commented Sep 18, 2018

Codecov Report

Merging #1561 into dev will increase coverage by 0.01%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1561      +/-   ##
==========================================
+ Coverage   90.24%   90.25%   +0.01%     
==========================================
  Files         223      223              
  Lines       11160    11169       +9     
==========================================
+ Hits        10071    10081      +10     
  Misses        753      753              
+ Partials      336      335       -1
Impacted Files Coverage Δ
transport/tchannel/channel_transport.go 85.93% <100%> (+0.22%) ⬆️
transport/tchannel/transport.go 85.58% <100%> (+0.26%) ⬆️
transport/tchannel/handler.go 84.67% <85.71%> (+0.7%) ⬆️
transport/tchannel/peer.go 93.84% <0%> (+1.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3961924...afefba8. Read the comment docs.

Zap logging was added to understand when handler calls fail, when
SendSystemError fails, and when the responseWriter fails to close.
To test these changes, I added a responseWriter interface factory to
the handler struct so I could control the responseWriter during
testing. A recorder interface was also added to test failure cases
in responseRecorder.
NewResponseWriter was moved to ChannelTransport and Transport to keep
the tchannel handler's constructor parameters consistent with the
pattern in which these parameters come from the fields of
ChannelTransport and Transport.
@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from 2f57705 to d58d50b Compare September 19, 2018 16:56
@twilly twilly changed the title rhang/adding-logging-to-tchannel-handler Adding logging to tchannel handler error cases Sep 19, 2018
@r-hang r-hang force-pushed the rhang/adding-logging-to-tchannel-handler branch from d58d50b to afefba8 Compare September 19, 2018 17:37
@r-hang r-hang merged commit 360ecaa into yarpc:dev Sep 19, 2018
@r-hang r-hang deleted the rhang/adding-logging-to-tchannel-handler branch September 19, 2018 17:48
peats-bond pushed a commit that referenced this pull request May 15, 2020
This drops the "handler failed" log in the TChannel transport. This
log was unnecessarily added when we were increasing observability
around TChannel internal errors in #1561.

The context error override in #1930 makes this log redundant, as
richer information exists in observability logs, including latency
and request attributes.

Furthermore, we've had issues with this log since the latency is
included in the message and makes aggregation extremely difficult.

Ref T5802517
peats-bond pushed a commit that referenced this pull request May 19, 2020
…ut (#1933)

This fixes a minor logging annoyance. Some users have reported large
amounts of `SendSystemError failed` and `responseWriter failed to
close` logs. This was introduced in #1561 to help improve
observability of TChannel system failures.

However, if a client times out, we'll still log the failures; these
logs are uninformative/unactionable to users as they are expected.

With this diff, we only log TChannel system failures when the client
is still waiting for a response, ie has yet to time out.

Ref T5305225#107015299, supersedes #1814