Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

observability: Override responses with Context errors #1930

Merged
merged 4 commits into from
May 19, 2020
Merged

Conversation

peats-bond
Copy link
Contributor

If a caller timed out waiting for a request, it would emit a timeout. However,
the YARPC callee would continue fulfilling the request, and emit/log the success
or error, never reporting a timeout. This causes a misalignment of metrics
between YARPC metrics and caller/sidecar metrics. YARPC callees were never
returning timeout errors.

To fix this, this change overrides handler responses with the context error, if
the observability middleware sees that the context deadline expired or if there
was a context cancellation. Users responses are thrown away, but logged in
the existing YARPC logs, under a dropped field.

Commits are individually reviewable and will be rebased onto dev.

  • Description and context for reviewers: one partner, one stranger
  • Entry in CHANGELOG.md

@codecov
Copy link

codecov bot commented May 15, 2020

Codecov Report

Merging #1930 into dev will increase coverage by 2.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1930      +/-   ##
==========================================
+ Coverage   85.07%   87.07%   +2.00%     
==========================================
  Files         240      249       +9     
  Lines       12512    13133     +621     
==========================================
+ Hits        10645    11436     +791     
+ Misses       1470     1320     -150     
+ Partials      397      377      -20     
Impacted Files Coverage Δ
...rnal/crossdock/client/errorshttpclient/behavior.go 97.90% <100.00%> (-0.01%) ⬇️
internal/observability/call.go 90.34% <100.00%> (+1.11%) ⬆️
internal/observability/middleware.go 96.55% <100.00%> (+0.60%) ⬆️
peer/hashring32/list.go 69.09% <0.00%> (ø)
peer/hashring32/ring.go 73.43% <0.00%> (ø)
compressor/snappy/snappy.go 100.00% <0.00%> (ø)
peer/hashring32/config.go 87.50% <0.00%> (ø)
peer/peerlist/v2/peer.go 82.60% <0.00%> (ø)
internal/clock/fake.go 94.31% <0.00%> (ø)
peer/peerlist/v2/list.go 71.32% <0.00%> (ø)
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4632f2b...c615aeb. Read the comment docs.

peats-bond pushed a commit that referenced this pull request May 15, 2020
This drops the "handler failed" log in the TChannel transport. This
log was unnecessarily added when we were increasing observability
around TChannel internal errors in #1561.

The context error override in #1930 ensures that makes this log
redundant as richer information exists in observability logs,
including latency and request attributes.

Furthermore, we've had issues with this log since the latency is
included in the message and makes aggregation extremely difficult.

Ref T5802517
peats-bond pushed a commit that referenced this pull request May 15, 2020
This drops the "handler failed" log in the TChannel transport. This
log was unnecessarily added when we were increasing observability
around TChannel internal errors in #1561.

The context error override in #1930 ensures that makes this log
redundant as richer information exists in observability logs,
including latency and request attributes.

Furthermore, we've had issues with this log since the latency is
included in the message and makes aggregation extremely difficult.

Ref T5802517
peats-bond pushed a commit that referenced this pull request May 18, 2020
This drops the "handler failed" log in the TChannel transport. This
log was unnecessarily added when we were increasing observability
around TChannel internal errors in #1561.

The context error override in #1930 ensures that makes this log
redundant as richer information exists in observability logs,
including latency and request attributes.

Furthermore, we've had issues with this log since the latency is
included in the message and makes aggregation extremely difficult.

Ref T5802517
Alex Peats-Bond added 4 commits May 19, 2020 16:45
This duplicates the `EndWithAppError` functions so that `Call` and
`Handle` have different call paths, `EndCallWithAppError` and
`EndHandleWithAppError` respectively. The handle signature will change
in the next commit.

No functionality has changed.
THe majority of code here is identical to that introduced and
validated in D4334363.

Here, we add a context override function that determines whether or
not to throw away the user's response, in favour of the context
error. This ensures that YARPC metrics better reflect timeouts,
including higher correlation with sidecar/proxy metrics.
This enables the `End{Call,Handle}WithAppError` methods to set
additional log fields. This will be used in the following commit to
add observability into dropped responses.

No functionality has changed
Now that we are potentially overriding handler responses with context
errors, we need to let users know somehow. This change adds a
`dropped` log field into the existing metrics, showing the underlying
success, error or application error that was dropped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants