-
Notifications
You must be signed in to change notification settings - Fork 661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite gRPC server interceptor #1171
Conversation
Hello! I have a PR to move some files you have in this PR to the Contrib repo, please let me know if this gets merged before the PR in the Contrib repo. Please see https://github.com/open-telemetry/opentelemetry-python-contrib/pulls/59 |
I'd be happy to make this PR in the other repo if you all prefer... |
|
||
# Split up the peer to keep with how other telemetry sources | ||
# do it. This looks like ipv6:[::1]:57284 or ipv4:127.0.0.1:57284. | ||
host, port = context.peer().split(":", 1)[1].rsplit(":", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to b the same format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, but I could put a check in there to be sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found some references to expected formats in the C library, so it looks like yes, this is the format it will be.... except for that one which is comma-separated, so I definitely need to update this. Though I'm not sure I understand why there are two addresses there.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I'd suggest to be defensive around code like this and perhaps wrap it in a try .. except block with an error log. We don't want instrumentation to break apps :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it and made the log a warning rather than error, since it's not really an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't read it. I'd say it actually is an error as an exception is raised. We are just silencing the error so it doesn't crash users' apps. IMO it should surface in the loudest possible manner without crashing apps so we get to know about it and fix the instrumentation. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree. This condition is not something I as the application developer would have any control over, and by making it an error rather than warning, now I have to consider suppressing messages from the tracing library in order to not have messages about errors I can't fix in my logs, which I may do things like trigger alerts with via tools such as sentry.io.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Looks like a nice improvement. Left some comments/questions.
# we handle in our context wrapper. | ||
# Here, we're interested in uncaught exceptions. | ||
# pylint:disable=unidiomatic-typecheck | ||
if type(error) != Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concerned that this might result in some non-gRPC abort exceptions being ignored. Do gRPC aborts raise Exception
instead of some specialized type like GRPCAbortException
? Is there any other way to test for this like exception message?
Also, what if we did record gRPC abort exceptions? Is that unexpected? Does gRPC use exceptions as flow control instead of error in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah unfortunately once we're at this point, if the code called grpc.abort()
the exception here is a bare Exception
with no useful details at all (see here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_server.py#L319), and there's no other place to capture it. I've tried a whole bunch of ways to get this over the past year or so, and this is the best I've come up with.
If abort()
was not called, the exception here is whatever was raised, and ultimately it'll get wrapped up in a grpc.RpcError
with status = UNKNOWN
(HTTP 500 equivalent) and the client will receive that result.
I've certainly tried to find some better way to handle this, but there's just not any other information available to the interceptor here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the code you linked, it seems the function is setting always sets code
to a non-OK value indicating this is always an error condition. In that case, may be we should always record it with span.record_exception()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that works for me - I'm not as familiar with Otel itself as you guys :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh actually - I forgot how this actually works here. A bare exception is going to usually be a grpc abort, but not all of those are actually exceptions, there are user errors such as the equivalents of HTTP 400 series which aren't really server exceptions. Do we want to mark those as exceptions in Otel? I wouldn't think so...
I handle grpc aborts in the abort()
call in _OpenTelemetryServicerContext
(and can certainly use span.record_exception()
for the non-user stuff there), this block here was intended to catch anything that is NOT an abort, and therefore a real exception, which are unlikely to be bare Exception
(because of good practice).
There's no other way to distinguish a true bare Exception
from a grpc abort here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
404s are not treated as errors generally, no. Does this case happen in case of a 404?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - gRPC unfortunately uses exceptions as program flow, so all 400-like errors which you cause by running grpc.abort()
show up here, as bare exceptions. That's the part I was trying to work around and why I didn't treat those as exceptions.
I can add some checks for the "not user error" types of exceptions in the abort()
call in _OpenTelemetryServicerContext
, it's a small number of them that are actual exceptional conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's do it if it is easy enough and doesn't complicate the instrumentation too much. Otherwise I think it's fine the way you have it as it is not that likely that users will raise bare Exception (but still possible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh this is complicated - at the point where I'd like to record this as an exception, there's not a useful exception to record, because aborts aren't actually exceptions.
Following conventions in the docs on HTTP details in traces, I'll set the Status
properly.
Edit: it seems I already do, but I'll make sure these are set properly, they aren't right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay did a bit more digging - it's been a while since I wrote this. The parts in the Context
wrapper where I set the status code line up with how Otel status codes work - they match 1:1 with the grpc.StatusCode
values and so there's no change needed here to handle these cases.
# Here, we're interested in uncaught exceptions. | ||
# pylint:disable=unidiomatic-typecheck | ||
if type(error) != Exception: | ||
span.set_attribute("error", repr(error)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exceptions should be recorded as events in Otel. I think we should use span.record_exception(error)
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I didn't know that was a thing - will do.
The last PR for the move to Contrib just needs this change to go in so that we can take the changes and be fully moved over! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you for the PR! LGTM just a few comments on the attributes
...tation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/__init__.py
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...ntation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/_server.py
Outdated
Show resolved
Hide resolved
...tation/opentelemetry-instrumentation-grpc/src/opentelemetry/instrumentation/grpc/__init__.py
Outdated
Show resolved
Hide resolved
FYI the linter test passes locally, and the code I changed in the most recent PR doesn't change enough functionality to start failing a test - I assume something changed? Do I rebase? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I also find it weird that the lint changes failed, but I think a rebase would be the right way to go either way since it looks like f3cdfa2 changed |
I really dislike this pattern, changing other libraries, but it's the way that most of these things work, so I'll stick with it. One can still use the Interceptor directly if desired. Also reworked the wrapped server so it's no longer a wrapper, but simply adds on our interceptor. This allows adding any custom interceptors as well. A caveat here is that if you need to specifically order interceptors, you'll need to do it manually rather than using the instrumentor-enable thing.
Based on a suggestion by @owais - thanks!
Apparently the peer value can include a comma-separated address, see the tests here: https://github.com/grpc/grpc/blob/master/test/core/client_channel/resolvers/sockaddr_resolver_test.cc#L90 This updates the parsing, and adds a try/catch block with a warning log.
…emetry/instrumentation/grpc/__init__.py Co-authored-by: Aaron Abbott <aaronabbott@google.com>
Per suggestion, these attributes should be in the `rpc` namespace.
I was using Zipkin and some useful flags were missing, but when using Jaeger they're present, so I didn't actually need to set these explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and for working through all the feedback! This looks good, just one question that comes to mind, do we need to create an issue to handle errors differently now that they no longer map to grpc status codes after this change #1214, if so we can address in a separate issue/pr
Oh I didn't see that. It does seem like we should... do you have a link to the current official doc? The PR you linked and the master branch both still have |
This PR that made the change is this one https://github.com/open-telemetry/opentelemetry-python/pull/1282/files#diff-88b090cfdc1fdd9f100dfb755740b7a419035461805927dbee8197534634eec3. Here's the current state of things: https://github.com/open-telemetry/opentelemetry-python/blob/master/opentelemetry-api/src/opentelemetry/trace/status.py |
Oops, I was looking at the wrong version of this file - I see what you mean now. I'll make another PR with the fix tonight. |
…-telemetry#1171) Co-authored-by: Daniel Dyla <dyladan@users.noreply.github.com>
Description
Rework the gRPC server interceptor to follow current practices with gRPC, and to fix several problems:
The previous method would wrap the
grpc.Server()
improperly in order to add its interceptor, preventing interceptors from be added after the OpenTelemetry interceptor. I've removed the wrapper entirely, as it's not necessary, and made the Interceptor be part of the public API, to allow users to instantiate this and insert it into the interceptor chain anywhere they want. There is still theinstrument
anduninstrument
methods to basically monkeypatch the gRPC library, should you choose to use that, but these are reworked to provide the Interceptor in a more modern way.The wrapper around the gRPC
Context
was incompletely implemented, probably before new features were added.The method used to try to catch errors in RPCs was incomplete - interceptors cannot inspect the results of a call, so we now use our Context wrapper to obtain response codes and messages. We do still catch non-abort exceptions (i.e. those which are not an empty
Exception()
) in the Interceptor, and provide any message via the span.New features:
abort()
) exceptionsFixes #1112
Type of change
This change is slightly breaking, depending on how you used the library. If you follow the minimalistic example in the docs, nothing will have changed, but if you implemented like in the sample application, there will be changes required.
How Has This Been Tested?
I've used the standard tests provided, and implemented a detailed application with this library as it is now. I could probably add some tests for the new data in the spans, but I'm not very familiar with that part of OpenTelemetry.
Checklist:
I'm not sure which changelog needs to be updated, or what to put in it.