
Should aws-sdk instrumentation emit CLIENT spans and suppress downstream http client library instrumentation? #440

Closed
trask opened this issue May 26, 2020 · 18 comments

Comments

@trask
Member

trask commented May 26, 2020

This seems reasonable to me; wondering what others think.

See comment by @anuraaga at open-telemetry/opentelemetry-specification#530 (comment).

@anuraaga
Contributor

For a bit more context, CLIENT / SERVER spans are important for modeling service dependencies, e.g. in a dependency graph like Zipkin's. By modeling HTTP as the CLIENT span, something like HTTP POST becomes the service that shows up in everyone's service graph, whereas we actually want, e.g., DynamoDb there.

@iNikem
Contributor

iNikem commented May 26, 2020

@trask do you mean a solution specific to aws-sdk, or a generic one as proposed in open-telemetry/opentelemetry-specification#530?

@trask
Member Author

trask commented May 26, 2020

First, I'm just wondering if we want to do this:

Should aws-sdk instrumentation emit CLIENT spans and suppress downstream http client library instrumentation?

(I'm leaning towards yes, but would like others' opinions)

I haven't really thought about how we would implement this yet (a generic mechanism supported by the OpenTelemetry API seems ideal).

@iNikem
Contributor

iNikem commented May 27, 2020

In some sense there is a general problem here for both client and server spans. For monitoring both incoming and outgoing requests, you usually have some higher-level library on the inside and low-level middleware on the outside. Like this:

Servlet Filter -> Spring MVC -> AWS SDK -> Netty

We certainly need generic instrumentation for lower-level components in OTel, like Servlets and Netty. But in the particular case above, users probably care only about Spring and the AWS SDK. So if we consider "suppressing" Netty in this case, we probably should consider "suppressing" Servlets as well.

On the other hand, there is an opinion that OTel auto-instrumentation should just report what has happened, without distorting any truth or doing any magic: an objective observer, not a smart interpreter. In that case one can argue that such filtering/suppressing should happen in the observability backend or in the Collector, but not earlier.

@trask
Member Author

trask commented May 27, 2020

If people agree that the Netty spans underneath the AWS SDK spans don't provide much value, then I think it makes sense to embed those "smarts" into the auto-instrumentation and suppress them right there (so that every backend doesn't have to have this special logic for the combination of AWS SDK + Netty, and also to save on the cost of capturing/transmitting/ingesting the Netty span data).

[btw, in the Servlet Filter / Spring MVC case, if anything, I would want an option to suppress the Spring MVC span, and keep the overarching Servlet span]

@anuraaga
Contributor

anuraaga commented May 28, 2020

Strong +1 to suppressing the Netty spans, but open to it being configurable if there seem to be users that wouldn't need that.

We could take a step back and consider this for any RPC, not just the AWS SDK - the service information is generally going to be in the RPC instrumentation, not its HTTP instrumentation. In every trace system I've looked at, marking the HTTP spans as client/server and the RPC spans as internal breaks service / dependency graphs, because they rely on the leaf/root spans between services being the RPC.

Instead of

Client -- DynamoDb
       \
        S3

it's

Client -- HTTP POST

because HTTP POST is the only client span in the trace. Trying to normalize this in collectors might be possible, but at the least it'd require some complicated buffering.

What I would love is if HTTP instrumentation could augment a parent RPC span instead of replace it. The ideal client span for me would have

  • Kind = CLIENT (a similar span will have kind = SERVER on the backend side too)
  • Name of RPC service / method (e.g. DynamoDb.PutItem)
  • RPC status code
  • HTTP content length
  • Timed events for things like DNS resolution, payload send, payload receive

Hope this context helps; maybe the concept of multiple instrumentations working on the same span can be an interesting one to explore.
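
A minimal sketch of what such a merged CLIENT span could look like, using the OpenTelemetry Java API; the class, attribute keys, and event names here are illustrative assumptions, not the actual aws-sdk instrumentation:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;

// Sketch only: one CLIENT span per RPC, named after the service/method,
// which the transport layer later enriches instead of starting its own span.
class MergedRpcClientSpanSketch {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("aws-sdk-sketch");

  Span startRpcSpan() {
    Span span = tracer.spanBuilder("DynamoDb.PutItem") // RPC service/method, not "HTTP POST"
        .setSpanKind(SpanKind.CLIENT)                  // a matching SERVER span lives on the backend side
        .startSpan();
    span.setAttribute("rpc.service", "DynamoDb");
    span.setAttribute("rpc.method", "PutItem");
    return span;
  }

  // Called by the HTTP/transport instrumentation against the *same* span;
  // the RPC instrumentation still decides when the span ends.
  void enrichWithTransportDetails(Span span, long requestContentLength) {
    span.setAttribute("http.request_content_length", requestContentLength);
    span.addEvent("dns.resolved");       // timed events for transport milestones
    span.addEvent("response.received");
  }
}
```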

@iNikem
Contributor

iNikem commented May 28, 2020

[btw, in the Servlet Filter / Spring MVC case, if anything, I would want an option to suppress the Spring MVC span, and keep the overarching Servlet span]

Can you explain your preference here? Am I correct that you want to suppress the Spring span on the assumption that http.route is propagated from it to the parent Servlet span?

@iNikem
Contributor

iNikem commented May 28, 2020

Then again, suppressing HTTP spans will probably require transferring their attributes, like http.host and net.peer, to the RPC span. Aren't we trying to push too much magic into auto-instrumentation?

@anuraaga
Contributor

I think a couple of advantages instrumentation has over the collector in this case are

  • Instrumentation is stateful and has the hierarchy of spans. Doing processing that is based on the hierarchy (in this case folding HTTP spans into RPC spans) is trivial. But collectors process spans individually, so pushing hierarchy processing to them is possible with buffering but doesn't seem natural.

  • Instrumentation knows what's RPC and what's HTTP better than the collector. We do have some semantic conventions, but it seems fragile to rely on them. If we really wanted to push this to collectors, I think we'd want a formal API - I guess something like a span kind TRANSPORT. The current model, with only one SERVER span and a leaf CLIENT span and everything else INTERNAL, is not expressive enough once these transport-layer spans are included.

Anything can be solved with some code :) But it feels to me like handling these hierarchies is natural in the instrumentation and becomes heuristics (magic?) once it leaves the app.

@iNikem
Contributor

iNikem commented May 28, 2020

Why do you think that instrumentation has the hierarchy of spans? That may be correct in the synchronous case, but things become more complicated in the async case. If this assumption of yours is correct, then things do become clearer.

@anuraaga
Contributor

"Hierarchy" may have been unclear, since it's not necessarily from the root but a sort of local hierarchy. Even in the asynchronous case, you must have access to at least your parent or you will not be able to set the parent ID.

That being said it does bring to mind a contrived example.

The RPC span starts, passes the job to another thread, and ends. The other thread then runs the HTTP connection, which gets the client span. It seems wrong, at least from a UX perspective, if the RPC span ends in microseconds and the transport takes milliseconds; users want to know how long the RPC took - transport is just an implementation detail.

Does this example also help illustrate how instrumentation needs to be aware of this tight coupling between RPC and transport?

@iNikem
Contributor

iNikem commented May 28, 2020

To be precise, you must have access to your parent id, not the whole parent span :) But in practice I think we always pass the whole span around, so fair point.

Very good example, it really brings up a lot of thoughts... Maybe it should be just one span? Which starts when the RPC starts and ends when the HTTP transport finishes its job. With attributes "merged" from both sources.

It seems that we are at a very important crossroads: should instrumentations be independent, or should they tightly and explicitly collaborate?

@trask
Member Author

trask commented May 29, 2020

Which starts when the RPC starts and ends when the HTTP transport finishes its job.

Async RPC clients typically have some kind of callback that we use to know when the RPC request is fully complete (i.e. when the response has been fully received).

So I believe async RPC spans are already correct from a timing perspective (no need to end the span in the underlying transport).

It seems that we are at a very important crossroads: should instrumentations be independent, or should they tightly and explicitly collaborate?

For the specific proposal in the title, I don't think this requires any tight or explicit collaboration.

It just requires a mechanism to suppress downstream spans (e.g. open-telemetry/opentelemetry-specification#530).
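
A minimal sketch of what such a suppression mechanism could look like, using the OpenTelemetry Context API; the SUPPRESS_HTTP_CLIENT_SPANS key is hypothetical, not an existing API, and is only meant to illustrate the idea in the spec issue:

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.ContextKey;

// Sketch only: the RPC instrumentation marks the Context when it starts its
// CLIENT span; downstream HTTP client instrumentation checks the marker and
// skips creating its own span. The key name is hypothetical.
final class ClientSpanSuppressionSketch {
  static final ContextKey<Boolean> SUPPRESS_HTTP_CLIENT_SPANS =
      ContextKey.named("suppress-http-client-spans");

  // aws-sdk instrumentation: call when starting the RPC CLIENT span.
  static Context markSuppressed(Context context) {
    return context.with(SUPPRESS_HTTP_CLIENT_SPANS, true);
  }

  // http client (e.g. Netty) instrumentation: check before starting a span.
  static boolean isSuppressed(Context context) {
    return Boolean.TRUE.equals(context.get(SUPPRESS_HTTP_CLIENT_SPANS));
  }
}
```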

@anuraaga
Contributor

@trask Yeah, my example was contrived to show what "losing your parent" might look like, but currently the instrumentation is fine.

Suppressing transport spans rather than including them is still better for modeling in my view, but treating "transport instrumentation" specially, so it can contribute to an RPC span by adding transport attributes, does seem even nicer. I can't think of any scenarios other than transport, so having a special category of instrumentation for them could make sense.

@trask
Member Author

trask commented May 29, 2020

This makes sense to me. And I think the modeling issue you raise is important, where you want the CLIENT span to be "DynamoDB" instead of "HTTP GET".

I could see the Netty instrumentation supporting a "transport" mode, where (based on some Context flag) it sets attributes on the parent span (and maybe adds events to the parent span) instead of creating its own span.
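
A minimal sketch of that "transport" mode, assuming a hypothetical TRANSPORT_MODE context flag; the flag, method, and attribute keys are illustrative, not actual instrumentation code:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.ContextKey;

// Sketch only: when an RPC instrumentation has set the (hypothetical) flag,
// the Netty instrumentation augments the enclosing RPC span instead of creating its own.
final class TransportModeSketch {
  static final ContextKey<Boolean> TRANSPORT_MODE = ContextKey.named("transport-mode");

  static void onHttpRequest(Context context, String peerName, long contentLength) {
    if (Boolean.TRUE.equals(context.get(TRANSPORT_MODE))) {
      Span rpcSpan = Span.fromContext(context);        // the parent RPC CLIENT span
      rpcSpan.setAttribute("net.peer.name", peerName);
      rpcSpan.setAttribute("http.request_content_length", contentLength);
      rpcSpan.addEvent("transport.request.sent");
    } else {
      // normal mode: start a regular Netty CLIENT span here
    }
  }
}
```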

@iNikem
Contributor

iNikem commented May 29, 2020

For the specific proposal in the title, I don't think this requires any tight or explicit collaboration.

I don't like the idea of bringing in one more special case of inter-instrumentation communication without taking the bigger picture into account. We will end up with one specific case for Spring-Servlet communication, one specific case for RPC-Netty communication, etc. I would very much prefer a limited set of common mechanisms for how such communication can happen, documented and obvious from reading the code.

@iNikem
Contributor

iNikem commented Jul 10, 2020

@trask @anuraaga I think we have agreed that this should indeed be done. So I propose to close this issue by answering "yes" and following the actual implementation in #465.

@anuraaga
Contributor

Yup
