
Should aws-sdk instrumentation emit CLIENT spans and suppress downstream http client library instrumentation? #440

Closed
trask opened this issue May 26, 2020 · 18 comments

Comments

@trask
Member

trask commented May 26, 2020

This seems reasonable to me; wondering what others think.

See comment by @anuraaga at open-telemetry/opentelemetry-specification#530 (comment).

@anuraaga
Contributor

For a bit more context, CLIENT / SERVER spans are important for modeling service dependencies, e.g. in a dependency graph like Zipkin's. By modeling HTTP as the CLIENT span, something like HTTP POST becomes the service that shows up in everyone's service graph, whereas we actually want, e.g., DynamoDb there.

@iNikem
Contributor

iNikem commented May 26, 2020

@trask do you mean a solution specific to aws-sdk, or a generic one as proposed in open-telemetry/opentelemetry-specification#530?

@trask
Member Author

trask commented May 26, 2020

First, I'm just wondering if we want to do this:

Should aws-sdk instrumentation emit CLIENT spans and suppress downstream http client library instrumentation?

(I'm leaning towards yes, but would like others' opinions)

I haven't really thought about how we would implement this yet (a generic mechanism supported by the OpenTelemetry API seems ideal).

@iNikem
Contributor

iNikem commented May 27, 2020

In some sense there is a general problem here for both client and server spans. For monitoring both incoming and outgoing requests, you usually have some higher-level library on the inside and low-level middleware on the outside. Like this:

Servlet Filter -> Spring MVC -> AWS SDK -> Netty

We certainly need generic instrumentation for lower-level components in OTel, like Servlets and Netty. But in the particular case above, users probably care only about Spring and the AWS SDK. So if we consider "suppressing" Netty in this case, we probably should consider "suppressing" Servlets as well.

On the other hand, there is an opinion that OTel auto-instrumentation should just report what has happened, without distorting any truth or doing any magic: an objective observer, not a smart interpreter. In that case one can argue that such filtering/suppressing should happen in the observability backend or in the Collector, but not earlier.

@trask
Member Author

trask commented May 27, 2020

If people agree that the Netty spans underneath the AWS SDK spans don't provide much value, then I think it makes sense to embed those "smarts" into the auto-instrumentation and suppress them right there (so that every backend doesn't have to have this special logic for the combination of AWS SDK + Netty, and also to save on the cost of capturing/transmitting/ingesting the Netty span data).

[btw, in the Servlet Filter / Spring MVC case, if anything, I would want an option to suppress the Spring MVC span, and keep the overarching Servlet span]

@anuraaga
Contributor

anuraaga commented May 28, 2020

Strong +1 to suppressing the Netty spans, but open to it being configurable if there seem to be users that wouldn't need that.

We could take a step back and consider this for any RPC, not just the AWS SDK - the service information is generally going to be in the RPC instrumentation, not its HTTP instrumentation. In every trace system I've looked at, marking the HTTP spans as client/server and the RPC spans as internal breaks service / dependency graphs, because they rely on the leaf/root spans between services being the RPC.

Instead of

Client -- DynamoDb
       \
        S3

it's

Client -- HTTP POST

because HTTP POST is the only client span in the trace. Trying to normalize this in collectors might be possible, but at the least it'd require some complicated buffering.

What I would love is if HTTP instrumentation could augment a parent RPC span instead of replace it. The ideal client span for me would have

  • Kind = CLIENT (a similar span will have kind = SERVER on the backend side too)
  • Name of RPC service / method (e.g. DynamoDb.PutItem)
  • RPC status code
  • HTTP content length
  • Timed events for things like DNS resolution, payload send, payload receive

Hope this context helps; maybe the concept of multiple instrumentations working on the same span can be an interesting one to explore.
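
A minimal sketch of what such a merged CLIENT span could look like, using the OpenTelemetry Java API; the class, attribute keys, and event names here are illustrative assumptions, not the actual aws-sdk instrumentation:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;

// Sketch only: one CLIENT span per RPC, named after the service/method,
// which the transport layer later enriches instead of starting its own span.
class MergedRpcClientSpanSketch {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("aws-sdk-sketch");

  Span startRpcSpan() {
    Span span = tracer.spanBuilder("DynamoDb.PutItem") // RPC service/method, not "HTTP POST"
        .setSpanKind(SpanKind.CLIENT)                  // a matching SERVER span lives on the backend side
        .startSpan();
    span.setAttribute("rpc.service", "DynamoDb");
    span.setAttribute("rpc.method", "PutItem");
    return span;
  }

  // Called by the HTTP/transport instrumentation against the *same* span;
  // the RPC instrumentation still decides when the span ends.
  void enrichWithTransportDetails(Span span, long requestContentLength) {
    span.setAttribute("http.request_content_length", requestContentLength);
    span.addEvent("dns.resolved");       // timed events for transport milestones
    span.addEvent("response.received");
  }
}
```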

@iNikem
Contributor

iNikem commented May 28, 2020

[btw, in the Servlet Filter / Spring MVC case, if anything, I would want an option to suppress the Spring MVC span, and keep the overarching Servlet span]

Can you explain your preference here? Am I correct that you want to suppress the Spring span on the assumption that http.route is propagated from it to the parent Servlet span?

@iNikem
Contributor

iNikem commented May 28, 2020

Then again, suppressing HTTP spans will probably require transferring their attributes, like http.host and net.peer, to the RPC span. Aren't we trying to push too much magic into auto-instrumentation?

@anuraaga
Contributor

I think a couple of advantages instrumentation has over the collector in this case are

  • Instrumentation is stateful and has the hierarchy of spans. Doing processing that is based on the hierarchy (in this case folding HTTP spans into RPC spans) is trivial. But collectors process spans individually, so pushing hierarchy processing to them is possible with buffering but doesn't seem natural.

  • Instrumentation knows what's RPC and what's HTTP better than the collector. We do have some semantic conventions, but it seems fragile to rely on them. If we really wanted to push this to collectors, I think we'd want a formal API - I guess something like a span kind TRANSPORT. The current model, with only one SERVER span and a leaf CLIENT span and everything else INTERNAL, is not expressive enough once these transport-layer spans are included.

Anything can be solved with some code :) But it feels to me like handling these hierarchies is natural in the instrumentation and becomes heuristics (magic?) once it leaves the app.

@iNikem
Contributor

iNikem commented May 28, 2020

Why do you think that instrumentation has the hierarchy of spans? That may be correct in the synchronous case, but things become more complicated in the async case. If this assumption of yours is correct, then things do become clearer.

@anuraaga
Contributor

"Hierarchy" may have been unclear, since it's not necessarily from the root but a sort of local hierarchy. Even in the asynchronous case, you must have access to at least your parent or you will not be able to set the parent ID.

That being said it does bring to mind a contrived example.

The RPC span starts, passes the job to another thread, and ends. The other thread then runs the HTTP connection, which gets the client span. It seems wrong, at least from a UX perspective, if the RPC span ends in microseconds and the transport takes milliseconds; users want to know how long the RPC took - transport is just an implementation detail.

Does this example also help illustrate how instrumentation needs to be aware of this tight coupling between RPC and transport?

@iNikem
Contributor

iNikem commented May 28, 2020

To be precise, you must have access to your parent id, not the whole parent span :) But in practice I think we always pass the whole span around, so fair point.

Very good example, it really brings up a lot of thoughts... Maybe it should be just one span? Which starts when the RPC starts and ends when the HTTP transport finishes its job. With attributes "merged" from both sources.

It seems that we are at a very important crossroads: should instrumentations be independent, or should they tightly and explicitly collaborate?

@trask
Member Author

trask commented May 29, 2020

Which starts when the RPC starts and ends when the HTTP transport finishes its job.

Async RPC clients typically have some kind of callback that we use to know when the RPC request is fully complete (i.e. when the response has been fully received).

So I believe async RPC spans are already correct from a timing perspective (no need to end the span in the underlying transport).

It seems that we are at a very important crossroads: should instrumentations be independent, or should they tightly and explicitly collaborate?

For the specific proposal in the title, I don't think this requires any tight or explicit collaboration.

It just requires a mechanism to suppress downstream spans (e.g. open-telemetry/opentelemetry-specification#530).
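
A minimal sketch of what such a suppression mechanism could look like, using the OpenTelemetry Context API; the SUPPRESS_HTTP_CLIENT_SPANS key is hypothetical, not an existing API, and is only meant to illustrate the idea in the spec issue:

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.ContextKey;

// Sketch only: the RPC instrumentation marks the Context when it starts its
// CLIENT span; downstream HTTP client instrumentation checks the marker and
// skips creating its own span. The key name is hypothetical.
final class ClientSpanSuppressionSketch {
  static final ContextKey<Boolean> SUPPRESS_HTTP_CLIENT_SPANS =
      ContextKey.named("suppress-http-client-spans");

  // aws-sdk instrumentation: call when starting the RPC CLIENT span.
  static Context markSuppressed(Context context) {
    return context.with(SUPPRESS_HTTP_CLIENT_SPANS, true);
  }

  // http client (e.g. Netty) instrumentation: check before starting a span.
  static boolean isSuppressed(Context context) {
    return Boolean.TRUE.equals(context.get(SUPPRESS_HTTP_CLIENT_SPANS));
  }
}
```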

@anuraaga
Contributor

@trask Yeah, my example was contrived to show what "losing your parent" might look like, but currently the instrumentation is fine.

Suppressing transport spans rather than including them is still better for modeling in my view, but treating "transport instrumentation" specially, so it can contribute to an RPC span by adding transport attributes, does seem even nicer. I can't think of any scenarios other than transport, so having a special category of instrumentation for them could make sense.

@trask
Member Author

trask commented May 29, 2020

This makes sense to me. And I think the modeling issue you raise is important, where you want the CLIENT span to be "DynamoDB" instead of "HTTP GET".

I could see the Netty instrumentation supporting a "transport" mode, where (based on some Context flag) it sets attributes on the parent span (and maybe adds events to the parent span) instead of creating its own span.
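
A minimal sketch of that "transport" mode, assuming a hypothetical TRANSPORT_MODE context flag; the flag, method, and attribute keys are illustrative, not actual instrumentation code:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.ContextKey;

// Sketch only: when an RPC instrumentation has set the (hypothetical) flag,
// the Netty instrumentation augments the enclosing RPC span instead of creating its own.
final class TransportModeSketch {
  static final ContextKey<Boolean> TRANSPORT_MODE = ContextKey.named("transport-mode");

  static void onHttpRequest(Context context, String peerName, long contentLength) {
    if (Boolean.TRUE.equals(context.get(TRANSPORT_MODE))) {
      Span rpcSpan = Span.fromContext(context);        // the parent RPC CLIENT span
      rpcSpan.setAttribute("net.peer.name", peerName);
      rpcSpan.setAttribute("http.request_content_length", contentLength);
      rpcSpan.addEvent("transport.request.sent");
    } else {
      // normal mode: start a regular Netty CLIENT span here
    }
  }
}
```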

@iNikem
Contributor

iNikem commented May 29, 2020

For the specific proposal in the title, I don't think this requires any tight or explicit collaboration.

I don't like the idea of bringing in one more special case of inter-instrumentation communication without taking the bigger picture into account. We will end up with one specific case for Spring-Servlet communication, one specific case for RPC-Netty communication, etc. I would very much prefer a limited set of common mechanisms for how such communication can happen, documented and obvious from reading the code.

@iNikem
Contributor

iNikem commented Jul 10, 2020

@trask @anuraaga I think we have agreed that this should indeed be done. So I propose to close this issue by answering "yes" and following the actual implementation in #465.

@anuraaga
Contributor

Yup
