Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default Tracer propagating SpanContext by default or not #208

Closed
carlosalberto opened this issue Aug 9, 2019 · 15 comments
Closed

Default Tracer propagating SpanContext by default or not #208

carlosalberto opened this issue Aug 9, 2019 · 15 comments
Assignees
Milestone

Comments

@carlosalberto
Copy link
Contributor

Besides the base Tracer interface, the Java implementation offers a default, no-op DefaultTracer (https://github.com/open-telemetry/opentelemetry-java/blob/master/api/src/main/java/io/opentelemetry/trace/DefaultTracer.java) that propagates SpanContext.

This has the advantage of providing context propagation by default, but adds a little bit of implementation-specific part on the API side (as actual in-process propagation needs to be done, which in turn requires additional dependencies, etc).

Consider having this no-op Tracer as a pure no-op instance, without any context propagation at all.

(A possible alternative would be offering the no-op + context propagation option in the SDK, where actual implementation code can take place)

cc @bogdandrutu @iredelmeier

@c24t
Copy link
Member

c24t commented Aug 9, 2019

Suppose an application has a dependency on an OT web framework or RPC integration, but doesn't load the SDK. Don't we want to support context propagation by default in that case?

On the other hand, if there's no mechanism for including the span context in outgoing calls then we shouldn't propagate the context in-process.

It might be helpful to frame the problem this way: libraries depend on OT either to:

  1. generate spans for internal operations (e.g. a ORM integration)
  2. de/serialize span data for incoming/outgoing requests (e.g. a RPC integration)

If an application depends on any library like (2) -- or has an explicit dependency on the API -- it and all other libraries it depends on should get a DefaultTracer by default. If all of the application's dependencies are like (1) then those dependencies should get a true NoopTracer by default.

Does this sound right to you? Zero-config inter-process context prop seems like an important use case to me, but if it turns out that the current defaults mean people are wasting cycles propagating spans internally that are never exported or passed down the call chain then that's a good argument for changing them.

@arminru
Copy link
Member

arminru commented Aug 12, 2019

related to #183

@tsloughter
Copy link
Member

It would make things a lot simpler if it was just a no-op tracer and require the SDK or other impl if context propagation is required.

As I say in #183 it is very unclear what is expected of this default tracer. And I think it leads to potentially more confusion and even bad data if the default tracer is propagating span context without creating new spans. So the DefaultTracer either needs to be no-op or it needs to be the regular tracer found in the SDK.

The only other option I can think of is the default tracer acts the same as the SDK tracer for span start and finish but anything else like recording events, adding attributes, etc would be no-ops.

Hm, or, the default tracer could be the same as the SDK tracer only when called with a remote parent. That way when the remote context comes in to B in a sequence like A->B->C a new span is created and propagated to C but any span creation within B is a no-op since it is a local parent span.

@tedsuo
Copy link
Contributor

tedsuo commented Oct 4, 2019

Hi all, trying to resolve this ticket for the v0.2 release.

For now, I suggest we resolve this by saying the the default will be a No-op for the time being. Some reasoning below.

Property of least surprise

A noop implementation is the most intuitive result, when no SDK is present. It is not possible to correctly intuit any other behavior, one must read the docs. If one has not installed an SDK or taken other action, what are the chances that one has read the docs? Extra headers added to your requests out of nowhere is surprising behavior indeed.

Unexplored edge cases

We have not fully explored all the ways that this could turn sour. Here are some examples:

  • Untrustworthy observations: A sequence A->B->C, where B has no SDK but propagates context. The backend reports the sequence A->C. Operators take mistaken action, assuming that A is talking directly to C. Would you believe anything your observability system reports, after being bitten by this?
  • Turning it off: if there is automatic behavior as a result of adding the API to your library, then there is no way to disable it without configuration. If disabling this behavior requires a code change and a redeploy, that is essentially the equivalent of installing an SDK. Any other approach to configuration would involve the API aggressively touching additional surface areas of the process, such as environment variables or command line arguments. This is generally considered bad behavior by a library; many people would not like it. Not allowing propagation to be disabled opens the door to leaking data, unavoidable overhead, and other problems.
  • Calculating fingerprints: changing the header size can lead to things like oauth not working, load balancers truncating headers, and other surprises. Can you imagine trying to track this down to automated behavior in a library you had not heard of or explicitly enabled? Yikes.

Some of the above is hand wavy. My intention is to show that enabling this behavior is not a simple choice. In a world where we will be asking many, many libraries to natively integrate our API, avoiding surprising (and potentially dangerous) behavior should be a priority. I suggest that even though there are some advantages to propagation-by-default, for now they are outweighed by the risks and unknowns.

@tedsuo tedsuo self-assigned this Oct 4, 2019
@Oberon00
Copy link
Member

Oberon00 commented Oct 4, 2019

Regarding your first point ("Untrustworthy observations"): There was discussion in Python (open-telemetry/opentelemetry-python#159) to solve this by generating new Span IDs even in the API-only propagation. Then instead of A->C you would have a broken trace of two parts A and ???->C. However, I don't like that approach either because it (a) requires additional code (span ID generation) in the API that is also not free to execute at runtime and (b) I actually prefer the "untrustworthy" observation. It is always just a question of level of detail: If service B is a Layer-2 router it will also not be visible in the trace.

One IMHO particularly important point from that disussion: The "untrustworthy observation" may also occur if the SDK is installed and configured but the spans for A and C are sampled while the one for B is not.

@tsloughter
Copy link
Member

+1 to no-op.

@carlosalberto
Copy link
Contributor Author

+1 to no-op for now.

@SergeyKanzhelev
Copy link
Member

Enabling API everywhere is a long term goal. For now API enablement is quite involved process. Mostly people will enable it together with SDK. So API-only behavior can also be seen as API with the "disabled" SDK behavior. Where propagation is expected from my opinion.

Also with the move to separation of context propagation into a separate package in v0.3 that is discussed in oteps - automatic context propagation even in absence of SDK may again be a reality.

So I'd recommend to keep propagation behavior in API-only mode and wait for end user feedback.

@c24t
Copy link
Member

c24t commented Oct 8, 2019

Some comments on "untrustworthy observations" only:

There was discussion in Python ... to solve this by generating new Span IDs even in the API-only propagation

FWIW my reading of the spec is that we always have to generate new span IDs, proxies excepted.

@SergeyKanzhelev is actually the author of that part of the spec: w3c/trace-context#108. He may be able to give us the word of god here.

The "untrustworthy observation" may also occur if the SDK is installed and configured but the spans for A and C are sampled while the one for B is not.

It's technically possible, but I don't think this is likely to happen in practice. The client that creates the first span makes the sampling decision, and downstream clients can either respect that decision or choose not to sample.

This is the difference between the ProbabilitySampler (respects parent sampling decision) and AlwaysOnSampler (doesn't) in OC, and the reason we default to ProbabilitySampler with probability 1.

@c24t
Copy link
Member

c24t commented Oct 8, 2019

My intention is to show that enabling this behavior is not a simple choice. In a world where we will be asking many, many libraries to natively integrate our API, avoiding surprising (and potentially dangerous) behavior should be a priority.

Well said. I agree that we should be careful about making this library too magical and stick to conservative defaults. And that there are a lot of edge cases still to consider.

To make sense of these examples we have to distinguish in- and inter-process context propagation. AFAICT, in every language, the API package itself doesn't change outbound headers. For example:

  • The JS library ships with plugins for http and grpc, and a separate package to load these plugins automatically.
  • In python we provide WSGI middleware to extract the trace context from inbound requests and insert it into the application context, and a requests integration to attach trace context headers to outbound http requests made via the requests library.

The question here isn't whether we're going to add extra headers to requests, but whether applications need to depend on the SDK (in addition to whichever plugins they need) to propagate the trace context.

@tedsuo
Copy link
Contributor

tedsuo commented Oct 9, 2019

@c24t I agree, I do not see how distributed context propagation can occur without installing plugins... so we are talking about a scenario where the are plugins already installed. Eventually native instrumentation become may become popular, and all software libraries could automatically provide context/header propagation. Until that happens, this scenario concerning automated context propagation – without knowledge or action on the part of the application developer – is pretty much an edge case. I don't think there is an obvious way to enable API-only behavior.

More realistically, at this time, I suggest we focus on providing the following for our users:

  • sane SDK defaults - if the config defaults for all of our SDKs are different, and do not compose into a working system, there will be breakages. So we should define critical SDK defaults in the spec.
  • Automatic installation and setup - to the degree possible, every language should provide an installer which automates the setup of both SDK and plugins, which will remove any room for configuration error.

This combo – a zero-config auto-installation – simplifies the process of organizational adoption.

@SergeyKanzhelev I do understand the intent - developers are bad at configuring their systems, and distributed tracing systems benefit from every individual service being properly configured. So it would be great if we could enable context propagation without requiring the application developer to take action. However, for the reasons stated above, I do not believe that an API-only approach would effectively solve this problem any time soon. And since any default behavior creates edge cases which have not been addressed, it seems like this idea needs more time to incubate. Up till now, we've taken a conservative approach to adding new features which have not been validated - I suggest we continue to think about this feature (preferably via an RFC), and for now ship with a no-op implementation.

@SergeyKanzhelev
Copy link
Member

Interestingly that the spec is currently already saying this while the decision that was made in v0.1 was to use W3C everywhere. I'm not sure what to do with this issue as it suggest to make a change that is already in a spec (is there any other mentions I am missing?)

The implementation MUST provide no-op binary and text Propagators, which the Tracer SHOULD use by default if other propagators are not configured. SDKs SHOULD use the W3C HTTP Trace Context as the default text format. For more details, see trace-context.

As for the arguments - we all saying the same words and listing the same assumptions about the state of the world. However we getting to the different conclusion somehow.

I believe that this project with upcoming improvements like context propagation separation, pre-installed OT collector exporter, introducing logging support is going to the direction when the mere fact of presence of OT API must initiate propagation of context. So earlier we will declare that - easier it will be to explain to end users.

This is what every partner in W3C discussions is looking for. End users control less components in their infrastructure and no matter whether they can enable monitoring or not - they want to have an end-to-end visibility. So ideally, the context propagation is a feature of a platform itself. In other words, the feature of an API-only deployments.

Anyway, I'd suggest we change the wording in specs to the W3C in API-only. And get the end-user feedback.

Thoughts?

@tsloughter
Copy link
Member

@SergeyKanzhelev that spec only says what to do when crossing process boundaries. Does context propagate within the process when no SDK is enabled? And if it does, how? How here referring to what is the current active context supposed to be when only the API is loaded and a start_span is encountered? Does the parent span context remain the active span?

@SergeyKanzhelev
Copy link
Member

@tsloughter it's a good question that definitely need a clarification, even for the SDK-only case. I'd say whatever works with the propagators must read/write contexts all the time. Spans needs to be associated with the context. However for inner spans - tracer may return the same span all the time or reuse active.

@carlosalberto
Copy link
Contributor Author

Closing this issue and opening a new one, as propagation went away from the Tracer component.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants