Addressing opencensus-shim breakage/inconsistencies #6876

dmarkwat · 2022-10-14T02:14:16Z

Is your feature request related to a problem? Please describe.
The problem I've encountered is the opencensus-shim, when used with the javaagent, misbehaves and doesn't work the same when the javaagent is not present (I'd go so far as to say: it breaks).

tl;dr: due to the otel javaagent's internals and resulting requirements for properly bridging application and agent Contexts, the very specific ApplicationSpanImpl is required to affect changes in the stored Context. This is something integrations like opencensus-shim do not handle correctly, but otherwise correctly conform to the otel interfaces and API requirements. The result is unexpected -- and on its face, unexplainable -- breakage when the javaagent is used.

I've created an instrumentation that addresses this problem, and even wires up a "baseline" test without the java agent to prove out the situation. However, the solution is nothing that pleases me as it requires the use of reflection directly, and since the java agent works just fine as-is, this felt like more of an "enhancement" than a bug...but I'm torn:

What I outline below is the direct result of the otel instrumentation code placing an implicit, undocumented (afaict anyway) requirement that a specific type of io.opentelemetry.trace.Span be passed back to it for normal operation (details in Additional Context), in a world where the interfaces would -- and in most cases I've encountered do -- suffice. Add to that, there's nothing patently wrong with the way integrations like opencensus-shim behave when they conform to the interfaces and work as expected when the javaagent isn't present.

And while that's certainly what I've come to see as a good indicator of "instrumentation required here", in this case it's instrumenting the instrumented API (opencensus-shim) and its associated wrappers themselves. There's no interface to determine if anything is being wrapped and there's no requirement for the Span interface, for example -- in this case OpenTelemetrySpanImpl, specifically -- to delegate all its otel interface methods to the wrapped Span. And in this specific scenario, that would avert this issue entirely if it did so.

So what's the answer? Do we:

instrument the code as I've done (or in some other way), messy as it sometimes can be, ever struggling against moving targets without interfaces to adequately inform the instrumentors on how/when to properly delegate/unwrap, etc.?
OR, impose/implement the change in the upstream (in this case, opencensus-shim) to delegate all the methods to the wrapped object despite the sufficient conformance to the documented API?

I suppose the latter could have accompanying doc authored for the case and would be preferable when it's under direct control, but is that reasonably always going to be the case?

And, thinking further: I haven't looked too closely at opentracing-shim, but at a glance it looks like it may also fall victim to this problem as well? It's unfortunately a fair bet, and it makes me wonder: are there more out there or otherwise waiting to happen?

Describe the solution you'd like
If I had to pick, on balance, I'd go with option B above: implement the delegation changes in the upstream (in this case, specifically OpenTelemetrySpanImpl in opencensus-shim) and accompany it with documentation/guidance instructing how and when to delegate the calls as part of the standard API.

But, I'm here because I'm definitely interested in what anyone else thinks about this problem, if they've encountered it, and how to solve it. (Or heck: am I completely off my rocker here and this isn't a problem?)

Describe alternatives you've considered
I considered instrumentation of course, as I wrote one to at least solve the acute problem, but I honestly struggled writing it (with an undesirable result no less) for a few reasons, the sum of which weighs it heavily out of favor in this case for me personally:

writing wrappers for OpenTelemetrySpanImpl, for example, proved difficult (impossible?) because the interface shares that of the javaagent classes io.opentelemetry... and not application.io.opentelemetry...: bridging that gap did not seem like a straight line to me even after looking at the opentelemetry instrumentations
creating method overrides would have been necessary here (for OpenTelemetrySpanImpl certainly) and isn't something the byte buddy Advice method transformers can do, and getting the heavier-weight transformer to work proved challenging (the muzzle guide also dissuaded me heavily from this approach, as well)
the final, shortest, simplest result I could achieve using Advice relies on reflection directly and is itself brittle

Additional context
The is a detailed accounting of the problem and the path one needs to follow to see it (from a cold start like I did).

Click here to expand...it's rather long

I tracked the issue down to the following, detailed for posterity and newcomers like myself:

As part of javaagent instrumentation, all context management is brokered by AgentContextStorage and in turn all context manipulation is brokered by AgentContextWrapper
When performing actions such as Span#storeInContext, the call tree follows through whichever Context instance and, barring some intense overriding of methods, ultimately lands in AgentContextWrapper#with
Once there, it traverses the ContextKeyBridges
When that happens, the incoming Span is subjected to this test, and in the case of opencensus-shim it fails but results in no errors or other meaningful output

It fails because in opencensus-shim, the incoming Span isn't an ApplicationSpan, but an OpenTelemtrySpanImpl which wraps the ApplicationSpan which was returned from calls to the various otel-instrumented API methods

In my specific case, I created an Otel Span and attempted to use it as the parent of an opencensus span (implicitly occurring via using the GCP Spanner client + opencensus-shim on my classpath):

I created an otel span and made it current
I then called the spanner client methods to start executing a read against a table which constructs its own spans
At this point, opencensus-shim starts building a new Span to track its work, using io.opencensus.trace.Tracer#spanBuilder which uses the current Context as the parent:
- it correctly pulled the current otel Context out of the instrumented AgentContextStorage as was expected (via OpenTelemetryContextManager), and wrapped it in an oc-shim OpenTelemetryCtx
- through the call chain to obtain an opencensus Span, this OpenTelemetryCtx was then used to obtain an OpenTelemetrySpanImpl via call to SpanConverter, which is done here
- this then propagates down to OpenTelemetrySpanBuilderImpl where the parent is attempted to be assigned to the current context, in effect creating that parent relationship and creating an identity for the current span
- ⚠️ this technically fails due to the aforementioned, but it works because the current context happens to be the parent: a near miss
- the OpenTelemetrySpanImpl is then constructed, and all is well so far: the span and trace ids are accurate, and the parent is assigned from Context#current()
This newly-created Span is then attempted to be scoped and made current via Tracer#withSpan, and this is where the breakdown occurs:
- as we work down the Tracer#withSpan call tree, we encounter ContextHandleUtils#withValue inside the ScopeInSpan class which in turn calls the opencensus-shim's OpenTelemetryContextManager#withValue method
- this call is the start of the call chain that results in the behavior outlined above

The result: the current Span is never assigned as current and calls to Context#current continue to return the parent context created outside and before the calls to the opencensus-shim realm.

The effect: upon ending spans and closing scopes, it's possible -- and indeed happened in my case -- that the otel Span sitting atop the opencensus trace tree is prematurely closed. Numerous knock-on effects happen at that point too, of course, and are interesting but altogether moot: with that top Span closed, anything up to nearly everything suddenly becomes invalid or inaccurate.

As part of writing a new instrumentation for an as-yet-unsupported framework (finagle), I was testing against some GCP clients I've worked with and the local emulators (in my case, Spanner) and discovered opencensus-shim simply doesn't work as expected when the otel javaagent is loaded and running.

The text was updated successfully, but these errors were encountered:

trask · 2023-08-10T21:59:39Z

I think this was resolved by #4900

dmarkwat · 2023-08-13T17:12:20Z

I think so, yes. For posterity, the related issue is open-telemetry/opentelemetry-java#4900 (in opentelemetry-java, not this project, opentelemetry-java-instrumentation).

dmarkwat added the enhancement New feature or request label Oct 14, 2022

dmarkwat mentioned this issue Oct 14, 2022

Addresses opencensus-shim integration #6875

Closed

dmarkwat mentioned this issue Oct 31, 2022

Addresses opencensus-shim trace issues under otel javaagent open-telemetry/opentelemetry-java#4900

Merged

trask added the needs author feedback Waiting for additional feedback from the author label Aug 10, 2023

dmarkwat closed this as completed Aug 13, 2023

github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Aug 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Addressing opencensus-shim breakage/inconsistencies #6876

Addressing opencensus-shim breakage/inconsistencies #6876

dmarkwat commented Oct 14, 2022 •

edited

Loading

trask commented Aug 10, 2023

dmarkwat commented Aug 13, 2023

Addressing opencensus-shim breakage/inconsistencies #6876

Addressing opencensus-shim breakage/inconsistencies #6876

Comments

dmarkwat commented Oct 14, 2022 • edited Loading

trask commented Aug 10, 2023

dmarkwat commented Aug 13, 2023

dmarkwat commented Oct 14, 2022 •

edited

Loading