Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addressing opencensus-shim breakage/inconsistencies #6876

Closed
dmarkwat opened this issue Oct 14, 2022 · 2 comments
Closed

Addressing opencensus-shim breakage/inconsistencies #6876

dmarkwat opened this issue Oct 14, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@dmarkwat
Copy link
Contributor

dmarkwat commented Oct 14, 2022

Is your feature request related to a problem? Please describe.
The problem I've encountered is the opencensus-shim, when used with the javaagent, misbehaves and doesn't work the same when the javaagent is not present (I'd go so far as to say: it breaks).

tl;dr: due to the otel javaagent's internals and resulting requirements for properly bridging application and agent Contexts, the very specific ApplicationSpanImpl is required to affect changes in the stored Context. This is something integrations like opencensus-shim do not handle correctly, but otherwise correctly conform to the otel interfaces and API requirements. The result is unexpected -- and on its face, unexplainable -- breakage when the javaagent is used.

I've created an instrumentation that addresses this problem, and even wires up a "baseline" test without the java agent to prove out the situation. However, the solution is nothing that pleases me as it requires the use of reflection directly, and since the java agent works just fine as-is, this felt like more of an "enhancement" than a bug...but I'm torn:

What I outline below is the direct result of the otel instrumentation code placing an implicit, undocumented (afaict anyway) requirement that a specific type of io.opentelemetry.trace.Span be passed back to it for normal operation (details in Additional Context), in a world where the interfaces would -- and in most cases I've encountered do -- suffice. Add to that, there's nothing patently wrong with the way integrations like opencensus-shim behave when they conform to the interfaces and work as expected when the javaagent isn't present.

And while that's certainly what I've come to see as a good indicator of "instrumentation required here", in this case it's instrumenting the instrumented API (opencensus-shim) and its associated wrappers themselves. There's no interface to determine if anything is being wrapped and there's no requirement for the Span interface, for example -- in this case OpenTelemetrySpanImpl, specifically -- to delegate all its otel interface methods to the wrapped Span. And in this specific scenario, that would avert this issue entirely if it did so.

So what's the answer? Do we:

  • instrument the code as I've done (or in some other way), messy as it sometimes can be, ever struggling against moving targets without interfaces to adequately inform the instrumentors on how/when to properly delegate/unwrap, etc.?
  • OR, impose/implement the change in the upstream (in this case, opencensus-shim) to delegate all the methods to the wrapped object despite the sufficient conformance to the documented API?

I suppose the latter could have accompanying doc authored for the case and would be preferable when it's under direct control, but is that reasonably always going to be the case?

And, thinking further: I haven't looked too closely at opentracing-shim, but at a glance it looks like it may also fall victim to this problem as well? It's unfortunately a fair bet, and it makes me wonder: are there more out there or otherwise waiting to happen?

Describe the solution you'd like
If I had to pick, on balance, I'd go with option B above: implement the delegation changes in the upstream (in this case, specifically OpenTelemetrySpanImpl in opencensus-shim) and accompany it with documentation/guidance instructing how and when to delegate the calls as part of the standard API.

But, I'm here because I'm definitely interested in what anyone else thinks about this problem, if they've encountered it, and how to solve it. (Or heck: am I completely off my rocker here and this isn't a problem?)

Describe alternatives you've considered
I considered instrumentation of course, as I wrote one to at least solve the acute problem, but I honestly struggled writing it (with an undesirable result no less) for a few reasons, the sum of which weighs it heavily out of favor in this case for me personally:

  • writing wrappers for OpenTelemetrySpanImpl, for example, proved difficult (impossible?) because the interface shares that of the javaagent classes io.opentelemetry... and not application.io.opentelemetry...: bridging that gap did not seem like a straight line to me even after looking at the opentelemetry instrumentations
  • creating method overrides would have been necessary here (for OpenTelemetrySpanImpl certainly) and isn't something the byte buddy Advice method transformers can do, and getting the heavier-weight transformer to work proved challenging (the muzzle guide also dissuaded me heavily from this approach, as well)
  • the final, shortest, simplest result I could achieve using Advice relies on reflection directly and is itself brittle

Additional context
The is a detailed accounting of the problem and the path one needs to follow to see it (from a cold start like I did).

Click here to expand...it's rather long

I tracked the issue down to the following, detailed for posterity and newcomers like myself:

  • As part of javaagent instrumentation, all context management is brokered by AgentContextStorage and in turn all context manipulation is brokered by AgentContextWrapper
  • When performing actions such as Span#storeInContext, the call tree follows through whichever Context instance and, barring some intense overriding of methods, ultimately lands in AgentContextWrapper#with
  • Once there, it traverses the ContextKeyBridges
  • When that happens, the incoming Span is subjected to this test, and in the case of opencensus-shim it fails but results in no errors or other meaningful output

It fails because in opencensus-shim, the incoming Span isn't an ApplicationSpan, but an OpenTelemtrySpanImpl which wraps the ApplicationSpan which was returned from calls to the various otel-instrumented API methods

In my specific case, I created an Otel Span and attempted to use it as the parent of an opencensus span (implicitly occurring via using the GCP Spanner client + opencensus-shim on my classpath):

  • I created an otel span and made it current
  • I then called the spanner client methods to start executing a read against a table which constructs its own spans
  • At this point, opencensus-shim starts building a new Span to track its work, using io.opencensus.trace.Tracer#spanBuilder which uses the current Context as the parent:
    • it correctly pulled the current otel Context out of the instrumented AgentContextStorage as was expected (via OpenTelemetryContextManager), and wrapped it in an oc-shim OpenTelemetryCtx
    • through the call chain to obtain an opencensus Span, this OpenTelemetryCtx was then used to obtain an OpenTelemetrySpanImpl via call to SpanConverter, which is done here
    • this then propagates down to OpenTelemetrySpanBuilderImpl where the parent is attempted to be assigned to the current context, in effect creating that parent relationship and creating an identity for the current span
    • ⚠️ this technically fails due to the aforementioned, but it works because the current context happens to be the parent: a near miss
    • the OpenTelemetrySpanImpl is then constructed, and all is well so far: the span and trace ids are accurate, and the parent is assigned from Context#current()
  • This newly-created Span is then attempted to be scoped and made current via Tracer#withSpan, and this is where the breakdown occurs:

The result: the current Span is never assigned as current and calls to Context#current continue to return the parent context created outside and before the calls to the opencensus-shim realm.

The effect: upon ending spans and closing scopes, it's possible -- and indeed happened in my case -- that the otel Span sitting atop the opencensus trace tree is prematurely closed. Numerous knock-on effects happen at that point too, of course, and are interesting but altogether moot: with that top Span closed, anything up to nearly everything suddenly becomes invalid or inaccurate.

As part of writing a new instrumentation for an as-yet-unsupported framework (finagle), I was testing against some GCP clients I've worked with and the local emulators (in my case, Spanner) and discovered opencensus-shim simply doesn't work as expected when the otel javaagent is loaded and running.

@trask
Copy link
Member

trask commented Aug 10, 2023

I think this was resolved by #4900

@trask trask added the needs author feedback Waiting for additional feedback from the author label Aug 10, 2023
@dmarkwat
Copy link
Contributor Author

I think so, yes. For posterity, the related issue is open-telemetry/opentelemetry-java#4900 (in opentelemetry-java, not this project, opentelemetry-java-instrumentation).

@github-actions github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Aug 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants