Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry Tracing API vs Tokio-Tracing API for Distributed Tracing #1571

Open
cijothomas opened this issue Feb 23, 2024 · 42 comments
Open
Labels
A-log Area: Issues related to logs A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA.

Comments

@cijothomas
Copy link
Member

cijothomas commented Feb 23, 2024

Background

The Rust ecosystem has two prominent tracing APIs: the OpenTelemetry Tracing API (Otel for short), delivered through the opentelemetry crate, and the Tokio tracing API, provided by the
tracing crate. The OTel Tracing API adheres to the OpenTelemetry specification, ensuring alignment with OpenTelemetry Tracing implementations in other languages like C++, Java etc. Conversely, the Tokio tracing ecosystem, which predates
OpenTelemetry, boasts widespread adoption, with many popular libraries already instrumented. The tracing-opentelemetry crate, maintained outside of OpenTelemetry repositories, act as a "bridge", enabling applications instrumented with tracing to work with OpenTelemetry.

The issue

The coexistence of the OTel Tracing API and Tokio-Tracing poses a dilemma, forcing end users to choose between two competing APIs. This situation complicates the decision-making process due to the absence of comprehensive
documentation comparing the two options. A significant concern is the lack of tested interoperability between the APIs, which can result in issues, especially in applications where different layers use different tracing APIs, potentially
leading to incomplete traces. This also impacts the log correlation scenarios as well.

A Comparison with OTel .NET

The OpenTelemetry .NET community encountered a similar challenge when the OTel Tracing API was introduced, as the .NET runtime library (shipped as the
DiagnosticSource package) already had a similar API in place. This issue was resolved through collaboration between OTel .NET maintainers and the .NET
runtime team, leading to the alignment of the .NET runtime's tracing API with the OTel specifications. This approach was later applied to the Metrics API as well. While the decision by OTel .NET to prioritize the .NET Runtime library's
API over its own for tracing/metrics has generally been successful, it has not been without its challenges. Despite declaring stability years ago, OTel .NET has yet to implement certain aspects of the OTel specification fully.

Although the outcomes in the .NET ecosystem might not directly forecast the success of similar efforts in Rust, they provide a valuable reference point.

Options for Consideration

  1. Deprecate Tokio-Tracing: This approach would align Rust with the OpenTelemetry strategies adopted by other languages. However, considering the popularity and active maintenance of the tracing crate in the Rust ecosystem, this path has highest friction and is highly improbable.

  2. Deprecate OTel Tracing: Promoting Tokio-Tracing as the standard could be a feasible option, albeit requiring comprehensive evaluation. This strategy would cause OTel Rust to deviate from its counterparts in other languages.
    Potential alignment of Tokio-Tracing with OTel Tracing specifications could mitigate this concern but necessitates groundwork to identify gaps and propose solutions. Tokio-Tracing maintainers have shown willingness to accommodate
    reasonable changes, pending a clear set of requirements. This option does not eliminate the OTel Tracing API completely, but it'll still remain to compensate for things missing from Tokio-Tracing - only those APIs which are overlapping/competing with Tokio-Tracing needs to be deprecated/removed.

  3. Maintain Both APIs: This alternative emphasizes the importance of ensuring seamless interoperability between the two APIs, allowing users to choose based on preference or specific needs without compromising trace completeness. Achieving this goal requires significant effort to identify and bridge any existing gaps in the interoperability story. Users should be able freely chose between, without worrying about any broken traces.

  4. Do nothing.: OTel Rust has some special accommodations done to help tracing crate (and vice-versa). We can just remove them, and let each crate follow their own destiny. (Highly undesirable state, just listed for completion)

Are there more options? Please let us know in the comments!

Current State

The Rust tracing ecosystem is at a critical juncture. Active discussions between the OTel Rust team and the Tracing Rust team are taking place, with updates and deliberations shared on Cloud Native
Slack
. Interested individuals are encouraged to join the discussion on Slack (or right in this Github issue). All decisions and considerations will be posted on GitHub as well for wider visibility and to gather feedbacks.

Timeline

Resolving this issue is a prerequisite (though not the only one) for declaring the Tracing signal as GA (General Availability) for OTel Rust. Given the goal to achieve Tracing GA (alongside other milestones) soon, it's crucial that this issue is resolved promptly. A tentative deadline to reach a decision on the chosen path forward is set for April 30th, 2024, approximately 2 months from today.

Related issues

#1378 Tracing Propagation.
#1394 (comment)
Broken Trace example : #1690

@cijothomas cijothomas added A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA. A-log Area: Issues related to logs labels Feb 23, 2024
@cijothomas cijothomas pinned this issue Feb 23, 2024
@cijothomas
Copy link
Member Author

Tagging @open-telemetry/rust-approvers
@jtescher as tracing-opentelemetry maintainer
@davidbarsky as tracing maintainer

@TommyCpp
Copy link
Contributor

If we were to use tracing as the API. This is the deviation between existing tracing API and Otel tracing API

  • - means no support today
  • ? means no support but pontential way to implement it
  • + means supported
Feature Tokio Tracing
TracerProvider
Create TracerProvider -
Get a Tracer -
Get a Tracer with schema_url -
Get a Tracer with scope attributes -
Associate Tracer with InstrumentationScope -
Safe for concurrent calls +
Shutdown (SDK only required) N/A
ForceFlush (SDK only required) N/A
Trace / Context interaction
Get active Span +
Set active Span +
Tracer Optional
Create a new Span +
Documentation defines adding attributes at span creation as preferred +
Get active Span +
Mark Span active +, same as tracing enter
SpanContext
IsValid ?, can add as event or metadata
IsRemote ?, can add as event or metadata
Conforms to the W3C TraceContext spec -
Span Optional
Create root span +
Create with default parent (active span) +
Create with parent from Context +
No explicit parent Span/SpanContext allowed +
SpanProcessor.OnStart receives parent Context -
UpdateName -
User-defined start timestamp -
End +, when span closes(not exit)
End with timestamp -
IsRecording ?, can add attributes or metadata
IsRecording becomes false after End ?
Set status with StatusCode (Unset, Ok, Error) ?, can add as event or metadata
Safe for concurrent calls +
events collection size limit -
attribute collection size limit -
links collection size limit -
Span attributes Optional
SetAttribute +
Set order preserved -
String type +
Boolean type +
Double floating-point type +
Signed int64 type +
Array of primitives (homogeneous) +
null values documented as invalid/undefined N/A
Unicode support for keys and string values +
Span linking Optional
Links can be recorded on span creation ?, can add as metadata
Links can be recorded after span creation ?, can add as metadata
Links order is preserved ?, depends on subscriber
Span events
AddEvent +
Add order preserved ?, depends on subscriber
Safe for concurrent calls +
Span exceptions
RecordException ?, can add as events w/ special flags
RecordException with extra parameters ?, can add as events w/ special flags, see tracing-error
Sampling Optional
Allow samplers to modify tracestate N/A
ShouldSample gets full parent Context N/A
Sampler: JaegerRemoteSampler N/A
New Span ID created also for non-recording Spans
IdGenerators ?, not supported directly but can add via subscribers
SpanLimits -
Built-in SpanProcessors implement ForceFlush spec N/A
Attribute Limits -
Fetch InstrumentationScope from ReadableSpan -

@hdost
Copy link
Contributor

hdost commented Feb 24, 2024

From the metrics perspective exemplars are also something to take into account.

@hdost
Copy link
Contributor

hdost commented Mar 26, 2024

As requested in the community meeting:
I am partial towards Option 2.
Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would like say that we should probably try to see if it's not possible to improve the inter-compatibility as people will still try to use it directly.

Questions that are open from my perspective:

  • What are the minimal set of features we feel like we'd want to maintain?
  • Are the tokio team amenable to supporting such features?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

@ramosbugs
Copy link
Contributor

I am partial towards Option 2.
Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

As a heavy user of direct OpenTelemetry instrumentation (e.g., using SpanBuilder a lot, along with span events and span links) in the backend of my AWS Lambda-based web app, I'm nervous reading this. There are almost 800 calls to set_attribute alone in my codebase, and moving off of a deprecated API used this heavily would be a major undertaking.

Which interfaces specifically would be deprecated?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

I suspect a lot of other OpenTelemetry users are also doing so in private repositories, so I agree that it's hard to measure. I would caution against inferring much from these public GitHub usage stats.

@TommyCpp
Copy link
Contributor

Which interfaces specifically would be deprecated?

We don't know exactly yet. The idea is to bridge the gap between the tracing and OTEL spec using custom API but if something tracing supports we want to deprecate in favor of that. But we are still debating ideas and nothing has been decided yet. Thus, any feedback is greatly appreciated!

@lalitb
Copy link
Member

lalitb commented Mar 27, 2024

I vote for option 2, as there are challenges with other options:

  • Option 1: Tokio-Tracing is deeply ingrained in the Rust ecosystem with wide adoption across many libraries and applications. It's practically not feasible to ask community to migrate from tokio-tracing to Otel tracing API.
  • Option 3: Keeping OpenTelemetry API, while also supporting tracing-opentelemetry would increase maintenance effort., add confusion to the developers on which API to choose, and interop across both would be a challenge (as seen here)
  • Option 4: This approach, while minimizing immediate effort, overlooks the broader implications for the Rust ecosystem's health and future growth.

Going with Option 2, we also need evaluation for introducing an extension API within OpenTelemetry. This is to effectively bridge the existing gaps between the OTel specifications and Tokio-Tracing's functionalities (e.g, Baggage support, Propagators).

@lalitb
Copy link
Member

lalitb commented Mar 27, 2024

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Direct consumption of opentelemetry-api could be for traces, metrics and logs, and I agree it is really hard to get the actual statistics for "traces" only :)

@julianocosta89
Copy link
Member

OpenTelemetry comes from OpenCensus and OpenTracing merge.
The deprecation of those 2 parent projects took some time, but it happened.

IDK if I have a saying because I don't maintain the OTel Rust, but I'd vote for Option 1 and invite the maintainers of tokio-tracing to join the OTel project as maintainers/approvers.
Basically continue what they are doing, but under CNCF umbrella and OpenTelemetry as main project.

IDK how much tokio-tracing follows the OTel specification and semantic convention, but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust.

  • tokio-tracing for traces
  • OTel for metrics, logs and profiling ...

I'm biased but I see OTel as the future for Observability signals.

@lalitb
Copy link
Member

lalitb commented Mar 27, 2024

Tagging for more inputs.
@hawkw, @davidbarsky, as tokio-tracing maintainers
@jtescher as tracing-opentelemetry, and opentelemetry-rust maintainer

@TommyCpp
Copy link
Contributor

TommyCpp commented Mar 27, 2024

but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust

Just to provide context here. I think if we move to tracing as the API for Otel Rust. We could unify all 3 signals under tracing.

  • Logs, per spec we didn't define a new API in Otel Rust. We have bridge for log facade and tracing facade
  • Metrics, I believe tracing has some WIP to support metrics collection.

@jtescher
Copy link
Member

I haven't had much time recently to work on open source, but my perspective is that option 3 is likely optimal in the near term. I suspect that expressing the full otel API via tracing would be difficult and likely require some changes to the underlying library which would be orthogonal to their current designs and goals (trace ids on span creation, metrics in general, etc). It may be possible to express them via a large range of special fields but it seems likely that it would be worse than the current two API confusion. Someone could do a proof of concept trying to unify them though to be sure.

Option 3 could be done via clearer purposes for each API (e.g. low level "full" api via otel, or high level "limited but ergonomic and user-friendly" api via tracing macros) and examples of suggested architectural patterns (e.g. otel API "between" application boundaries, tracing within applications and crates, or similar sets of suggestions). But as already mentioned here it is somewhat more cumbersome and confusing than a consistent single API. Being not fully in control of the otel spec or the log/tracing ecosystems means the rust otel stack finds itself somewhat stuck in the middle.

@davidbarsky
Copy link

I'm on vacation, so I'll be brief and try to expand next week/summarize my thoughts from Slack: pulling a .NET (paying attention to the intent, not the letter of the spec) is very much possible, down to the fact that propagators remained in a dedicated OTEL library for 2.5 years. I think tracing could have a native notion of propagators, but I don't think we have the bandwidth to figure that out/I'd rather wait for Tower to reach 1.0 before making decisions on that front.

@cijothomas
Copy link
Member Author

down to the fact that propagators remained in a dedicated OTEL library for 2.5 years

In practice, this is still the case! So is Baggage. OTel .NET still maintains the API for these things, that are not covered by the .NET Runtime API. If we go with option2 here, I'd expect that we'll only eliminate those APIs for which there is a clear equivalent in tracing.

I'll also be on vacation for ~1 week. Once back, I'll write down more details on how option2 could potentially look like. I didn't want to spend too much time on exploring any of the options, without observing which one the community as a whole would lean to.. It does not look like there are any clear winners so far, but part of the reason could be due to lack of specifics/details on what would each option really entails.

I'm not yet in a position to strongly support any option so far, however, I'll take a stab at exploring option 2 further.

@hdost
Copy link
Contributor

hdost commented Mar 30, 2024

I guess I'll try to take a look at how we could go for option 3.

From the top of my head use cases to look at:

  • Apps + Libs Instrumented with tracing + tracing-opentelemetry + an exporter
    • this is the basic tracing to OTel case.
  • Apps + Libs Instrumented with opentelemetry + opentelemetry-appender-tracing
    • this is the basic Otel to tracing case.

Then some variant of the two where both tracing and Otel are used for instrumentation.

Those will be "advanced cases", but honestly it might be more common than one might think.

@cijothomas
Copy link
Member Author

cijothomas commented Apr 9, 2024

Comment/Discussion from Community Meeting for option3:

Test to validate the option3
A -> B -> C

A - uses tracing for producing span
B - uses otel tracing api for producing span
C - uses tracing for producing span

3 spans
SpanA
SpanB (parent=SpanA)
SpanC (parent=SpanB)

It may not be feasible to ask users to use same api for all 3, as they may not own/control some of them. eg: B could be reqwest crate.

#1378 (comment) shows an examples where logging and tracing (distributed tracing aka spans) are used, and correlation is broken when tracing crate is used to produce span, instead of otel tracing api.

@cijothomas
Copy link
Member Author

down to the fact that propagators remained in a dedicated OTEL library for 2.5 years

In practice, this is still the case! So is Baggage. OTel .NET still maintains the API for these things, that are not covered by the .NET Runtime API. If we go with option2 here, I'd expect that we'll only eliminate those APIs for which there is a clear equivalent in tracing.

I'll also be on vacation for ~1 week. Once back, I'll write down more details on how option2 could potentially look like. I didn't want to spend too much time on exploring any of the options, without observing which one the community as a whole would lean to.. It does not look like there are any clear winners so far, but part of the reason could be due to lack of specifics/details on what would each option really entails.

I'm not yet in a position to strongly support any option so far, however, I'll take a stab at exploring option 2 further.

Took some time to get to this due to other priorities, but here are more details on one possible way to go with option2, including a prototype:
#1689

@diurnalist
Copy link

diurnalist commented Apr 30, 2024

👋🏻 I am not a Rust developer so am coming from a very different perspective. My take is that, to my knowledge, every other ecosystem has opted for Option 1 long-term, Option 3 near-term. Specifying the API in OTel was (I assume) a large effort and we have seen the API evolve as developers have battle-tested it and provided feedback (e.g., lack of a synchronous gauge instrument, which is now in the spec.) My impression is that spec evolution is a pretty collaborative process, which is nice to observe.

In my view it would be a mistake to align on pre-existing instrumentation conventions as OTel's mission has been to provide a standard API that instrumentations across languages/systems can adhere to. This is particularly important as it provides a path for libraries to provide instrumentation hooks to, e.g. automatically generate traces and metrics as part of their own business logic, kind of like bpf kernel tracepoints or UDST. And those hooks are written according to a wider specification and hence less vulnerable to governance issues that tend to come up in external libraries from time to time.

In the Go OTel SDK there are several "bridge" interfaces that help to close the gap b/w the OTel API and existing instrumentation libraries, e.g., the opencensus bridge. Perhaps this would be a way to pave the path towards wider OTel API adoption.

/$0.02 🙇🏻

@cijothomas
Copy link
Member Author

As requested in the community meeting: I am partial towards Option 2. Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would like say that we should probably try to see if it's not possible to improve the inter-compatibility as people will still try to use it directly.

Questions that are open from my perspective:

  • What are the minimal set of features we feel like we'd want to maintain?
  • Are the tokio team amenable to supporting such features?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

@hdost After re-reading this, I am not entirely sure if I understand the part where you said "I mentioned during the meeting we will still need to have "some" API anyway" I think @TommyCpp also mentioned this (in metrics context though).

If you look at the prototype, it has tracing sdk only! No tracing api. i.e there is nospan/span.start()/end() etc. We'll need opentelemetry crate itself, where we need to expose APIs for things not covered by tokio-tracing. Eg: Baggage, Propagators, Metrics, LogBridge etc.
https://github.com/cijothomas/opentelemetry-tracing/blob/main/src/opentelemetry_sdk.rs

Could you check this. We can discuss in the next SIG call and figure out what are the gaps in our understanding.

@cijothomas
Copy link
Member Author

cijothomas commented Jul 30, 2024

Update from July 30 OTel Rust Community Meeting:

We recognize it’ll be a while before this can be fully sorted out. We continually see issues - upgrades are hard, otel-demo is broken, and users are unsure which versions are compatible and the list goes on.

To mitigate the short/medium term pain, while also being not-too-far from the long term plans, it was decided to offer tracing integration in the OpenTelemetry-SDK itself under a feature flag. This feature will start recognizing tracing Spans without the need for bridging via tracing-opentelemetry. In short, a lot of functionalities from tracing-opentelemetry gets absorbed into the SDK itself. This is something that'll be needed in option 2 and 3 anyway, so this is not drifting off too far from 2,3 which are currently the leading candidates.

Note that this does not support interoperating both APIs for spans - either use tracing or otel tracing api, but mixing them up won't work. If option 3 is settled on, then this will need to be solved, but not part of the immediate release.

This does not deprecate tracing-opentelemetry (We don’t own it to deprecate), but once the above is ready, tracing-maintainers can decide if they want to deprecate tracing-opentelemetry.
And this does not deprecate otel-tracing either.

@TommyCpp will make the above happen and we are targeting to include it in the next release (~Aug 30)

luisholanda added a commit to luisholanda/kiso that referenced this issue Sep 1, 2024
After some experiments using our custom tracing implementation, I
reached the conclusion that it isn't worth the extra complexity. Most
of the ecossystem is using tokio's tracing implementation, and the Rust
OpenTelemetry WG are evaluating replacing their implementation with only
the tracing layer[1].

Thus, I decided to replace it with a tracing integration setup. The
setup is pretty standard, but the implementation uses a custom Layer
to pass data from tracing to OpenTelemetry, continuing to use the
background worker to do most of the heavy lifting. This makes the
hot path that runs in the application loop to be more efficient.

The implementation also uses a custom context to allow for faster
retrieval of tracing information for propagation.

The result is that kiso's users don't have to worry about OpenTelemetry
crates unless they need dynamic attributes or links. Everything else is
handled by the tracing crate.

This commit contains only the necessary code for the migration. There
are still some things to sort out, and primarily performance
improvement to implement. These will be done in other patches as this
one is already too big to properly review.

[1]: open-telemetry/opentelemetry-rust#1571
@cijothomas
Copy link
Member Author

Update from July 30 OTel Rust Community Meeting:

We recognize it’ll be a while before this can be fully sorted out. We continually see issues - upgrades are hard, otel-demo is broken, and users are unsure which versions are compatible and the list goes on.

To mitigate the short/medium term pain, while also being not-too-far from the long term plans, it was decided to offer tracing integration in the OpenTelemetry-SDK itself under a feature flag. This feature will start recognizing tracing Spans without the need for bridging via tracing-opentelemetry. In short, a lot of functionalities from tracing-opentelemetry gets absorbed into the SDK itself. This is something that'll be needed in option 2 and 3 anyway, so this is not drifting off too far from 2,3 which are currently the leading candidates.

Note that this does not support interoperating both APIs for spans - either use tracing or otel tracing api, but mixing them up won't work. If option 3 is settled on, then this will need to be solved, but not part of the immediate release.

This does not deprecate tracing-opentelemetry (We don’t own it to deprecate), but once the above is ready, tracing-maintainers can decide if they want to deprecate tracing-opentelemetry. And this does not deprecate otel-tracing either.

@TommyCpp will make the above happen and we are targeting to include it in the next release (~Aug 30)

This work is delayed, and won't be part of the coming release (expected in a day). Will post new ETA for this soon.

@nastynaz
Copy link

@cijothomas is there an update on when it will be implemented?

@cijothomas
Copy link
Member Author

@nastynaz Sorry, we don't have a new ETA yet. @TommyCpp got busy with work commitments, and other maintainers are busy pushing metrics, logs to RC/stable.

@scottgerring
Copy link
Contributor

I'd love to see smooth integration between (tokio's)tracing and OTEL. I recently sunk some time into setting up OTEL+Rust and went down a fairly substantial rabbit hole probing the different combinations of pieces to get things working properly for the different combinations of rust libraries & instrumentation options one might expect.

Which is to say, I think option 3 is realistically what's happening, right now. Option 2 / deprecating OTEL in favour of tracing is bunk, because tracing doesn't cover the same set of use cases.

I'm going to take the time to fully enumerate the "rough edges" between tracing/OTEL as it stands and report back. I think a pragmatic next step here would be providing concrete guidance on how to set them up together, and at the same time trying to smooth the edges out. @jtescher 's comment here resonates:

Option 3 could be done via clearer purposes for each API (e.g. low level "full" api via otel, or high level "limited but ergonomic and user-friendly" api via tracing macros) and examples of suggested architectural patterns (e.g. otel API "between" application boundaries, tracing within applications and crates, or similar sets of suggestions)

@cijothomas
Copy link
Member Author

@scottgerring 3 is what is happening today, and that is the issue. If two of them are in such a way that one is layered on top of other, then that is totally fine (This is what Julian's comment is referring to as well, from what I can tell.)
But they are now two independent stacks with own understanding of Context - this is not something we want.

I'm going to take the time to fully enumerate the "rough edges" between tracing/OTEL as it stands and report back

Thanks ❤️ ! Really appreciate the help.

@cijothomas
Copy link
Member Author

@scottgerring https://cloud-native.slack.com/archives/C069U408RNW is a slack channel created dedicated to discussing this. Feel free to join, and use that for discussions as well.

@scottgerring
Copy link
Contributor

Hey @cijothomas thanks for the enthusiastic welcome!
I'll jump into the slack and start documenting my experience with the libraries this week.

If two of them are in such a way that one is layered on top of other, then that is totally fine

In my (likely rather incomplete) mental model, it feels like OTEL should sit "beneath" tracing, in the case where tracing is in-use, whilst still being able to stand on its own when it is not; I think this is what you and Julian are thinking too?

@scottgerring
Copy link
Contributor

scottgerring commented Nov 22, 2024

I took advantage of my "beginners mind", and ported an application we have setup with OTEL-only integration to use tracing-opentelemetry here - otel setup here. I hope this will provide some concrete data on points that could be addressed when bringing the bridge "in-house".

I am using:

This should be indicative of the sort of common stack we should expect folks to use.

flowchart TD
    A[Actix App] -->|Logs and spans| B[Tracing]
    B -->|Subscriber: tracing-opentelemetry| C[tracing-opentelemetry]
    B -->|Subscriber: opentelemetry-appender-tracing| D[opentelemetry-appender-tracing]
    C -->|Pushes Traces| E[OpenTelemetry Traces API]
    D -->|Pushes Logs| F[OpenTelemetry Logs API]
    E -->|Forwarded to| G[OpenTelemetry Collector]
    F -->|Forwarded to| G
Loading

This then feeds into Jaeger and Datadog to inspect.

Observations

1. Functional

  • tracing-opentelemetry: doesn't emit logs, but rather span events. These are being deprecated
  • tracing-opentelemetry: The span events it emits end up looking "malformed" to Datadog. I have no idea where this is happening along the way, but it's probably not such a big deal as a future implementation shouldn't use span events anyhow, and it looks fine in Jaeger, so it must be further along the chain.
  • opentelemetry-appender-tracing: Logs do not appear to correlate with traces; there's no correlation information on them.
  • general: If you have things that are using the logs crate directly - rather than tracing - it's not clear to me how this can be seamlessly enriched with correlation information. The water is further muddied as tracing provides feature flags log and log-always which will push tracing-produced events back out through the log crate, which will get weird when a separate effort is made to ensure log create messages end up in OTEL.

Usability

  • tracing-opentelemetry: Straight off the bat, the "close but not the same" versioning makes it easy to go down a slight-mismatch rabbit hole; looks like i'm not the only one. Similarly the difference between the opentelemetry-rust tags here in GitHub and what is released in Cargo threw me off.
  • general: At the moment it feels like you should combine both of the subscribers like I have, but this doesn't really work - due to the lack of correlation of log messages, and duplication of all logs as both uncorrelated log messages and span events
  • general: Overall docs in general feel very incomplete and it's easy to waste a lot of time on something that (I think) should be very boilerplate - "put all my tracing data into OTEL idiomatically, thanks". Things like "here's how to filter log level" should be IMHO prominently documented in the getting started

Conclusion

Based on this, I reckon:

  • A single crate should provide a bridge for both traces and logs into OTEL. It should not use span events
  • If this was released from this monorepo, it would make the versioning much simpler / less error prone
  • Some effort should be sunk into a complete example of this showing the important levers
  • The two existing crates tracing-opentelemetry and opentelemetry-appender-tracing could then be marked deprecated
  • Making spans work whether using tracing or OTEL to manipulate them in the same app is probably going to be hard. I wonder if that's a hard requirement?

I suspect this is a rather long-winded way of ending up on the same page as everyone else on the thread, but I hope this can serve as a good summary :)

Finally - enormous thanks to @julianocosta89 for entertaining my unending OTEL questions these last 2 weeks ❤️

@lalitb
Copy link
Member

lalitb commented Nov 22, 2024

Thanks for your findings @scottgerring . Just a comment on single crate:

A single crate should provide a bridge for both traces and logs into OTEL. It should not use span events

If the motivation for having a single subscriber/bridge for traces and logs is to achieve correlation between logs and traces, I believe this may not be absolutely necessary. This was explored in #1394, specifically comment here - #1394 (comment). One potential constraint is that both subscribers would need to depend on the same version of otel and otel-sdk. However, this limitation could be addressed once OpenTelemetry reaches a stable release.

That said, having a single subscriber would indeed be a good or ideal solution. However, I wonder if it might introduce any additional burden for high-performance logging needs that do not require span context or correlation. ( I may be incorrect, so need to investigate further).

Also, there is plan to make opentelemetry-appender-tracing RC/stable soon. So, may have to reconsider the plan if moving to single subscriber model.

@cijothomas
Copy link
Member Author

It should not use span events

Lets not make that conclusion. It should be up-to users to decide if they prefer SpanEvents or Logs (or both).
(I understand there were/are proposals to remove SpanEvents, but no such decision has been made, as it is part of a stable spec)

@cijothomas
Copy link
Member Author

@scottgerring Thank you for your analysis. One thing is not clear to me - which of the option listed are you suggesting? 2 or 3?

@jpramosi
Copy link

I think the decision to go with option 2 or 3 should be left to the maintainers and contributors. The first and last options are probably not feasible or likely to happen.
I guess choosing the second option would be probably make the project easier to maintain. If the tokio tracing team is willing to make some changes for Opentelemetry, it should be no problem, except for the users who have to refactor their code. Personally, I don't mind as long as there is a good migration guide.
Option 3 would be ideal, but perhaps maintaining both APIs could be burdensome for the Opentelemetry team in the long run.

@scottgerring
Copy link
Contributor

If the motivation for having a single subscriber/bridge for traces and logs is to achieve correlation between logs and traces, I believe this may not be absolutely necessary. This was explored in #1394, specifically comment here - #1394 (comment). One potential constraint is that both subscribers would need to depend on the same version of otel and otel-sdk

This is a great point! It also seems reasonable to expect that many users may use different exporters for different signals, which may place different version constraints on the otel-sdk pieces.

Lets not make that conclusion. It should be up-to users to decide if they prefer SpanEvents or Logs (or both).
👍

For what it's worth, my opinions on options - 1 and 4 seem like non-starters for reasons that have been extensively enumerated.

Two - the gap is big as seen in @TommyCpp's message above. I think tracing has a much smaller remit - "in process tracing" primarily (I just discovered tracing-distributed but it's not been touched for 3 years, so ...) . I would expect this would be a bit weird usability wise.

Three - I think this is the realistic option. There will be overlap, but it would support the existing community where they are, and allow new users to choose to use OTEL directly if they wish, without carrying an extra layer of indirection, via tracing. What would this mean in practice? I can see:

  • Adding another crate in this monorepo for collecting traces from tracing. This'd subsume tokio tracing's OTEL subscriber, but provide more direct control to help resolve issues like dual-library span and log correlation. I guess it would also continue to support span event's - @cijothomas ?
  • Get the existing tracing log subscriber here to emit correlation data. I think this is dependent on owning the "tracing subscriber" here too (??)

To be clear - I don't think my opinion should carry much weight - i'm very new to this long running discussion, and am coming at this purely from "it'd be cool if this played nicely together" perspective. I am happy to contribute in whatever way is useful to push whichever solution is chosen along.

In the meantime - have a good weekend all!

@cijothomas
Copy link
Member Author

continue to support span event's - @cijothomas ?

Whether tracing events get mapped to SpanEvents or LogRecord - this is something I am not worried at all. We can provide a switch and let user decide which one they prefer - more than the instrumentation side, this choice is often determined by backends - some backends like Jaeger can show SpanEvents, but they don't support LogRecord.

@cijothomas
Copy link
Member Author

@scottgerring

One thing (In my view, this is likely the biggest thing to check out!) that we need to explore further is the Context part. tracing has Registry and own ways of propagating in-proc context, but OTel has its own way of doing this - without some consolidation here, it'll make it very difficult to support features like Baggage, Exemplars in the future.

For example, a user at the beginning of an incoming request puts UserId into baggage. Then tracing is used to start a span. Then an async code is executed. Within that, tracing's span is known to tracing as the current span, but unless user does something extra, OTel Baggage won't flow into there. Now, if user does histogram.record().. inside the async code - the metrics sdk need to find the current context to get traceid and baggage - if it checks tracing's context - it'll get tracing spans traceid. if it checks OTel context, it'll get Baggage..... This is something we need to spend some time exploring in detail to ensure the direction we take will not make future scenarios broken.
(Note: Exemplars are not yet implemented here. Also Baggage's W3C status is only CandidateRecommendation today, but expecting to reach Recommendation state soon, and we'll see a lot more user cases with Baggage.)

Another Context related thing - is the suppression flag #1330 has made some attempts and has listed few challenges - I think the root issue there also is the lack of common agreement of what is the Context as we have tracing context and OTel context.

@scottgerring
Copy link
Contributor

scottgerring commented Nov 25, 2024

@cijothomas Understood - and thanks for taking the time to detail this. I'll try jump on the SIG call tomorrow for a quick chat and to see if there's any way I can make myself useful here.

@bantonsson
Copy link
Contributor

@cijothomas I've been talking a bit with @scottgerring about his woes when building the sample application, and my own when building a different sample. This has led me to dig around a lot in the current bridge implementations and various POCs that are referenced here.

I think that the missmatch between the goals of OpenTelemetry and Tokio Tracing is too big for one of them to replace the other in any straight forward way, so option 3 with great interop between the APIs seems the logical path forward.

The idea that I've been toying with (not unlike some of the proposals here) is to treat Tokio Tracing as two separate things:

  1. Context Propagation: This ensures that the current OpenTelemetry Context flows with the Tokio Tracing spans (this is what would be needed for getting the trace/span id into logs if needed)
  2. Trace Generation: Generate OpenTelemetry traces/spans from Tokio Tracing spans, and vice versa.

To enable this I propose something similar to the Context Storage in Java. This would allow a bridge to implement a ContextStorage mechanism that uses the storage in Tokio Tracing to access and activate the current context, that could then be used by the normal OpenTelemetry APIs.

I've not fully figured out all the constraints on OpenTelemetry contexts within Tokio Tracing spans (will they be closed with the span they were created in etc.), but I'm planning to spend some time to prototype this.

Also, this would allow the complete separation of the OpenTelemetry tracing logging appender, and the OpenTelemetry tracing tracing bridge (is that the correct name?), and you could pick and choose to enable context propagation if necessary.

Let's discuss in the SIG call tomorrow.

@lalitb
Copy link
Member

lalitb commented Nov 25, 2024

To enable this I propose something similar to the Context Storage in Java. This would allow a bridge to implement a ContextStorage mechanism that uses the storage in Tokio Tracing to access and activate the current context, that could then be used by the normal OpenTelemetry APIs.

+1 to providing support to use/plugin external Context, irrespective of the option we go with. This is also somewhat specified in the specs:

Languages are expected to use the single, widely used Context implementation if one exists for them. In the cases where an extremely clear, pre-existing option is not available, OpenTelemetry MUST provide its own Context implementation. Depending on the language, its usage may be either explicit or implicit.

Also, opentelemetry-cpp has similar implementation - https://github.com/open-telemetry/opentelemetry-cpp/blob/31956f82ff990a870d2953c722666a782f672c35/api/include/opentelemetry/context/runtime_context.h#L155. It implements thread-local Context as default, but allow users to bring their own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-log Area: Issues related to logs A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA.
Projects
None yet
Development

No branches or pull requests