Standardize Server-Timing: traceparent "propagator" across vendors #3811

Open
johnbley opened this issue Jan 10, 2024 · 22 comments
Labels
spec:context Related to the specification/context directory triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned

Comments

@johnbley
Member

What are you trying to achieve?

Multiple otel vendors have used HTTP Server-Timing headers to propagate
server-side instrumentation context back to client instrumentation. I
would like the otel specification to canonicalize the names, formats,
and configuration options for this, and for the various otel implementations
to accept donated implementations of this concept.

Additional context.

Client-side instrumentation (in the sense of web or mobile apps) may set outbound context
via http headers which may be received by server-side instrumentation. However, there are a few
cases where this breaks down:

  • users that want to keep a trust boundary between these domains and don't want
    untrusted clients to influence the way their server-side instrumentation behaves
  • initial page loads and resource loads in browsers (which javascript instrumentation
    can't influence)
  • users that don't want the added complexity/overhead of CORS preflight from browsers
    caused by adding headers to fetch/xhr requests

Multiple otel vendors have landed on a solution to the second point above, by using Server-Timing
response headers generated by server-side instrumentation and received by client-side instrumentation.

A few links for your reference:

Server-Timing response headers are keyed to a name (conceptually it could be used like
"app=400, db=300, env=prod3"). Several otel vendors/contributors have independently used this in
the fairly obvious way, where the key used is traceparent and the value is the full traceparent-format
string. A complete example would be:

Server-Timing: traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
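
To make the header format above concrete, here is a minimal sketch of producing and reading such a value. It is not taken from any vendor's implementation: the names `TraceContext`, `format_server_timing`, and `parse_server_timing` are hypothetical, and the parser relies on the assumption that a traceparent value never contains a comma or semicolon, so it splits naively instead of implementing full Server-Timing quoted-string parsing.

```python
import re
from typing import NamedTuple, Optional

# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):  # hypothetical helper type
    version: str
    trace_id: str
    span_id: str
    flags: str


def format_server_timing(ctx: TraceContext, metric: str = "traceparent") -> str:
    """Build a Server-Timing header value carrying one trace context."""
    return f'{metric};desc="{"-".join(ctx)}"'


def parse_server_timing(
    header_value: str, metric: str = "traceparent"
) -> Optional[TraceContext]:
    """Return the context carried by `metric`, or None if absent or malformed."""
    # Metrics are comma-separated and parameters semicolon-separated.
    # Naive splitting is only safe because of the assumption stated above.
    for entry in header_value.split(","):
        name, *params = (part.strip() for part in entry.split(";"))
        if name != metric:
            continue
        for param in params:
            key, _, raw = param.partition("=")
            if key.strip().lower() != "desc":
                continue
            match = _TRACEPARENT_RE.match(raw.strip().strip('"'))
            # All-zero trace and span ids are invalid in W3C Trace Context.
            if match and set(match.group(2)) != {"0"} and set(match.group(3)) != {"0"}:
                return TraceContext(*match.groups())
    return None
```

In a browser, the already-parsed entries are exposed as `PerformanceServerTiming` objects (with `name` and `description`), so client instrumentation would only need the traceparent validation step, not the header splitting.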

Some existing examples from around the otel universe:

Each one uses the exact same "propagation" concept (traceparent value format, traceparent is the key name in
the server-timing header). They do differ in configuration/setup, and they also differ in client usage -
for example, Microsoft's product (to my knowledge) uses it on browser page load to set the actual trace context
for the page load, while Splunk clients add the server-side trace context as a trace link to the appropriate
client-side http client span. In my opinion the specification can sidestep this issue (directed usage of the propagated
context) for now, or recommend making it configurable.

Questions towards a specification

  • What is this? A "propagator"? In this sense a user could set their configured propagators for
    server-side instrumentation to be tracecontext,baggage,servertiming and the servertiming propagator
    would propagate back to the client?
  • If it's not a propagator, how would it fit into the spec and how would it be
    configured on/off (e.g., environment variables)?
@johnbley johnbley added the spec:context Related to the specification/context directory label Jan 10, 2024
@t2t2

t2t2 commented Jan 10, 2024

FYI @cedricziel as you expressed interest in writing a proposal for this in slack before

@mmanciop

Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

@johnbley
Member Author

Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes. 😞

@cedricziel

Seconding @mmanciop here. We would love to see a sustainable and supported way to communicate server context to client side technology.

Server-Timing is widely used even beyond the implementations mentioned and I think OTel would benefit a lot from a specification of using it for the purpose of forwarding context to clients.

@jpkrohling
Member

Talk about timing! I was just discussing this yesterday with a few other folks and even opened this here:
w3c/trace-context#556

Here are a few notes based on my investigation so far:

  • we might want to use traceresponse instead of traceparent, which is defined in a draft of Trace Context. It's almost the same payload, except that it uses child-id instead of parent-id in one of the fields
  • on the OTel SDK side, we might need to split the notion of propagators that are sending data forward (request propagators), and propagators that are sending data back to the callers (response propagators). Otherwise, we'll send traceparent back to clients and traceresponse to servers. I'm working on a draft proposal for this and will be opening an OTEP soon.
  • a second OTEP would be a change to the SDK spec, so that all SDKs can implement a "backward propagator" (response propagator) for Server-Timing + traceresponse.
  • I believe that the W3C Trace Context WG should revisit the decision about the header name for traceresponse before stabilizing the draft, favoring the use of Server-Timing with a metric name "traceresponse" instead of a new "traceresponse" header. I gathered some evidence of usages of Server-Timing for our purposes in the linked W3C Trace Context issue, and I'm happy to see that @johnbley's research corroborates it.

@johnbley
Member Author

I like all of what @jpkrohling has to say. I like the idea of using Server-Timing: traceresponse since it is semantically clearer (again, if the client wants to use this as a parent, more power to it). I also really like the idea of a spec-level differentiation between request propagators and response propagators (though maybe not a config-level differentiation - we have enough environment variables as it is). It seems like it would enable other areas of innovation around sampling approaches or overall coordination among instrumentation code.

@mmanciop

mmanciop commented Jan 12, 2024

In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes.

To explain the impact to others landing here: ServerTiming is the only known way (AFAICT) in the RUM industry to reliably correlate document (page) loads or resource (pre)fetches with distributed traces, as the distributed tracer can inject the ServerTiming header in outgoing responses with values that represent the active trace context. And Safari is not playing ball as of Jan 2024 :-)

Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

@jpkrohling
Member

jpkrohling commented Jan 15, 2024

Great point, @mmanciop. I was having problems understanding why we couldn't do it with our current solutions until @cedricziel showed me this diagram he created:

[Image: diagram created by @cedricziel]

@jpkrohling
Member

jpkrohling commented Jan 15, 2024

As mentioned in a previous comment, I was getting ready to propose a spec change related to this, and here's the draft I had. Note that I was breaking the task down into smaller chunks, the first being to expand the notion of propagators so that we define what a "client propagator" is. The next one, based on the outcome of the W3C Trace Context issue I listed earlier, would be to define the first client propagator based on traceresponse (either as its own header, or as a metric of Server-Timing).


When working with client-side instrumentation, such as the ones being developed under the Client Instrumentation SIG, there’s currently no reliable way to obtain the trace context or any references to the trace generated by the backend during the initial document request on the client. While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span), other scenarios might still be hard to implement. For instance, the response of a backend might cause a re-render of a UI component, and currently, it’s not possible to link the trace related to that re-render to the root span of the backend trace unless the trace ID has been created by the frontend and reused in the backend.

This spec change proposal enhances the concept of propagators to differentiate between “backward propagators” (or response propagators) and “forward propagators” (or request propagators):

  • Our current propagators are what’s then going to be called “forward propagator”, given that they are intended to propagate the context to the next steps in the call chain. This is typically added to request headers, but we want to avoid the term “request”, as this is not bound to HTTP requests.
  • Backward propagators are a new concept, propagating the context back to the caller, to the previous step in the call chain. This is typically added to response headers in an HTTP scenario.

Without this differentiation, implementing context propagation to clients would result in headers being sent in request and response payloads where they are not intended, which might cause ambiguity, conflicts, and increased payload size. For example, an application configured with the TraceContext propagator and a new hypothetical ClientPropagator might end up sending the following headers both on all outgoing requests to downstream services and on responses to callers:

Server-Timing: traceresponse;desc=00-123-456-01
traceparent: 00-123-789-01

This spec change is agnostic to the payload and relates only to enhancing the definition of propagators. The payload that would be used in the first recommended backward propagator is still under definition by the W3C Trace Context working group and is, therefore, out of scope for this change.
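
As an illustration of the forward/backward split described in this draft, the following sketch keeps two separate propagator lists so that each header only appears on the message it is meant for. None of this is existing OpenTelemetry API: `Propagators`, `TraceContextForward`, `ServerTimingBackward`, and the dict-based context are hypothetical stand-ins, and the traceresponse payload simply mirrors the example above because its final format is still being defined.

```python
from typing import Dict, List, Protocol

Carrier = Dict[str, str]
Context = Dict[str, str]  # hypothetical stand-in with trace_id, span_id, flags


class Injector(Protocol):
    def inject(self, ctx: Context, carrier: Carrier) -> None: ...


def _encode(ctx: Context) -> str:
    return f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['flags']}"


class TraceContextForward:
    """Forward (request) propagator: writes traceparent for the next hop."""

    def inject(self, ctx: Context, carrier: Carrier) -> None:
        carrier["traceparent"] = _encode(ctx)


class ServerTimingBackward:
    """Backward (response) propagator: writes Server-Timing for the caller."""

    def inject(self, ctx: Context, carrier: Carrier) -> None:
        metric = f"traceresponse;desc={_encode(ctx)}"
        existing = carrier.get("Server-Timing")
        # Keep any metrics the application already set on the response.
        carrier["Server-Timing"] = f"{existing}, {metric}" if existing else metric


class Propagators:
    """Holds the two lists separately so they are never mixed."""

    def __init__(self, forward: List[Injector], backward: List[Injector]) -> None:
        self.forward = forward
        self.backward = backward

    def inject_request(self, ctx: Context, headers: Carrier) -> None:
        for propagator in self.forward:
            propagator.inject(ctx, headers)

    def inject_response(self, ctx: Context, headers: Carrier) -> None:
        for propagator in self.backward:
            propagator.inject(ctx, headers)
```

With this shape, outgoing requests carry only `traceparent` and responses carry only `Server-Timing`, which is the ambiguity the draft aims to avoid.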

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 15, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@MSNev
Contributor

MSNev commented Jan 16, 2024

Yes, we should support passing this via the Server-Timing headers for browsers that support it.

@bripkens

About

Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

and

While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span)

Note that the addition of custom request headers in XHR/fetch instrumentations is prone to cause same-origin policy issues. This can be worked around using CORS, but this causes significant friction and is commonly misunderstood.

Ideally, a correlation solution does not have to (solely) rely on additional request headers. Server-Timing makes this more reliable and easier to deploy for users.
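
To show why an added request header causes this friction, here is a simplified sketch of the browser-side decision for cross-origin requests. It models only the core of the CORS-safelisted method and header rules; the Fetch standard adds further value restrictions (for example on header length and on `Range`) that are deliberately left out. The function name `cross_origin_needs_preflight` is hypothetical.

```python
from typing import Mapping

# Simplified view of the CORS safelists from the Fetch standard.
SAFELISTED_METHODS = {"GET", "HEAD", "POST"}
SAFELISTED_HEADERS = {
    "accept",
    "accept-language",
    "content-language",
    "content-type",
    "range",
}
SAFELISTED_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}


def cross_origin_needs_preflight(method: str, author_headers: Mapping[str, str]) -> bool:
    """Return True if a cross-origin request would trigger an OPTIONS preflight."""
    if method.upper() not in SAFELISTED_METHODS:
        return True
    for name, value in author_headers.items():
        lowered = name.lower()
        if lowered not in SAFELISTED_HEADERS:
            # Any non-safelisted header, such as traceparent, forces a preflight.
            return True
        if lowered == "content-type":
            essence = value.split(";", 1)[0].strip().lower()
            if essence not in SAFELISTED_CONTENT_TYPES:
                return True
    return False
```

Under this model, a plain cross-origin GET goes out directly, while the same GET with a `traceparent` header requires a preflight that the server must answer with a matching `Access-Control-Allow-Headers`. A response-side Server-Timing entry adds nothing to the request and so avoids the preflight.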

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 23, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@dyladan
Member

dyladan commented Jan 30, 2024

The W3C distributed tracing working group met with the Web Performance Working Group about exactly this today. Notes are in the tracking issue created by @jpkrohling w3c/trace-context#556

The short version is that we are encouraged by the possibility of using server timing. The group had previously decided to define a custom header purely because server-timing was nascent, but the landscape is significantly improved now. The next step is for the tracing working group to translate its existing draft response header spec into a version which uses a server-timing metric.

@dyladan
Member

dyladan commented Jan 30, 2024

The biggest concern currently with Server-Timing is browser support. It is not available in Safari or iOS currently, and according to https://caniuse.com/server-timing it is available for about 75% of users. After discussion with the web performance group, it seems that Safari support is held back due to privacy concerns and is likely to be restricted to a same-origin policy regardless of CORS opt-in or Timing-Allow-Origin.

@jpkrohling
Member

jpkrohling commented Jan 31, 2024

@dyladan, what I understood from w3c/trace-context#556 regarding this last concern is that we'd face the same challenges with Safari, so we'd be in no better position if Trace Context decided to have its own response header, right?

@dyladan
Member

dyladan commented Jan 31, 2024

Now, yes. In 2018 when the question was first considered the answer was less clear. I wasn't sharing it as a reason not to use server timing, just trying to make sure everyone was aware of the limitations.

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 31, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 13, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 14, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@scheler
Contributor

scheler commented Feb 21, 2024

@jpkrohling @johnbley The current traceresponse spec only enables the client to add a span link to the server side span, it does not enable the typical parent-child relationship of spans in a trace. Are you proposing that the spec change should enable both? If so, then it's the traceresponse spec that needs to enable both these scenarios (perhaps a trace-flag could indicate whether the span id provided is a child-id or parent-id). On the other hand, if you are suggesting that we do only span linking then that's fine too and would be simpler overall, as more options is more complexity.

@jpkrohling jpkrohling self-assigned this Feb 22, 2024
jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 22, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@martinkuba
Contributor

For completeness, there is another way to propagate the context back to the client for web document load, and that is by writing a meta tag in the HTML content. This is currently implemented in the OTel document-load instrumentation. This at least has the advantage of working on Safari as well.
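
A rough sketch of the meta tag approach follows, assuming the tag is named `traceparent` and carries the traceparent string in its `content` attribute. The helper names `render_traceparent_meta` and `extract_traceparent` are hypothetical; in practice the reading side runs in the browser, and Python's `html.parser` is used here only to keep the example self-contained.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional, Tuple


def render_traceparent_meta(traceparent: str) -> str:
    """Server side: render the tag to place inside the document's <head>."""
    return f'<meta name="traceparent" content="{escape(traceparent, quote=True)}">'


class _MetaFinder(HTMLParser):
    """Records the content of the first <meta name="traceparent"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.traceparent: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag != "meta" or self.traceparent is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "traceparent":
            self.traceparent = attributes.get("content")


def extract_traceparent(html_text: str) -> Optional[str]:
    """Client side (illustrative): read the traceparent from the document."""
    finder = _MetaFinder()
    finder.feed(html_text)
    return finder.traceparent
```

Because the value travels in the document body and not in a response header, this only covers the initial document load, not resource fetches.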

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Mar 11, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Apr 4, 2024
Relates to open-telemetry#3811

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@tigrannajaryan tigrannajaryan added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:needs-info Not enough information. Left open to provide the author with time to add more details labels Apr 19, 2024
@tigrannajaryan
Member

This seems non-trivial enough to need an OTEP with more details.

@jmacd
Contributor

jmacd commented Apr 19, 2024

I think this is the status: #3825 (comment)

@jack-berg jack-berg added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:needs-info Not enough information. Left open to provide the author with time to add more details labels Apr 23, 2024
@jack-berg
Member

Discussed in the 4/23/24 Spec SIG. Given that the problem is very related to browsers, it might be appropriate for the client SIG to work on this. I've set the status to triage:accepted:needs-sponsor, but can update if the client SIG wants to take this on.

@jpkrohling
Member

Given that I opened a PR for this already (#3825), I'm OK being the sponsor.

@danielgblanco
Contributor

Maybe out of scope for this issue (or the PR above raised by @jpkrohling), but if response propagators were to be configured to propagate context back to callers, would this be a good use of tracestate in response headers, to propagate a low-cardinality http.route that can be used not only by browser clients, but also by HTTP proxies, or in fact any client, to apply a more informative name to both client spans and (in the case of proxies) server spans?

@austinlparker austinlparker moved this to Spec - Accepted in 🔭 Main Backlog Jul 16, 2024
@austinlparker austinlparker added triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned and removed triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor labels Jul 16, 2024
@austinlparker austinlparker moved this from Spec - Accepted to Spec - In Progress in 🔭 Main Backlog Jul 16, 2024
Projects
Status: Spec - In Progress