Standardize Server-Timing: traceparent "propagator" across vendors #3811

johnbley · 2024-01-10T18:02:16Z

What are you trying to achieve?

Multiple otel vendors have used HTTP Server-Timing headers to propagate
server-side instrumentation context back to client instrumentation. I
would like the otel specification to canonicalize the names, formats,
and configuration options for this, and for the various otel implementations
to accept donated implementations of this concept.

Additional context.

Client-side instrumentation (in the sense of web or mobile apps) may set outbound context
via http headers which may be received by server-side instrumentation. However, there are a few
cases where this breaks down:

- users that want to keep a trust boundary between these domains and don't want
  untrusted clients to influence the way their server-side instrumentation behaves
- initial page loads and resource loads in browsers (which JavaScript instrumentation
  can't influence)
- users that don't want the added complexity/overhead of CORS preflight from browsers
  caused by adding headers to fetch/xhr requests

Multiple otel vendors have landed on a solution to the second point above, by using Server-Timing
response headers generated by server-side instrumentation and received by client-side instrumentation.

A few links for your reference:

https://www.w3.org/TR/server-timing/ which is the spec for the server-timing header
https://caniuse.com/?search=server-timing (showing that this header is well-supported by browsers)
https://www.w3.org/TR/trace-context/#traceparent-header (the otel default for propagation)

Server-Timing response headers are keyed to a name (conceptually it could be used like
"app=400, db=300, env=prod3"). Several otel vendors/contributors have independently used this in
the fairly obvious way, where the key used is traceparent and the value is the full traceparent-format
string. A complete example would be:

Server-Timing: traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
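A minimal sketch of how server-side instrumentation might build that header value. The function name `build_server_timing` and the validation regex are illustrative assumptions, not taken from any of the vendor implementations linked in this issue.

```python
import re

# W3C traceparent layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def build_server_timing(trace_id: str, span_id: str, sampled: bool) -> str:
    """Return a Server-Timing header value carrying the current trace context.

    Hypothetical helper: the metric name is "traceparent" and the full
    traceparent string goes into the quoted "desc" parameter.
    """
    flags = "01" if sampled else "00"
    traceparent = f"00-{trace_id}-{span_id}-{flags}"
    if not TRACEPARENT_RE.match(traceparent):
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    return f'traceparent;desc="{traceparent}"'


header_value = build_server_timing(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True
)
print("Server-Timing:", header_value)
```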

Some existing examples from around the otel universe:

Grafana donated PHP contrib code treating the server-timing header as a "propagator": https://github.com/open-telemetry/opentelemetry-php-contrib/tree/main/src/Propagation/ServerTiming
Splunk code from our java otel distribution: https://github.com/signalfx/splunk-otel-java/blob/c73b94575488458c1d267af3514fb0db25e48935/custom/src/main/java/com/splunk/opentelemetry/servertiming/ServerTimingHeaderCustomizer.java#L45
Microsoft client-side code looking for the header: https://github.com/microsoft/ApplicationInsights-JS/blob/7f804d81e3036d5115c0c8e859dec5c4ce08b269/shared/AppInsightsCore/src/JavaScriptSDK/W3cTraceParent.ts#L191-L204

Each one uses the exact same "propagation" concept (traceparent value format, traceparent is the key name in
the server-timing header). They do differ in configuration/setup, and they also differ in client usage -
for example, Microsoft's product (to my knowledge) uses it on browser page load to set the actual trace context
for the page load, while Splunk clients add the server-side trace context as a trace link to the appropriate
client-side http client span. In my opinion the specification can sidestep this issue (directed usage of the propagated
context) for now, or recommend making it configurable.
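To illustrate the two client usages described above, here is a rough sketch of reading the header and then either adopting the server context as the parent or recording it as a link. The names `extract_traceparent` and `apply_server_context`, and the `mode` values, are hypothetical. The parser is deliberately naive: it splits on commas and semicolons, which is fine for traceparent values but would not handle a quoted `desc` containing those characters.

```python
from typing import Dict, Optional


def extract_traceparent(server_timing: str) -> Optional[str]:
    """Find the "traceparent" metric in a Server-Timing value and return its desc."""
    for metric in server_timing.split(","):
        parts = [p.strip() for p in metric.split(";")]
        if parts[0].lower() != "traceparent":
            continue
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "desc":
                return value.strip().strip('"')
    return None


def apply_server_context(traceparent: str, mode: str) -> Dict[str, object]:
    """Describe how a client span would use the server context (hypothetical)."""
    _version, trace_id, span_id, _flags = traceparent.split("-")
    if mode == "parent":
        # Adopt the server trace as the client span's own trace.
        return {"trace_id": trace_id, "parent_span_id": span_id, "links": []}
    if mode == "link":
        # Keep the client trace and attach the server span as a span link.
        return {"trace_id": None, "parent_span_id": None, "links": [(trace_id, span_id)]}
    raise ValueError(f"unknown mode: {mode!r}")


raw = 'db;dur=53, traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"'
tp = extract_traceparent(raw)
print(apply_server_context(tp, "link"))
```

Making `mode` a setting is one way to read the "recommend making it configurable" option.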

Questions towards a specification

What is this? A "propagator"? In this sense a user could set their configured propagators for
server-side instrumentation to be tracecontext,baggage,servertiming and the servertiming propagator
would propagate back to the client?
If it's not a propagator, how would it fit into the spec and how would it be
configured on/off (e.g., environment variables)?
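If it were treated as a propagator, configuration could look roughly like the sketch below. `OTEL_PROPAGATORS` is the existing environment variable; the `servertiming` value and the `split_propagators` helper are hypothetical and only illustrate the question being asked.

```python
import os
from typing import List, Tuple

# Hypothetical: names of propagators that write to responses rather than requests.
RESPONSE_PROPAGATORS = {"servertiming"}


def split_propagators(value: str) -> Tuple[List[str], List[str]]:
    """Split a comma-separated propagator list into (request side, response side)."""
    names = [n.strip() for n in value.split(",") if n.strip()]
    request_side = [n for n in names if n not in RESPONSE_PROPAGATORS]
    response_side = [n for n in names if n in RESPONSE_PROPAGATORS]
    return request_side, response_side


configured = os.environ.get("OTEL_PROPAGATORS", "tracecontext,baggage,servertiming")
print(split_propagators(configured))
```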

t2t2 · 2024-01-10T18:11:16Z

FYI @cedricziel as you expressed interest in writing a proposal for this in slack before

mmanciop · 2024-01-10T18:44:07Z

Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

johnbley · 2024-01-10T20:03:18Z

> Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes. 😞
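A small sketch of the response headers this implies for cross-origin xhr/fetch. The helper name and the `cross_origin` flag are assumptions; the response would also need the usual CORS allow headers, which are not shown here.

```python
from typing import Dict


def response_headers(traceparent: str, cross_origin: bool) -> Dict[str, str]:
    """Build the headers a server might add so client JS can read Server-Timing."""
    headers = {"Server-Timing": f'traceparent;desc="{traceparent}"'}
    if cross_origin:
        # Ask the browser to expose the header to script on cross-origin responses.
        headers["Access-Control-Expose-Headers"] = "Server-Timing"
    return headers


example = response_headers(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01", cross_origin=True
)
print(example)
```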

cedricziel · 2024-01-10T20:52:25Z

Seconding @mmanciop here. We would love to see a sustainable and supported way to communicate server context to client side technology.

Server-Timing is widely used even beyond the implementations mentioned and I think OTel would benefit a lot from a specification of using it for the purpose of forwarding context to clients.

jpkrohling · 2024-01-11T08:38:48Z

Talk about timing! I was just discussing this yesterday with a few other folks and even opened this here:
w3c/trace-context#556

Here are a few notes based on my investigation so far:

- we might want to use traceresponse instead of traceparent, which is defined in a draft of Trace Context. It's almost the same payload, except that it uses child-id instead of parent-id in one of the fields
- on the OTel SDK side, we might need to split the notion of propagators that are sending data forward (request propagators), and propagators that are sending data back to the callers (response propagators). Otherwise, we'll send traceparent back to clients and traceresponse to servers. I'm working on a draft proposal for this and will be opening an OTEP soon.
- a second OTEP would be a change to the SDK spec, so that all SDKs can implement a "backward propagator" (response propagator) for Server-Timing + traceresponse.
- I believe that the W3C Trace Context WG should revisit the decision about the header name for traceresponse before stabilizing the draft, favoring the use of Server-Timing with a metric name "traceresponse" instead of a new "traceresponse" header. I gathered some evidence of usages of Server-Timing for our purposes in the linked W3C Trace Context issue, and I'm happy to see that @johnbley's research corroborates it.
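As a rough illustration of the first and last points, the sketch below models a traceresponse value with a child-id field and renders it as a Server-Timing metric. The class and function names are made up for this example, the field layout follows the description above rather than a finalized spec, and quoting the `desc` value is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceResponse:
    version: str
    trace_id: str
    child_id: str  # takes the place of parent-id in traceparent
    trace_flags: str


def parse_traceresponse(value: str) -> TraceResponse:
    """Split a version-traceid-childid-flags string into its four fields."""
    parts = value.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(parts)}")
    return TraceResponse(*parts)


def to_server_timing(tr: TraceResponse) -> str:
    """Render the value as a Server-Timing metric named "traceresponse"."""
    payload = "-".join([tr.version, tr.trace_id, tr.child_id, tr.trace_flags])
    return f'traceresponse;desc="{payload}"'


tr = parse_traceresponse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(to_server_timing(tr))
```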

johnbley · 2024-01-11T16:47:31Z

I like all of what @jpkrohling has to say. I like the idea of using Server-Timing: traceresponse since it is semantically clearer (again, if the client wants to use this as a parent, more power to it). I also really like the idea of a spec-level differentiation between request propagators and response propagators (though maybe not a config-level differentiation - we have enough environment variables as it is). It seems like it would enable other areas of innovation around sampling approaches or overall coordination among instrumentation code.

mmanciop · 2024-01-12T14:27:55Z

> In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes.

To explain the impact to others landing here: ServerTiming is the only known way (AFAICT) in the RUM industry to reliably correlate document (page) loads or resource (pre)fetches with distributed traces, as the distributed tracer can inject the ServerTiming header in outgoing responses with values that represent the active trace context. And Safari is not playing ball as of Jan 2024 :-)

Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.
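For contrast, the request-side flow described here is easy to sketch: the client generates IDs and adds a traceparent header to the outgoing request. This is written in Python only for illustration (a browser would do this in JS), and `new_traceparent` and `inject` are hypothetical helpers.

```python
import secrets
from typing import Dict


def new_traceparent(sampled: bool = True) -> str:
    """Generate a fresh W3C traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(headers: Dict[str, str], traceparent: str) -> Dict[str, str]:
    """Return a copy of the request headers with traceparent added."""
    out = dict(headers)
    out["traceparent"] = traceparent
    return out


request_headers = inject({"Accept": "application/json"}, new_traceparent())
print(request_headers)
```

There is no equivalent hook for document loads or resource fetches, which is the gap Server-Timing fills.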

jpkrohling · 2024-01-15T09:46:43Z

Great point, @mmanciop. I was having problems understanding why we couldn't do it with our current solutions until @cedricziel showed me this diagram he created:

jpkrohling · 2024-01-15T13:18:39Z

As mentioned in a previous comment, I was getting ready to propose a spec change related to this, and here's the draft I had. Note that I was breaking down the task into smaller chunks, the first one being expanding the notion of propagators so that we define what's a "client propagator". The next one, based on the outcome of the W3C Trace Context issue I listed earlier, would be to define the first client propagator based on traceresponse (either its own header, or as a metric of Server-Timing).

When working with client-side instrumentation, such as the ones being developed under the Client Instrumentation SIG, there’s currently no reliable way to obtain the trace context or any references to the trace generated by the backend during the initial document request on the client. While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span), other scenarios might still be hard to implement. For instance, the response of a backend might cause a re-render of a UI component, and currently, it’s not possible to link the trace related to that re-render to the root span of the backend trace unless the trace ID has been created by the frontend and reused in the backend.

This spec change proposal enhances the concept of propagators to differentiate between “backward propagators” (or response propagators) and “forward propagators” (or request propagators):

Our current propagators are what’s then going to be called “forward propagator”, given that they are intended to propagate the context to the next steps in the call chain. This is typically added to request headers, but we want to avoid the term “request”, as this is not bound to HTTP requests.
Backward propagators are a new concept, propagating the context back to the caller, to the previous step in the call chain. This is typically added to response headers in an HTTP scenario.

Without this differentiation, when implementing context propagation to clients, it would result in headers being sent in the request and response payloads that are not intended to be there, which might cause ambiguity, conflicts, and increased payload size. For example, an application configured with TraceContext propagator and a new hypothetical ClientPropagator might end up sending the following headers to all their outgoing requests to downstream services and to their responses to callers:

Server-Timing: traceresponse;desc=00-123-456-01
traceparent: 00-123-789-01

This spec change is agnostic to the payload and relates only to enhancing the definition of propagators. The payload that would be used in the first recommended backward propagator is still under definition by the W3C Trace Context working group and is, therefore, out of scope for this change.
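A minimal sketch of the forward/backward split, assuming each propagator declares a direction and the SDK only invokes the ones matching the carrier being written. All class and function names here are hypothetical and do not come from the draft; for brevity the same span ID is used for both directions, whereas the example above uses different ones.

```python
from typing import Dict, List, Protocol


class Propagator(Protocol):
    direction: str  # "forward" (requests) or "backward" (responses)

    def inject(self, context: Dict[str, str], carrier: Dict[str, str]) -> None: ...


class TraceContextPropagator:
    direction = "forward"

    def inject(self, context: Dict[str, str], carrier: Dict[str, str]) -> None:
        carrier["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"


class ServerTimingPropagator:
    direction = "backward"

    def inject(self, context: Dict[str, str], carrier: Dict[str, str]) -> None:
        value = f"00-{context['trace_id']}-{context['span_id']}-01"
        carrier["Server-Timing"] = f"traceresponse;desc={value}"


def inject_all(propagators: List[Propagator], context: Dict[str, str],
               carrier: Dict[str, str], direction: str) -> None:
    """Run only the propagators that match the carrier's direction."""
    for propagator in propagators:
        if propagator.direction == direction:
            propagator.inject(context, carrier)


configured = [TraceContextPropagator(), ServerTimingPropagator()]
ctx = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}
outgoing_request: Dict[str, str] = {}
outgoing_response: Dict[str, str] = {}
inject_all(configured, ctx, outgoing_request, "forward")
inject_all(configured, ctx, outgoing_response, "backward")
print(outgoing_request, outgoing_response)
```

With the direction filter in place, traceparent only lands on downstream requests and Server-Timing only on responses to callers.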

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

MSNev · 2024-01-16T17:33:42Z

Yes, we should support passing this via the Server-Timing headers for browsers that support it.

bripkens · 2024-01-18T14:36:55Z

About

> Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

and

> While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span)

Note that the addition of custom request headers in XHR/fetch instrumentations is prone to cause same-origin policy issues. This can be worked around using CORS, but this causes significant friction and is commonly misunderstood.

Ideally, a correlation solution does not have to (solely) rely on additional request headers. Server-Timing makes this more reliable and easier to deploy for users.
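To make the friction concrete, here is a simplified sketch of the rule browsers apply to cross-origin requests: any request header outside the CORS safelist triggers a preflight. The function name is hypothetical, and the check ignores the request method and the safelist's value and size restrictions other than the Content-Type media type.

```python
from typing import Dict

# CORS-safelisted request header names (Fetch standard), lowercased.
SAFELISTED = {"accept", "accept-language", "content-language", "content-type", "range"}
SAFE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}


def triggers_preflight(headers: Dict[str, str]) -> bool:
    """Return True if these request headers would force a CORS preflight."""
    for name, value in headers.items():
        lname = name.lower()
        if lname not in SAFELISTED:
            return True
        if lname == "content-type":
            media_type = value.split(";")[0].strip().lower()
            if media_type not in SAFE_CONTENT_TYPES:
                return True
    return False


print(triggers_preflight({"Accept": "application/json"}))
print(triggers_preflight({"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}))
```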

dyladan · 2024-01-30T21:06:25Z

The W3C distributed tracing working group met with the Web Performance Working Group about exactly this today. Notes are in the tracking issue created by @jpkrohling w3c/trace-context#556

The short version is that we are encouraged by the possibility of using server timing. The group had previously decided to define a custom header purely because server-timing was nascent, but the landscape is significantly improved now. The next step is for the tracing working group to translate its existing draft response header spec into a version which uses a server-timing metric.

dyladan · 2024-01-30T21:08:23Z

The biggest concern currently with Server-Timing is browser support. It is not available in Safari or iOS currently, and according to https://caniuse.com/server-timing it is available for about 75% of users. After discussion with the web performance group, it seems that Safari support is held back due to privacy concerns and is likely to be restricted to a same-origin policy regardless of CORS opt-in or Timing-Allow-Origin.

jpkrohling · 2024-01-31T12:40:08Z

@dyladan, what I understood from w3c/trace-context#556 regarding this last concern is that we'd face the same challenges with Safari, so we'd be in no better position if Trace Context decided to have its own response header, right?

dyladan · 2024-01-31T12:43:09Z

Now, yes. In 2018 when the question was first considered the answer was less clear. I wasn't sharing it as a reason not to use server timing, just trying to make sure everyone was aware of the limitations.

scheler · 2024-02-21T01:34:38Z

@jpkrohling @johnbley The current traceresponse spec only enables the client to add a span link to the server side span, it does not enable the typical parent-child relationship of spans in a trace. Are you proposing that the spec change should enable both? If so, then it's the traceresponse spec that needs to enable both these scenarios (perhaps a trace-flag could indicate whether the span id provided is a child-id or parent-id). On the other hand, if you are suggesting that we do only span linking then that's fine too and would be simpler overall, as more options is more complexity.

martinkuba · 2024-02-23T00:02:12Z

For completeness, there is another way to propagate the context back to the client for web document load, and that is by writing a meta tag in the HTML content. This is currently implemented in the OTel document-load instrumentation. This at least has the advantage of working on Safari as well.
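A sketch of what the server-side half of the meta tag approach might look like when rendering HTML. The tag name `traceparent` is an assumption about what the document-load instrumentation reads, and `traceparent_meta_tag` is a hypothetical helper.

```python
from html import escape


def traceparent_meta_tag(traceparent: str) -> str:
    """Render a <meta> tag carrying the server trace context for the page load."""
    # Escape the value so it is safe inside a double-quoted HTML attribute.
    return f'<meta name="traceparent" content="{escape(traceparent, quote=True)}">'


tag = traceparent_meta_tag("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tag)
```

Unlike a response header, this requires the server to modify the HTML body, so it only applies to document loads and not to other resource fetches.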

tigrannajaryan · 2024-04-19T18:05:27Z

This seems non-trivial enough to need an OTEP with more details.

jmacd · 2024-04-19T18:13:21Z

I think this is the status: #3825 (comment)

jack-berg · 2024-04-23T15:35:20Z

Discussed in the 4/23/24 Spec SIG. Given that the problem is very related to browsers, it might be appropriate for the client SIG to work on this. I've set the status to triage:accepted:needs-sponsor, but can update if the client SIG wants to take this on.

jpkrohling · 2024-04-23T15:47:58Z

Given that I opened a PR for this already (#3825), I'm OK being the sponsor.

danielgblanco · 2024-06-17T11:08:43Z

Maybe out of scope for this issue (or the PR above raised by @jpkrohling ) but if response propagators were to be configured to propagate context back to callers, would this be a good use of tracestate in response headers, to propagate a low-cardinality http.route that can be used not only by browser clients, but also by HTTP proxies, or in fact any client, to apply a more informative name both on client (and server in case of proxies) spans?

johnbley added the spec:context Related to the specification/context directory label Jan 10, 2024

github-actions bot assigned jmacd Jan 10, 2024

basti1302 mentioned this issue Jan 11, 2024

Revisit header name -- Server-Timing vs. traceresponse w3c/trace-context#556

Open

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 15, 2024

Context propagation to client instrumentation

457a27b

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling mentioned this issue Jan 15, 2024

Context propagation to client instrumentation #3825

Closed

5 tasks

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 23, 2024

Context propagation to client instrumentation

c4d8c76

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Jan 31, 2024

Context propagation to client instrumentation

fc3b214

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 13, 2024

Context propagation to client instrumentation

612e2de

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 14, 2024

Context propagation to client instrumentation

8dacc0f

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling self-assigned this Feb 22, 2024

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Feb 22, 2024

Context propagation to client instrumentation

691bd09

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Mar 11, 2024

Context propagation to client instrumentation

47940da

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

basti1302 mentioned this issue Mar 12, 2024

Use server-timing for trace context response w3c/trace-context#560

Draft

jpkrohling added a commit to jpkrohling/opentelemetry-specification that referenced this issue Apr 4, 2024

Context propagation to client instrumentation

29e2422

Relates to open-telemetry#3811 Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

meastp mentioned this issue Apr 10, 2024

API for response headers #1355

Open

tigrannajaryan added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:needs-info Not enough information. Left open to provide the author with time to add more details labels Apr 19, 2024

martinkuba added this to Spec: Client SDK and Instrumentation Apr 23, 2024

martinkuba moved this to Backlog in Spec: Client SDK and Instrumentation Apr 23, 2024

dyladan mentioned this issue May 21, 2024

Best way to to support traceresponse and load balancer deferred sampling? #2914

Closed

danielgblanco mentioned this issue Jun 17, 2024

Use client/server span name as x-envoy-decorator-operation request/response header open-telemetry/opentelemetry-java-instrumentation#11376

Closed

austinlparker added this to 🔭 Main Backlog Jul 16, 2024

austinlparker moved this to Spec - Accepted in 🔭 Main Backlog Jul 16, 2024

austinlparker added triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned and removed triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor labels Jul 16, 2024

austinlparker unassigned jmacd Jul 16, 2024

austinlparker moved this from Spec - Accepted to Spec - In Progress in 🔭 Main Backlog Jul 16, 2024

MSNev mentioned this issue Jul 24, 2024

feat(user-interaction): adding traceparent context from meta tag open-telemetry/opentelemetry-js-contrib#1850

Open
