Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext (take 2) #277

Closed

Conversation

Aneurysm9
Copy link
Member

This presents another alternative to #263 and #272. This is an initial effort to codify the conclusion reached by the FaaS SIG in January.

@Aneurysm9 Aneurysm9 requested review from a team August 21, 2023 21:36
…SpanContext`

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
@trask
Copy link
Member

trask commented Aug 21, 2023

@Aneurysm9 thanks for giving this another take

Is my understanding of this change correct given Scenario 1 @jsuereth described in #272 (comment)?

  • If both Service A and Service B are using the default W3C propagator, then the Service B span will parent the Service A span.
  • If Service B opts into the xray propagator (either programmatically or using the OTEL_PROPAGATORS env var), then the Service B span will parent the vended span.

And one other question (assuming I understood the above correctly):

  • Is this what the go lambda instrumentation does already? (I think I had seen that neither X-Ray nor other vendors had an issue with the current state of the go lambda instrumentation and wondering if this is how?)

@Aneurysm9
Copy link
Member Author

Is my understanding of this change correct given Scenario 1 @jsuereth described in #272 (comment)?

  • If both Service A and Service B are using the default W3C propagator, then the Service B span will parent the Service A span.
  • If Service B opts into the xray propagator (either programmatically or using the OTEL_PROPAGATORS env var), then the Service B span will parent the vended span.

This aligns with my understanding with the exception that the W3C tracecontext propagator is not the default. The specification requires that a no-op propagator be used by the API by default (and by extension then by instrumentation that uses that API).

And one other question (assuming I understood the above correctly):

  • Is this what the go lambda instrumentation does already? (I think I had seen that neither X-Ray nor other vendors had an issue with the current state of the go lambda instrumentation and wondering if this is how?)

Yes. The Go Lambda instrumentation includes configuration options for specifying a TracerProvider, Propagator, and an EventToCarrier implementation. The Go instrumentation guidelines require that instrumentation accept a TracerProvider and Propagator and utilize the global values if one is not explicitly provided.

A simple EventToCarrier implementation that can produce a carrier containing the value from the Lambda execution environment is included in a separate package to ensure that the instrumentation is usable without any requirement to utilize any AWS-provided values or propagators. That distinction was discussed on open-telemetry/opentelemetry-go-contrib#3428 (comment) with the conclusion by the Go SIG that no changes were required there at this time. Indeed, that change would have made that instrumentation more tightly coupled to the current AWS Lambda service behavior that it presently is while also breaking the ability that AWS X-Ray users had to produce a complete and unbroken trace in X-Ray.

docs/faas/aws-lambda.md Outdated Show resolved Hide resolved
contain an incomplete trace context which indicates X-Ray isn’t enabled. The environment variable will be set and the
`Context` will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can
disable AWS X-Ray for the function if the X-Ray Span Link is not desired.
If the `_X_AMZN_TRACE_ID` environment variable is set, instrumentation MUST ensure that the value from that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this actually needs prototyping in languages-not-Go. I like the direction of this (pushing precedence configuration to propagation config), but I'm not sure our current implementations will allow this.

WDYT @trask @dyladan for Java / JavaScript?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work. The instrumentation has control over both the carrier and the function that reads fields from that carrier (so it could always create a synthetic carrier / accessor function). @rapphil would you or someone at AWS be able to send a Java draft PR to prove this out?

Copy link
Contributor

@rapphil rapphil Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @trask I had previously created this POC rapphil/opentelemetry-java-instrumentation#1 for this proposal.

I will update it and share the details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a prototype but can't imagine that this would be difficult in JS. We have full control of the carrier and it is just a plain javascript object. There might be an additional allocation to ensure the original header object is not mutated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @trask I had previously created this POC rapphil/opentelemetry-java-instrumentation#1 for this proposal.

I will update it and share the details.

when you're ready, go ahead and send it as a draft PR so we can review / add comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsuereth
Copy link
Contributor

jsuereth commented Aug 22, 2023

This aligns with my understanding with the exception that the W3C tracecontext propagator is not the default. The specification requires that a no-op propagator be used by the API by default (and by extension then by instrumentation that uses that API).

Unfortunately that dictates the API, but the SDK has default configuration specified that's in conflcit with that line, see:

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/configuration/sdk-environment-variables.md#general-sdk-configuration

To be more explicit, Our Propagator-API spec says:

The OpenTelemetry API MUST use no-op propagators unless explicitly configured
otherwise.

The original PR for that line states:

Default propagators in un-configured API must be no-op

So, in reality, once you've configured an SDK the default is w3c traceparent + baggage. A noop propagation only happens when the SDK has not been configured.

@Aneurysm9
Copy link
Member Author

Unfortunately that dictates the API, but the SDK has default configuration specified that's in conflcit with that line, see:

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/configuration/sdk-environment-variables.md#general-sdk-configuration

That is an optional part of the specification governing configuration of the SDK through environment variables and is not the case in all SDKs:

SDKs MAY choose to allow configuration via the environment variables in this specification, but are not required to.

So, in reality, once you've configured an SDK the default is w3c traceparent + baggage. A noop propagation only happens when the SDK has not been configured.

I think we generally agree on this, though may disagree on when or whether an SDK should cause a non-no-op global propagator to be configured in the absence of explicit user direction.

The OpenTelemetry API MUST use no-op propagators unless explicitly configured otherwise.

I believe that this line in the API spec requires explicit configuration and a default value being set by an SDK without explicit user configuration is in contravention of that spec. I believe the API spec is controlling here as the propagation API is entirely specified within the API spec and an SDK is a user of that propagation API, but that is not relevant to this discussion.

@jsuereth
Copy link
Contributor

That is an optional part of the specification governing configuration of the SDK through environment variables and is not the case in all SDKs

I believe Go may be the only SDK that hasn't implemented this portion of the specification now, and it's my understanding they're adding this configuration now.

I'm not sure the interpretation of the specification you're taking matches what we see in implementations across SDKs in the environment, nor do I think it represents a consensus view from the community.

docs/faas/aws-lambda.md Outdated Show resolved Hide resolved
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
@trask
Copy link
Member

trask commented Aug 22, 2023

@Aneurysm9 Is xray the only propagation mechanism that works when propagating context to a message-driven lambda (making xray propagation essential within AWS infra whether or not you are using X-Ray backend)?

@Aneurysm9
Copy link
Member Author

@Aneurysm9 Is xray the only propagation mechanism that works when propagating context to a message-driven lambda (making xray propagation essential within AWS infra whether or not you are using X-Ray backend)?

I do not believe so. It is the only mechanism presently supported by AWS Lambda within SQS message headers if you want AWS Lambda to participate in the trace. To the extent that instrumentation does not offer user control over the construction of the propagation Carrier then there may be instrumentation that does support the use of the X-Ray propagation format but not others due to the choices they have made regarding Carrier construction. See open-telemetry/opentelemetry-specification#3212 (review) and my comments on that PR that was merged over my objection.

@jsuereth
Copy link
Contributor

We had a discussion in the TC meeting about this issue. I'd like to recap the consensus and direction from that meeting:

  1. We believe that OpenTelemetry should promote, fix and adopt the W3C traceparent specification, making this its default trace propagation framework and ensuring users get the best visibility when using this framework. This includes fixing concerns like @Oberon00's comment here.
  2. We believe users should have flexibility in trace propagation formats and solutions for dealing with proprietary formats, like AWS x-ray or Google Cloud's X-Cloud-Trace-Context. These should be handled via the OpenTelemetry propagator framework and its configuration.
  3. OpenTelemetry's existing propagator framework was designed in the assumption that propagation of information would be orthogonal (e.g. w3c traceparent + w3c baggage) and effectively forces users to have only one active trace parent propagation format active at a time. We believe this needs to be refined, given what we see in real world usage.

Given these fundamental beliefs, we (the technical committee) propose the following:

  • This PR (FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext (take 2) #277) which updated AWS instrumentation guidance to encourage the usage of Propagation API is the right direction forward. The language should be updated so that instrumentation will defer to OTEL's propagation framework and leverage the mechanism of exposing carrier/key-values that allow x-ray to participate in that framework.
  • Users should be able to configure whether they prefer to parent against a w3c traceparent or the x-ray propagation by using the Propagator API.
  • The default priority for propagation within OTEL SDKs is (and should remain) the w3c traceparent. However, users may configure their distribution to prefer x-ray over w3c, which would give them the behavior that exists in javascript/python today.
  • We would like to explore how to deal with multiple potential parents, and attaching unchosen parents as "links" as future specification work and improvements to the Propagation API.
  • We encourage the FAAS group to provide prototypes and examples of the above.

I'll be ensuring this comment is shared in relevant PRs and issues, unfortuantely I could not find a central place for discussion, so I just chose the latest PR.

@trask
Copy link
Member

trask commented Aug 23, 2023

[xray propagation format] is the only mechanism presently supported by AWS Lambda within SQS message headers if you want AWS Lambda to participate in the trace.

it seems that this makes the xray propagation format essentially required in at least this one AWS scenario even if you are not using the X-Ray backend

in which case, I wonder if it would make sense to have two propagation formats:

  • xray - for users of the X-Ray backend who want to prioritize the X-Ray vended spans
  • aws - very similar to xray, but for users of other monitoring backends who are using AWS Lambda + SQS, but do not want to prioritize the X-Ray vended spans

@Aneurysm9
Copy link
Member Author

[xray propagation format] is the only mechanism presently supported by AWS Lambda within SQS message headers if you want AWS Lambda to participate in the trace.

it seems that this makes the xray propagation format essentially required in at least this one AWS scenario even if you are not using the X-Ray backend

in which case, I wonder if it would make sense to have two propagation formats:

  • xray - for users of the X-Ray backend who want to prioritize the X-Ray vended spans
  • aws - very similar to xray, but for users of other monitoring backends who are using AWS Lambda + SQS, but do not want to prioritize the X-Ray vended spans

I don't understand this. Propagation format and backend used are orthogonal concerns. The propagator should not know about the backend in use, only the carrier provided to it. What would be the functional difference between the two propagators you propose? I only see a difference in intent, not effect.

To the extent that this PR modifies the AWS Lambda instrumentation guidance to ensure that the user's chosen propagation implementation can be used can we separate this discussion? Can you open a new issue detailing your concerns and the approach you would like us to consider? I think that discussion would likely be relevant to more than just Lambda instrumentation.

@trask
Copy link
Member

trask commented Aug 23, 2023

I have a specific scenario that would help if I could understand:

How should a Dynatrace user of the OpenTelemetry Java agent configure OTEL_PROPAGATORS to work across SQS + Lambda?

My understanding (which could definitely be wrong) from above, is that if they use the default OTEL_PROPAGATORS (w3c), it won't work because SQS + Lambda won't propagate the traceparent header.

And so the Dynatrace user should use OTEL_PROPAGATORS=xray.

But how can this one setting serve both users who want to parent the X-Ray vended span and those who don't?

@trask
Copy link
Member

trask commented Aug 23, 2023

basically I'm suggesting that it may help to have two differently named xray propagation mechanisms, one which prioritizes the X-Ray vended span, and one which doesn't. it seems like this could make everyone happy. it fits into the TC's proposal above of relying on propagators to solve this problem. I believe it makes other backend monitoring vendors happy who want a way to NOT parent the X-Ray vended span. and I believe it makes AWS X-Ray users happy who DO want to parent the X-Ray vended span.

Comment on lines +60 to +61
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name
"X-Amzn-Trace-Id".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically in the context of this PR, what I’m asking about is whether it would make sense to populate a different field name here (instead of overwriting the non-vended parent), so that the choice of which one to parent can be made by the configured propagator(s).

Suggested change
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name
"X-Amzn-Trace-Id".
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name
"X-Amzn-Vended-Trace-Id".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that is appropriate as it would require different propagation formats to be used for extraction and injection and would involve OTel creating a new propagation format and attributing it to AWS. I believe the appropriate thing to do is to further extend the user agency offered by this specification to require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case

this makes sense to me

I think it brings us back to the question of default behavior, which my thought is that the default behavior should be the one which doesn't parent the vended span, since the vended span is specific to the X-Ray backend

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's a question of default behavior at that point. I'm not aware of any SDK that configures the default global propagator to be or include the AWS X-Ray propagator outside of the ADOT Java agent. To the extent that most SDK implementations that set a default global propagator use a composite including tracecontext, baggage then the vended span would never be a parent. To the extent that a user has configured a different global propagator then we're out of the default behavior and into user-determined behavior and the user has the flexibility to choose the order they configure propagators in a composite propagator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the appropriate thing to do is to further extend the user agency offered by this specification to require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case.

I agree with extending user agency to pick which parent (or link) they want, but my understanding of the TC's "lean into propagators" meant that this would be better to do via propagators (and maybe propagator configuration) as opposed to via instrumentation configuration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're in agreement here. I might add only that instrumentation configuration should include propagator selection as there may be cases where specific instrumentation should be configured to use a different propagator, but that is orthogonal to how the instrumentation constructs the carrier to pass to whatever propagator it does use, which is the focus of this PR.

@Oberon00
Copy link
Member

Oberon00 commented Aug 24, 2023

My understanding (which could definitely be wrong) from above, is that if they use the default OTEL_PROPAGATORS (w3c), it won't work because SQS + Lambda won't propagate the traceparent header.

Not Dynatrace specific, but the Java AWS SDK instrumentation currently always sets the X-Ray system property to inject into SQS messages (since that one is for free), although there is also an option to additionally use the custom propagator. BTW, not only the property name but also which carrier you need to use is different for X-Ray vs everything else (message system vs message properties), so this would be really hard to represent in a generic propagator API (and personally I never thought it was a problem if instrumentations with library-specific propagation needs implement special handling directly)

If you want to use the OpenTelemetry Java agent with Dynatrace, you can just do it like with any other OTLP/HTTP backend, there are no special considerations (but also no explicit support for that). We do offer our own Lambda layers that come with an agent that is unrelated to the OTel Java agent, though it does share some basic technology and most concepts with core opentelemetry-java. These agents support W3C and Dynatrace-proprietary headers. Because the backend processing is reliant on some additional routing information (+ tenant-specific parent span ID) stored in the tracestate header, it cannot support the X-Ray propagation format (at least some AWS services are known to strip/normalize unknown parts out of the X-Ray string).

@jsuereth
Copy link
Contributor

I have a specific scenario that would help if I could understand:

How should a Dynatrace user of the OpenTelemetry Java agent configure OTEL_PROPAGATORS to work across SQS + Lambda?

My understanding (which could definitely be wrong) from above, is that if they use the default OTEL_PROPAGATORS (w3c), it won't work because SQS + Lambda won't propagate the traceparent header.

And so the Dynatrace user should use OTEL_PROPAGATORS=xray.

But how can this one setting serve both users who want to parent the X-Ray vended span and those who don't?

We're not suggesting either/or here. The idea is we lean into multiple propagation formats today. For Java, this would mean:

  • Dynatrace user would use OTEL_PROPAGATORS=w3c,xray
  • AWS x-ray user would use OTEL_PROPAGATORS=xray,w3c

Effectively, the order just determines the default parent when multiple propagation formats exist. So, for the lambda case where only xray propagation is used (e.g. in a message), then dynatrace would still see/parent the trace, albeit with missing spans. However, if the user had instrumentation where w3c headers propagate in the messages, they could parent directly if desired.

This does suggest that a Dynatrace user may lose completely the xray span if they select the w3c propagation as their default. However:

  • The ability to link vs. parent is something we'd like to solve consistently, across all instrumentation, not just for AWS.
    • This likely means improvements to the propagation specification as it stands.
    • The current recommendation from the Messaging WG is that each message SHOULD start a new trace and link against its source. This is at odds w/ what AWS is doing today, but is inline with what the Java instrumentation currently does.
  • We'd like to give users controls over nuances. While I expect the default in OTEL will be to abide by the messaging WG's definition, we should attempt to figure out how to also let users choose parent vs. link in message-passing triggering scenarios.

TL;DR; We think that discussion will take some time and deserves some prototypes before we move forward. However, moving forward on using propagators and config for the current issue should open some doors for us to proceed with that work.

@trask
Copy link
Member

trask commented Aug 24, 2023

OTEL_PROPAGATORS=w3c,xray

@jsuereth I thought @Aneurysm9 said above that traceparent header won't be propagated through SQS + Lambda, so how will this help in that scenario?

@jsuereth
Copy link
Contributor

@jsuereth I thought @Aneurysm9 said above that traceparent header won't be propagated through SQS + Lambda, so how will this help in that scenario?

I believe a user / instrumentation could encode w3c headers as message attributes (instead of message system attributes / x-ray) if desired. IIUC from the messaging OTEP, that's how we plan to support most messaging systems (encoding traceparent as an attribute in available locations).

@Aneurysm9
Copy link
Member Author

@jsuereth I thought @Aneurysm9 said above that traceparent header won't be propagated through SQS + Lambda, so how will this help in that scenario?

I believe a user / instrumentation could encode w3c headers as message attributes (instead of message system attributes / x-ray) if desired. IIUC from the messaging OTEP, that's how we plan to support most messaging systems (encoding traceparent as an attribute in available locations).

That is how I understand it as well. It is part of why I objected to requiring AWS SDK instrumentation to use the X-Ray propagator because I believed it was an instrumentation error to only rely on message system attributes when a context of any form could be carried through message attributes.

Don't confuse message attributes with message system attributes: Whereas you can use message attributes to attach custom metadata to Amazon SQS messages for your applications, you can use message system attributes to store metadata for other AWS services, such as AWS X-Ray.

More generally, to the extent that any system that requires X-Ray propagation to be used to have AWS services participate in a trace carries a user-defined payload then it is possible to carry context in any propagation format in that payload.

@trask
Copy link
Member

trask commented Aug 24, 2023

More generally, to the extent that any system that requires X-Ray propagation to be used to have AWS services participate in a trace carries a user-defined payload then it is possible to carry context in any propagation format in that payload.

I don't think carrying the context inside of the user-defined payload is a good solution for auto-instrumentation like the OTel Java agent(?)

EDIT or does the "user-defined payload" include generic headers/attributes, which would allow you to add the context there without affecting the downstream consumer code?

EDIT sorry, still processing, this may answer my question:

Don't confuse message attributes with message system attributes: Whereas you can use message attributes to attach custom metadata to Amazon SQS messages for your applications, you can use message system attributes to store metadata for other AWS services, such as AWS X-Ray.

so we could automatically attach custom metadata (i.e. traceparent) to SQS messages and then automatically retrieve that inside of Lambda execution?

@Aneurysm9
Copy link
Member Author

I don't think carrying the context inside of the user-defined payload is a good solution for auto-instrumentation like the OTel Java agent(?)

EDIT or does the "user-defined payload" include generic headers/attributes, which would allow you to add the context there without affecting the downstream consumer code?

EDIT sorry, still processing, this may answer my question:

Don't confuse message attributes with message system attributes: Whereas you can use message attributes to attach custom metadata to Amazon SQS messages for your applications, you can use message system attributes to store metadata for other AWS services, such as AWS X-Ray.

so we could automatically attach custom metadata (i.e. traceparent) to SQS messages and then automatically retrieve that inside of Lambda execution?

Yes, that is my understanding. From the AWS service perspective the message headers are part of the "user-defined payload" but from the user perspective they are separable from the message body in the same way HTTP headers are separable from a request body.

I think the one example that has been raised that may not fit neatly into that pattern is putting an object in an S3 bucket and having that action trigger a Lambda function invocation, such as through EventBridge. It's not clear to me that such architectures are amenable to a cleanly defined one-size-fits-all solution and may depend largely on the structure and content of the object stored.

@rapphil
Copy link
Contributor

rapphil commented Aug 24, 2023

This PR is a POC in Java that tries to implement the original proposal in this PR.

@joaopgrassi
Copy link
Member

joaopgrassi commented Aug 25, 2023

IIUC from the messaging OTEP, that's how we plan to support most messaging systems (encoding traceparent as an attribute in available locations).

EDIT or does the "user-defined payload" include generic headers/attributes, which would allow you to add the context there without affecting the downstream consumer code?

@trask Yes, within the messaging group we discussed this at length and the plan is to propagate the traceparent header on specific/specd locations, depending on the messaging protocol. Right now there are only drafts for such specifications, so the current conventions don't specify where instrumentations should attach them on purpose.

If you want to read more, here's the relevant portion from the OTEP: Standards for context propagation cc @pyohannes

@pyohannes
Copy link
Contributor

I believe a user / instrumentation could encode w3c headers as message attributes (instead of message system attributes / x-ray) if desired. IIUC from the messaging OTEP, that's how we plan to support most messaging systems (encoding traceparent as an attribute in available locations).

Yes exactly. The related wording already made it from an OTEP into the semantic conventions: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation. Note that the context we're referring to here isn't supposed to be changed by the intermediary/broker, but is intended to be used to correlate producer and consumer traces.

The working we chose is intentionally vague, in order to be applicable to messaging systems in general. More specific requirements could only be made on the level of messaging protocols (have a look at this section linked by @joaopgrassi above), but that's likely outside the scope of OTel, and more a W3C topic (that's where the current standardization drafts for Trace Context for AMQP and MQTT live).

If I understand correctly, some amount of confusion stems from the fact that in the AWS Lambda scenario, there can be two contexts to choose from for the consumer. In the messaging WG we were discussing the possibility of several layers of context: a message context layer tying together producers and consumers, and a transport context layer, tying together producers and intermediaries, and intermediaries and consumers. However, we decided to focus solely on the message context layer for an initial stable release of messaging semantic conventions.

@Oberon00
Copy link
Member

Oberon00 commented Aug 25, 2023

@jsuereth

We're not suggesting either/or here. The idea is we lean into multiple propagation formats today. For Java, this would mean:

  • Dynatrace user would use OTEL_PROPAGATORS=w3c,xray

I think this AWS x-ray thing is quite specific and best handled with instrumentation-specific options (of course these can be standardized cross-language). I would not use it as a motivation to add or change anything in the generic propagator API, unless you find actual real other non-AWS use cases that would benefit from this.

@Aneurysm9
Copy link
Member Author

I think this AWS x-ray thing is quite specific and best handled with instrumentation-specific options (of course these can be standardized cross-language). I would not use it as a motivation to add or change anything in the generic propagator API, unless you find actual real other non-AWS use cases that would benefit from this.

I'm not sure that I agree. I think there are use cases around migration from one context propagation format to another where both need to be accepted that could also benefit from enhancements to the specification regarding composite propagators. I do agree, however, that that is a separate discussion from what we're trying to address here. It would be useful, but is not necessary to reach a reasonable outcome.

@jsuereth
Copy link
Contributor

I think it's important to place a comment here as well:

OTEP 220 was approved and accepted into OpenTelemetry.

this means the default for message-passing in OpenTelemetry will be to link between producers and consumers. This is the direction we're moving and the acceptance of that PR represents consensus of this community and is not up for (re-)debate. Only the mechanism or how we migrate to this detail is up for debate.

@Aneurysm9
Copy link
Member Author

OTEP 220 was approved and accepted into OpenTelemetry.

this means the default for message-passing in OpenTelemetry will be to link between producers and consumers. This is the direction we're moving and the acceptance of that PR represents consensus of this community and is not up for (re-)debate. Only the mechanism or how we migrate to this detail is up for debate.

I can strike the paragraph regarding links from this PR or change if from MUST to SHOULD or MAY. I don't think it changes the core of this proposal. I included it as span linking appeared to be a significant concern in other proposals and wanted to ensure there was no doubt as to whether instrumentation adding a link to the span context carried in the lambda environment was acceptable.

@trask
Copy link
Member

trask commented Aug 29, 2023

I can strike the paragraph regarding links from this PR

this makes sense to me. this PR could then deal just with how to populate the carrier. this could help in the short-term because then different propagators can be configured to pick which parent you want. and it would still be helpful long-term once there is general spec support for composite propagators to fallback and extract span links if there's a parent has already been extracted.

…environment context

Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
@rapphil
Copy link
Contributor

rapphil commented Aug 30, 2023

I can strike the paragraph regarding links from this PR

POC updated to match the latest changes

`Context` will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can
disable AWS X-Ray for the function if the X-Ray Span Link is not desired.
If the `_X_AMZN_TRACE_ID` environment variable is set, instrumentation MUST ensure that the value from that
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to point out that as currently written, this document doesn't specify what should happen to X-Amzn-Trace-Id if it already exists in the carrier before the environment variable value is applied to it. I think that's an important behavior to define.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does define the behavior. It says that the instrumentation MUST ensure that value is in the carrier at that field name. The only way to do that it to set that value. Setting that value will override any value that may have existed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'm saying that should be more explicit. It also doesn't give any option to users who wish to disable this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting that value will override any value that may have existed.

what would be the downside for this to be added as a separate key in the carrier, e.g. x-amzn-env-trace-id or x-amzn-vended-trace-id (instead of overriding the existing carrier value), this way the propagator chain can have access to all of the different propagated trace contexts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting that value will override any value that may have existed.

what would be the downside for this to be added as a separate key in the carrier, e.g. x-amzn-env-trace-id or x-amzn-vended-trace-id (instead of overriding the existing carrier value), this way the propagator chain can have access to all of the different propagated trace contexts?

I believe that I explained why that is not appropriate here.

I have repeatedly said that I believe that there needs to be more user agency afforded with regard to the construction of the carrier used for context extraction. Doing so would make it irrelevant what any of us think about what users want because they would be able to do what they want without configuration options or overrides. I have avoided going that far in this PR as here I am trying to fix a broken spec in the smallest possible way so that instrumentation can start offering more flexibility to users in a way that leaves the door open for further changes in that direction. Can we all agree that this change does that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W3C can only propagate over http headers or sqs/sns message attributes. That's fine in simple architectures as long as users haven't added many of their own attributes. AWS imposes a limit of 10 message attributes which means w3c propagation can fail if the limit is already reached. X-Ray is not subject to this limit as it propagates over message system attributes.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#sqs-message-system-attributes

Attributes also don't allow propagation when using S3 events or SNS->SQS type architectures (among other scenarios involving aws managed services). Tracing in these scenarios require x-ray. This was also the main motivation for open-telemetry/opentelemetry-specification#3212.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attributes also don't allow propagation when using S3 events or SNS->SQS type architectures (among other scenarios involving aws managed services). Tracing in these scenarios require x-ray.

is there any AWS (or other) doc about this, I feel like this is an important point that I'd like to understand better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level - The answers to these questions should be sufficient for acceptance of this PR -

  1. Does this give users the ability to control parenting choice via Propagator config?
  2. Does any of the normative language prevent moving to linking as the default behavior for Producer/Consumer message passing? Does it limit updating Propagator specification to allow this?

IIUC the details correctly, the answers are:

  1. No. Users will not have the ability to configure whether ENV variable or attribute is parented in this solution. By fiat, the ENV will always win.
  2. Yes. The x-ray header for SQS is passed as an attribute. Because the ENV variable would override this, we'd no longer see the alternative parent in the carrier sent to Propgators.

(2) Is less important given we still need to prototype and define changes to Propagation, and we could find workaround, but (1) is concerning to me.

I like @trask's proposals to be more explicit here, possibly having two x-ray propagators one which uses ENV and the other which does not so users can prioritize between them if necessary, and then ideally this would work with an improved propagation framework.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jsuereth,

Now that you have context with this proposal, there is also this other proposal that tries to solve the problem with a different direction and that I believe has "yes" as answer for questions 1 and 2: #164

While I agree that @trask 's proposal would work, I would prefer to avoid it given that this propagator would be specific to lambda (i.e.: users would not be able to use this propagator elsewhere) and it would not fully comply with the propagator API, specifically:

If a value can not be parsed from the carrier, for a cross-cutting concern, the implementation MUST NOT throw an exception and MUST NOT store a new value in the Context, in order to preserve any previously existing valid value.

Moreover, what would be the behaviour of the inject operation of this propagator?

I would prefer that the lambda instrumentation prepare a valid carrier to be consumed by subsequent propagators.

@tylerbenson
Copy link
Member

@jsuereth I think this can be closed in favor of #354

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 15, 2024
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants