-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(instrumentation-aws-lambda): Changed capturing of X-Ray context as span link #1411
feat(instrumentation-aws-lambda): Changed capturing of X-Ray context as span link #1411
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1411 +/- ##
==========================================
- Coverage 96.06% 95.14% -0.93%
==========================================
Files 14 17 +3
Lines 914 1173 +259
Branches 199 244 +45
==========================================
+ Hits 878 1116 +238
- Misses 36 57 +21
|
@willarmiros Would you be able to review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few concerns about the nature of this change, since it's the first I'm aware of it. I will also discuss w/ @Aneurysm9 offline.
@@ -50,8 +50,7 @@ In your Lambda function configuration, add or update the `NODE_OPTIONS` environm | |||
| --- | --- | --- | | |||
| `requestHook` | `RequestHook` (function) | Hook for adding custom attributes before lambda starts handling the request. Receives params: `span, { event, context }` | | |||
| `responseHook` | `ResponseHook` (function) | Hook for adding custom attributes before lambda returns the response. Receives params: `span, { err?, res? }` | | |||
| `disableAwsContextPropagation` | `boolean` | By default, this instrumentation will try to read the context from the `_X_AMZN_TRACE_ID` environment variable set by Lambda, set this to `true` to disable this behavior | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't removing this a breaking change?
return ROOT_CONTEXT; | ||
} | ||
|
||
private static _determineLinks(): Link[] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that changing this behavior also being a breaking change. Will X-Ray customers who once saw their Lambda span as a direct child of the incoming context now only see it linked to the incoming context? Or will it still have a parent-child relationship (as before) but now ALSO have a link? I suppose this should have been discussed in the spec change process, but I wasn't aware of it :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willarmiros to be clear since you are the component owner, are you asking that we block this change? Since this comes from the spec, I would lean on the side of accepting it. Also, since this component is 0.x
version it should be expected that breaking changes are possible.
@willarmiros This change was introduced in this spec change and discussed in this issue. @Aneurysm9 was part of these discussions, and I think in favor of making the change. |
Thanks for the context - I don't want to block this change as it is indeed implementing the spec change. Can you explain though how customers can configure their Lambda functions/instrumentation to maintain the old experience (of reading from |
Maintaining the old behavior is no longer in the spec, and keeping it as an option has not been considered at this time (as far as I know). If you feel strongly that it should, I think we should update the spec first so that it is consistent across different implementations (there is a similar update for Python in progress). |
Makes sense @martinkuba, we may pursue updating the spec in the future. For now I am just trying to fully understand the change in behavior as it stands today. So the old default behavior was:
Now, after this change, the behavior will be:
Is that correct? |
There is still the |
@willarmiros your summary of the expected before and after behavior looks correct to me. |
Hi @carolabadeer, you are listed as the maintainer for this component. Can you please take a look at this change? It's been open for a long time, and we need help moving it forward. |
@Aneurysm9 any remaining comments? We'd like to move forward on this |
There needs to be a dead-simple mechanism for the user to get the same behavior they currently have after this change is made. This change in the Java instrumentation has caused customer pain and churn that we can and should be able to avoid. @bryan-aguilar and @rapphil can provide more details there. |
Hey @open-telemetry/javascript-approvers I suggest we go ahead and merge the current PR as this implements an experimental section of the Specification (there's already an issue regarding tracking the changes requested to make AWS users lives easier): open-telemetry/opentelemetry-lambda#714 |
Any reason this is not merged yet? @martinkuba |
@carlosalberto I believe |
Reiterating @Aneurysm9's concerns above:
Since this is a breaking change, it is important to have a configuration mechanism for choosing between keeping the current behaviour and using the new behaviour with span links. |
Breaking changes aren’t applicable when a spec or component is marked as experimental. Additionally, the technical committee has made the decision to move forward with this change. The community and sig welcome any proposals and example implementations to improve this behavior for users going forward. |
That is not in line with the guidance we have received regarding recent and planned changes to "experimental" semantic conventions that have been de facto stable. We have to acknowledge that we have shipped functionality based on these conventions for a long time and users have established an understanding that they are experimental, but relatively stable. A change like this without any means for the user to recover the prior behavior is significantly destabilizing and worthy of more critical thought than "it's experimental, so break whatever whenever".
No, they did not as far as I can tell. From their meeting notes on 2023-07-12:
The TC have explicitly deferred to the repository maintainers. How about we let them speak for themselves here. |
The reason it's not yet merged is because it does not seem to me that the code owners are satisfied with the change. quoting from: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/CONTRIBUTING.md#component-ownership
In this case, the component owners have requested that the original behavior be accessible behind a configuration flag because the change introduces breaking behavior. I believe that to be a reasonable request. There is an explanation of the breaking change here: open-telemetry/opentelemetry-lambda#714 (issue refers to java specifically, but it is the same change). There is an ongoing effort to introduce a long term fix for this issue here: open-telemetry/semantic-conventions#164. I had been holding off on merging this in the hopes that the PR author and the code owners could come to a resolution on their own, or that the question would be resolved in the specification which is still under active development. If we push through a change the code owners aren't happy with it calls into question the reasoning for even having code owners and may erode the trust between the maintainers and the code owners. Since the specification itself is still experimental, I don't believe the maintainers should necessarily force it through. If the spec was stable I might feel differently.
I'm not sure I agree with the conclusion here. I'm hesitant to merge a change which follows an experimental spec that is known to have issues that may require more changes in the future, particularly when the code owners have specifically raised concerns. My recommendation is to follow @Aneurysm9's original request and make the old behavior accessible by configuration. It could be argued whether the old or new behavior should be the default and I see merit in both choices there. It is my hope that the code owners and PR authors can resolve that question between themselves without maintainers needing to step in. |
@carlosalberto since you commented on this and you're a TC member I want to make sure you're OK with the above response. I assume a TC member only comments on a PR for a contrib repo when there is a specific reason to do so. |
Hey @carolabadeer
Please join the FaaS group to potentially discuss this. For the time being, as the related section doesn't include such flag, we should merge this change (as Java already did). |
Are you speaking on behalf of the TC? That is not the understanding I have from the notes of the TC discussion. The TC has said this should be handled by the maintainers. The maintainers have deferred to the component owners. The component owners are asking for this change. Why should we merge this PR without the approval of the code owners or maintainers? |
Absolutely. There's also the chance that AWS could keep a fork of this instrumentation, and do the proper massaging till they support Links (hopefully soon? they will be extensively used in Messaging). |
This is definitely possible but I think it's not ideal. I suspect it is what broken users are doing in Java and Python. I would expect it to work in most cases, but dependency management is tricky enough in JS as it is. |
We are not asking to return to the status quo ante. We are asking for an interim compromise that retains the option for users to choose which behavior they need.
This is not about supporting or not supporting links. It is about breaking trace context and not being able to represent a complete trace at all. Links are needed for users who are not using X-Ray so that they can connect to the spans that are vended by the Lambda service. Users who are using X-Ray need to have the correct parent segments so that they have a complete trace, including the spans that are vended by the lambda service. The breakage this introduces was clearly documented on open-telemetry/opentelemetry-lambda#714. |
Sorry for multiple comments in a row. New comments are coming in while I'm authoring replies to previous ones.
It's not a JS specific issue. I'm also surprised it hasn't been resolved in the spec and I've not been involved there so I don't have any context to speak to it. I'm just trying not to break existing JS users. The reason I'm focused on JS and not other languages is because I'm responsible for JS and not those other languages.
I really don't want to get to the point where we have to tell someone to fork the instrumentation. To me that is the worst case scenario that I hope we can avoid. |
To make sure we're all on the same page, here's what the specification currently says:
Let's break that down:
This says that instrumentation
Again, the language is
Here we encounter the only As it stands, the implementation is compliant with the specification, but presents issues for users who do not use AWS X-Ray. As-is this PR would be compliant with the specification, but would present issues for users who do use AWS X-Ray. If the author implements the change requested by both the code owner and maintainers, the instrumentation would still be compliant with the specification and would present options for all users to achieve their desired outcomes. @martinkuba as the author of this PR can you please implement the change that has been requested by the code owners and maintainers so that we can resolve this? |
I'd actually advise to hold off until the conversation is resolved and we are sure everyone is on the same page. |
Shall we all come to FaaS SIG meeting next Tuesday (Aug 15) and try to reach some consensus ? These spanlink prs are open for a long time (half year). Repo of different languages are in inconsistent state because of this. I am new to the community, just wondering, is there any sort of voting mechanism ? When we are in stalemate, we can still make decision and move forward ? |
This can be added to the agenda for the next SIG meeting, but attendance and participation at a SIG meeting cannot be the only way to make progress on this, or any, issue. Given that there is no spec compliance issue WRT the current state of this instrumentation or any of the proposed states discussed on this PR, I don't think it is necessary to discuss anything in the FaaS SIG to progress this PR. We can certainly discuss proposed changes to the spec, but the FaaS SIG's spec discussions are not what is holding this PR up at this time. |
I was already planning to come to this meeting. In my opinion they should be solving this there and we shouldn't be having these arguments in implementation SIG contrib repos.
No, there is no voting mechanism. Instead, there is a hierarchy with TC/GC at the top, maintainers beneath them, approvers beneath them, and contributors (and component owners in there somewhere around approvers). In this case the maintainer (me) has simply not decided to exercise their power yet. If I do decide to do so, I can settle the argument as long as my decision complies with the policies of the project (maintained by GC) and the specification (maintained by TC). If the parties involved are unhappy with the decision they could then escalate to the TC. Technically, the GC sits above the TC but in practice they are treated as equal. We are not in a stalemate in my opinion, because arguments and new ideas for solutions are still being formulated. If we reach a point where no new information is forthcoming and no decision has been made, then we will be in a stalemate. This is a natural part of the open source development process and it occasionally has hiccups that need to be addressed. If there is a true stalemate I will settle it as maintainer and we'll go from there. |
I'm also planning to bring this up in the maintainers meeting. I'd like to hear what the other maintainers think of how this situation should be handled or could have been handled better. I'm also not entirely sure the @open-telemetry/python-maintainers are aware that they're about to release a breaking change to their users, or that the @open-telemetry/java-instrumentation-maintainers are aware they have already done so (they might be aware but I just want to make sure). |
It may not be the only way but I believe it will be the fastest way.
Technically this is correct but it does hinge on deciding not to fulfill a In any case, I do not believe this is the correct place to have discussion about if the spec is or is not correct. The only thing clear to me is that both instrumentation versions "in the wild" are broken in one way or another regardless of spec compliance. I would much rather see this settled by the FaaS SIG. edit:
To be clear, I don't know if it will be resolved in the meeting, but I hope to impress on them that they need to address this and that they can't expect language sig maintainers to arbitrate these things without the context of the last 5 months of conversations. |
fwiw, we (Java) released this breaking change 4 months ago and haven't heard from anyone about it (cc @rapphil to keep me honest). we would be happy to accept a PR to opt-in to the prior behavior if the new behavior is causing problems for some users. |
btw I agree with @dyladan that this discussion should ideally be happening within the scope of the FAAS SIG, and they should ideally update the FAAS semconv to reflect the outcome of their discussion (and if the FAAS SIG is deadlocked then they can escalate to the TC) |
FYI we intend to create PRs to provide flags that will allow users to opt-in into the previous behaviour. The first pr is out for Python: open-telemetry/opentelemetry-python-contrib#1909 Next we intend to do this for Js and finally for Java. |
Is this in the spec or is it just a reaction to this conversation? |
The proposed behavior is compliant with the current language in the spec and achieves the intended goal of the spec change (as I understood it) of allowing users of all trace backends and context propagation methods to effectively utilize this instrumentation. |
I was asking if there have been spec changes specifically to address this. "Technically allowed" is different than "recommended". |
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
Has there been any move to merge this? |
@ithompson-gp you can see a new proposal for how we intend to move forward here: Feel free to ping me on slack or join a SIG meeting if you have questions or concerns. |
Since this open-telemetry/semantic-conventions#354 has been merged, what is the status of the changing this ( Wondering of what mechanism will be put in place to properly respect the fact we - the user - would like to have the config option available to set Opening out the current implementation to make configurable Any update on this @tylerbenson @martinkuba ? |
@ithompson-gp The JS work to implement the latest spec is tracked here. I think there is a potential follow-up to this for handling links when there are multiple contexts available. This would need to be added to the spec first, and I am not planning on working on that at the moment. If you would like to discuss further, I suggest attending the FaaS SIG weekly meeting. |
Closing this since adding a link is no longer in the spec. |
Which problem is this PR solving?
The AWS Lambda spec has been updated with respect to handling the
_X_AMZN_TRACE_ID
env variable. Instead of using its context as a parent span, it now says that a link should be created instead. See this spec issue for more details.Short description of the changes
A span link will now be created for a sampled context in the X-Ray env variable. I have also removed the
disableAwsContextPropagation
configuration since it is not needed anymore.