Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ASP.NET Core] GRPC instrumentation support #1726

Open
vishweshbankwar opened this issue Nov 29, 2023 · 26 comments
Open

[ASP.NET Core] GRPC instrumentation support #1726

vishweshbankwar opened this issue Nov 29, 2023 · 26 comments
Labels
bug Something isn't working comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore

Comments

@vishweshbankwar
Copy link
Member

vishweshbankwar commented Nov 29, 2023

Issue

open-telemetry/opentelemetry-dotnet#5097 removes support for grpc instrumentation. This is done to unblock stable release of instrumentation library which will contain support for http server instrumentation. Semantic conventions for grpc is still in experimental state and hence the grpc instrumentation cannot be marked stable. https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md

Workaround

Going, forward the grpc instrumentation will be supported under an experimental feature flag. https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Instrumentation.AspNetCore/CHANGELOG.md#:~:text=Re%2Dintroduced%20support,enabled%20by%20default.

@cijothomas
Copy link
Member

Will this be added back in the next non-stable release? Or as experimental feature in stable?

@vishweshbankwar
Copy link
Member Author

The current plan is to bring back in the next non-stable release (Post GA).

@cartermp
Copy link

cartermp commented Dec 3, 2023

Why not just wait that extra month? There's no rush to push out all the (massively) breaking changes in HTTP instrumentation. I know of zero end-users who even know about this, let alone those who only just read the blog post about HTTP semconv stabilizing in November. And that blog post also only has a tiny number of views despite us trying to get the word out that things were going to change.

@vishweshbankwar
Copy link
Member Author

@cartermp Users have an option to stay on the last released beta version which defaults to older conventions.

@cijothomas
Copy link
Member

Why not just wait that extra month?

What benefit does waiting a month gives? gRPC conventions won't still be stable in a month, right?

Maybe we should make an opt-in experimental feature to re-enable the gRPC part, and make it available in the first stable release itself?

@cartermp
Copy link

cartermp commented Dec 5, 2023

Sorry, I crossed some streams here -- yes, gRPC isn't going stable in a month.

But I guess I don't understand what's going on anymore. If the goal is to stabilize the aspnetcore package, then adding gRPC support back in later makes it not stable anymore? What about new users who want the latest bug fixes (if they exist in the future) for aspnetcore instrumentation, and also want to use gRPC?

@cartermp
Copy link

cartermp commented Dec 5, 2023

I'll also reiterate that I don't think end-users have any hope at understanding what's happening here

@cijothomas
Copy link
Member

cijothomas commented Dec 5, 2023

If the goal is to stabilize the aspnetcore package, then adding gRPC support back in later makes it not stable anymore?

gRPC support will be under experimental feature flag only. The package will be stable, and offer aspnetcore instrumentation by default. Users who want to get gRPC feature, can opt-in for it, and it can remain opt-in until semantic conventions stabilize.

I think the first stable release should contain the exprimental+opt-in capability to support grpc attributes.

What about new users who want the latest bug fixes (if they exist in the future) for aspnetcore instrumentation, and also want to use gRPC?

Update to the newer versions having the bug fixes. For gRPC, same as my above comment : users can keep the opt-in feature for that.

@cijothomas
Copy link
Member

I'll also reiterate that I don't think end-users have any hope at understanding what's happening here

Can't really see how comments like this are useful to anyone!
If you are trying to help, please help with specific feedback, that we can address.

@cartermp
Copy link

cartermp commented Dec 5, 2023

My specific feedback is just hold off on releasing for a moment. If the plan is to reintroduce gRPC instrumentation, why not just build what you're planning to build and release that? There's three points of churn here:

  • gRPC instrumentation being removed entirely
  • HTTP semantic conventions causing breaking changes downstream (these have the potential to be enormous in their own right)
  • gRPC instrumentation being reintroduced, but as experimental only

Why do three when you can do two?

@cijothomas
Copy link
Member

just hold off on releasing for a moment.

How would that help anyone? The 1st stable release of instrumentations are expected in under 2 weeks, so we should move forward and fix gRPC issue either by putting it under experimental feature flag in the stable release, or in the pre-release to be released on same time as stable.

My personal preference is to put gRPC support back under experimental feature flag, and release it as part of the 1st stable release.

@vishweshbankwar
Copy link
Member Author

just hold off on releasing for a moment.

How would that help anyone? The 1st stable release of instrumentations are expected in under 2 weeks, so we should move forward and fix gRPC issue either by putting it under experimental feature flag in the stable release, or in the pre-release to be released on same time as stable.

My personal preference is to put gRPC support back under experimental feature flag, and release it as part of the 1st stable release.

I agree with this approach if gRPC instrumentation is what's needed then we can introduce an experimental flag OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION (or something like this). Note that users would still have to opt-in for it as opposed to having the instrumentation on by default.

Regarding

HTTP semantic conventions causing breaking changes downstream (these have the potential to be enormous in their own right)

This is an expected churn and was communicated well in advance. We have followed the formal transition plan defined in semantic conventions. All the languages are expected to go through it.

@cijothomas
Copy link
Member

@vishweshbankwar Could you start/prepare a PR doing OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION, and bring it to todays SIG call? It'd be good to get it added and released as rc2 asap, and stick with stable release for dec mid.

@cijothomas
Copy link
Member

@vishweshbankwar Could you start/prepare a PR doing OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION, and bring it to todays SIG call? It'd be good to get it added and released as rc2 asap, and stick with stable release for dec mid.

@cartermp Assume the above goes through, does it mitigate some/most/all concerns?

@cartermp
Copy link

cartermp commented Dec 5, 2023

Yes, thanks! I think that does help alleviate some concerns.

@martinjt
Copy link
Member

martinjt commented Dec 5, 2023

This is an expected churn and was open-telemetry/opentelemetry-dotnet#4686 well in advance.

Who is that you're expecting to have seen that issue and catered for it? The thousands of people affected by the issue who are downloading the package from Nuget after watching .NET Conf or one of the numerous talks about OpenTelemetry?

@cijothomas
Copy link
Member

This is an expected churn and was #4686 well in advance.

Who is that you're expecting to have seen that issue and catered for it? The thousands of people affected by the issue who are downloading the package from Nuget after watching .NET Conf or one of the numerous talks about OpenTelemetry?

Who is promising that this package is stable, and there won't be breaking change? The nuget is marked pre-release, and the readme.md clearly states its non-stable status. If someone else is falsely advertising the package as stable, what can this repo do? Did .NET Conf announced that OTel Instrumentation libraries are stable? OR are there talks which are falsely announcing OTel Instrumentation as stable?

Its totally understandable that the users would be affected by the breaking changes to this package, but it is inevitable for any non-stable package. If you want to help, please suggest specific, actionable suggestion. Happy to hear more ideas on how to communicate breaking changes more effectively, if you believe people won't read docs or look at changelogs. A few ideas were discussed in SIG meeting today, but they were relatively minor/small changes, except clearly documenting website docs: https://opentelemetry.io/docs/instrumentation/net/libraries/ (that is a separate issue that the docs from this repo and website are not yet merged :( )

@cijothomas
Copy link
Member

Yes, thanks! I think that does help alleviate some concerns.

I think you are still seeing more concerns which are not addressed. How can we help bridge the gap?

Today's SIG meeting discussed some ideas to better communicate breaking changes from this repo, pasting below for easy reference. Please add your feedback to the list as well. (This is separate from the overall opentelemetry.io blogs about semantic conventions changes etc., and is focused on what OTel .NET group can do)

_Improve documentation and communication of experimental features
Add link to GitHub readme from NuGet readme e.g., https://www.nuget.org/packages/OpenTelemetry.Instrumentation.AspNetCore#readme-body-tab

Add warnings to opentelemetry.io like are found on readmes: e.g., https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#aspnet-core-instrumentation-for-opentelemetry-net

Update install package step in readme to install explicit version https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#step-1-install-package_

@martinjt
Copy link
Member

martinjt commented Dec 6, 2023

Who is promising that this package is stable, and there won't be breaking change? The nuget is marked pre-release, and the readme.md clearly states its non-stable status.

I think you're fundamentally misunderstanding users here. When something is announced, and demoed at a major Microsoft conference, no amount of "It's in the README on Github" makes it visible to users that you could break their telemetry at any point.

As I said, this is IMPLICIT, at no point during the Demos, or in the docs, does it say that we will remove functionality at any time, as we choose, and if you want to know about that you'll need to be reading the Github issues constantly. Unless I missed that part of the talks that were on .NET Conf?

Maybe look here? https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-configuration?tabs=aspnetcore (which is Azure Monitor specific, but I don't think anyone is naïve enough to suggest that Azure docs aren't used by users to understand Microsoft support).

You seem to be suggesting that noone should be using this library in production, and using the fact that it isn't "Stable" as a catchall for breaking things. That isn't OK. If we don't want anyone using this in production, noone should be demoing it in talks, let alone key Microsoft people in a flagship conference, atleast with explicit caveats that it will break, and that they need to watch the github issues to stay up to date. I get why it wasn't mentioned, because it's not a good look for Microsoft to be bigging it up like that, when it's not supported.

I'm not a maintainer here, I'm the voice of the hundreds, if not thousands, of users of these libraries in production today that I speak to every day. I can help with messaging, I can help you understand the frustrations of users. I can't help if your answer is that this is not stable therefore we can break it any time. If the answer is that anything, at any time, can break, we need more explicit and loud warnings from Microsoft who are demoing this. Perhaps make every method as Experimental in the all the libraries, since that makes it explicit, and forces people to add flags to every library (we could actually then release everything as stable).

To be clear, in other languages, there is no concept of "prerelease" in the way that nuget does it. Therefore upgrades of libraries in the 0.x range can be scoped to minor/hotfix only, which we can't do in .NET. That's the reason this is a more specific problem for us.

@cartermp
Copy link

cartermp commented Dec 6, 2023

I think you are still seeing more concerns which are not addressed. How can we help bridge the gap?

@cijothomas I do, but it's not specific to this repo / it's not specifically a .NET SIG problem. I do think that bringing the gRPC instrumentation back in (behind a flag) at least does reduce some of the coming churn -- so thank you for doing that.

The HTTP semconv work is massively breaking for a large portion of our customers in several ways and represents an enormous amount of work for them (and us) to migrate. For some customers this is on the order of needing to update thousands of alerts, SLOs, and dashboards -- most people don't use automation like terraform to set these things up, and in several cases this can't even follow an automated process such as terraform due to things like a CONTAINS operator on http.target, which is split rather than name-changed.

And so there's several bad things that can quite easily happen here:

  • Sampling rules can subtly fail, causing someone's ingress to an o11y vendor to explode in size, affecting their bill, forcing rate limiting and rejected traffic, and/or causing an incident due to the sheer volume of traffic (we have several customers in this category)
  • Someone can update packages, notice nothing breaks (in their app), and then in the future there's an incident, and they find that the alert they'd set up never fired because it expected an old value
  • A team of hundreds of devs updates packages at random, with no central O11y team to manage telemetry wrappers for them (few orgs have this discipline), causing parts of the O11y backend to stop working, but not all, and the time between library update to detection can be long enough such that nobody can understand how it happened

We've done what we can to tell customers what's coming with the information we had, as I'm sure other vendors have, but the reality with enterprise software is that most people don't listen unless (a) something is horribly broken, or (b) they need to renew a contract. And while the OTel community has attempted to communicate things a bit, the reality is that it's been unactionable for end-users all year, there's only been two blog posts about this topic (each with only a few hundred views):

We have also struggled to understand the full impact and when it lands across all of OTel. The work was open, yes, but not terribly discoverable. It's been hard to also audit all language contrib repos and instrumentation libraries to understand what's affected and when we can reasonably expect to see breaking changes land.

@cijothomas
Copy link
Member

@cartermp

Thank you for sharing your perspective, which aligns well with my understanding of the issue. If users are dependent on specific telemetry attributes, such as the attribute names etc. in their dashboards and alerts, migration becomes an inevitable and necessary step. Although it's a painful and costly process, it's something we must accept. Semantic Conventions Working Group did offer a migration plan in place to smoothen the process. (Given conventions were experimental, they could have simply made breaking changes, but they still took the effort to offer migration plans, which is really appreciated.)
This plan suggested that instrumentations should include a feature flag to opt-in to new, old, or both conventions. After a grace period, the default would be set to follow the new convention. The instrumentations in this repository did adhere to this guideline. Of course, there are other instrumentation libraries too, and it's possible that not all of them have followed these guidelines.

Will do our best to help further. There is already an issue discussing some options, hopefully to further smoothen the process.

Given the challenging nature of preparing a perfect specification or convention on the first attempt, it's inevitable that OpenTelemetry components undergo the cycle of alpha/beta/stable/deprecated. Users who take the risk of using components before they reach stable status, and provide feedback, are making significant contributions to the greater good. If nobody used the components until they were stable, we would never receive any feedback, nor would we be able to make the necessary improvements to reach stability. It's essential to find the right balance between introducing new capabilities and being mindful of not disrupting existing users. Happy to hear more thoughts on improving the current process in this repo (and others).

@cijothomas
Copy link
Member

I think you're fundamentally misunderstanding users here. When something is announced, and demoed at a major Microsoft conference, no amount of "It's in the README on Github" makes it visible to users that you could break their telemetry at any point.

I can help you understand the frustrations of users.

I'd like to emphasize that we all share the commitment to understanding and prioritizing OpenTelemetry users' needs. It's a team effort, and each perspective is valuable. If you have any specific feedback or insights regarding this repository, please do share them.

we need more explicit and loud warnings from Microsoft who are demoing this.

As I mentioned earlier, if Microsoft (or any company) falsely promised that OTel semantic conventions are stable, when they were not, then I don't think this repo can do anything about it. I do see such warnings already in their docs. If you have other feedback on Azure/Microsoft docs, please report it at their channels.

You seem to be suggesting that noone should be using this library in production

I never said anything like that. The component's status is mentioned in https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#aspnet-core-instrumentation-for-opentelemetry-net. It says breaking changes are possible, and if users are willing to accept that, then nothing wrong with using in production!

@pyohannes
Copy link

If the answer is that anything, at any time, can break, we need more explicit and loud warnings from Microsoft who are demoing this.

I'll also have to deal with frustration related to this breaking change, however, it's inevitable. For me, a main point is that after this change nothing will break (without a main version increase), so I'm actually looking forward to this happening. Also, I think focusing on what was demoed on .NET Conf is a bit of a red herring, as the main problem lies with users who are using this packages for a longer time and rely on monitoring and alerting workflows that were built around it.

The one thing I'd suggest to make users aware that they'll get breaking changes is to change the main version number, and maybe release a 2.0 of those libraries. Although it's not strictly necessary according to versioning guidelines, it might help to warn users that something major is changing. I'm not sure though about all the implications, likely one would also need to increment the SDK version to keep things in sync.

@cijothomas
Copy link
Member

The one thing I'd suggest to make users aware that they'll get breaking changes is to change the main version number, and maybe release a 2.0 of those libraries

Yes! We should be following this guidance, if we ever make breaking changes post the stable release.
https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/VERSIONING.md#versioning-details

@cijothomas
Copy link
Member

@pyohannes On reading again.. were you referring to releasing the 1st stable version of instrumentations as 2.0, instead of the current plan of 1.6.0?
That'd look awkward, especially since we have decided some time ago to synchronize version of the instrumentations with the api/sdk, and making api/sdk to 2.0 will be confusing for users as they have no breaking change warranting major version bump!

@pyohannes
Copy link

That'd look awkward, especially since we have decided some time ago to synchronize version of the instrumentations with the api/sdk, and making api/sdk to 2.0 will be confusing for users as they have no breaking change warranting major version bump!

As this is the first stable release, going with 2.0 isn't a hard requirement for me, I just wanted to put out a proposal to possibly mitigate issues brought up here.

However, the logic of strict coupling versions of the SDK and of instrumentations seems a bit problematic, as the question will arise again once there are breaking changes in a stable instrumentation library that don't affect the SDK (hopefully not anytime soon) or vice versa. You'll be forced to either have an awkward main version increase without compatibility break, or you'll have decoupled versions.

@reyang reyang transferred this issue from open-telemetry/opentelemetry-dotnet May 13, 2024
@reyang reyang added the comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore label May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore
Projects
None yet
Development

No branches or pull requests

6 participants