Consolidate all exposed Prometheus Metrics in KEDA Operator #3919

Closed
zroubalik opened this issue Nov 29, 2022 · 33 comments · Fixed by #3861
Labels
feature-request, needs-discussion

Comments

@zroubalik
Member

zroubalik commented Nov 29, 2022

Proposal

Currently we are exposing Prometheus metrics on scaling and errors in the Metrics Adapter:
https://keda.sh/docs/2.8/operate/prometheus/#metrics-adapter

We also expose some metrics in the Operator, and there is a new feature to expose additional metrics in the Operator:
#3663

Old situation

### Metrics Server
keda_metrics_adapter_scaler_error_totals 
keda_metrics_adapter_scaled_object_error_totals
keda_metrics_adapter_scaler_errors
keda_metrics_adapter_scaler_metrics_value

### Operator
operator-sdk related metrics
# newly introduced:
keda_operator_resource_totals 
keda_operator_trigger_totals

New situation

The Operator is now the main source of metrics, keeping the Metrics Server metrics as deprecated for some time, until #3930 is resolved.
Some metric names were renamed.

### Operator
keda_scaler_error_totals 
keda_scaled_object_error_totals
keda_scaler_errors
keda_scaler_metrics_value
keda_resource_totals 
keda_trigger_totals

operator-sdk related metrics


### Metrics Server -- DEPRECATED!
keda_metrics_adapter_scaler_error_totals 
keda_metrics_adapter_scaled_object_error_totals
keda_metrics_adapter_scaler_errors
keda_metrics_adapter_scaler_metrics_value

This will help us in the future with multi-tenant story etc.
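
For illustration, here is a minimal sketch (not KEDA's actual implementation) of how one of the renamed metrics could be registered on the Operator's existing metrics endpoint, assuming prometheus/client_golang and controller-runtime; the label names and the helper are illustrative assumptions:

```go
package operatormetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// scalerMetricsValue reports the latest value each scaler produced, exposed from
// the Operator as keda_scaler_metrics_value (instead of the adapter-prefixed
// keda_metrics_adapter_scaler_metrics_value). Label names are assumptions.
var scalerMetricsValue = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "keda",
		Name:      "scaler_metrics_value",
		Help:      "Metric value used by the HPA",
	},
	[]string{"namespace", "scaledObject", "scaler", "metric"},
)

func init() {
	// controller-runtime already serves this registry on the Operator's metrics
	// port, so registering here is enough to expose the new series.
	ctrlmetrics.Registry.MustRegister(scalerMetricsValue)
}

// RecordScalerMetric is a hypothetical helper the scaling loop could call.
func RecordScalerMetric(namespace, scaledObject, scaler, metric string, value float64) {
	scalerMetricsValue.WithLabelValues(namespace, scaledObject, scaler, metric).Set(value)
}
```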

Related:
#3930

@zroubalik
Member Author

@kedacore/keda-maintainers @v-shenoy PTAL^

@v-shenoy
Contributor

v-shenoy commented Nov 29, 2022

I agree with this proposal. Having them all in one place makes sense, and the operator feels like the right place.

Considering the next release is going to be a breaking one anyway with the HPA API versions, might as well get any other breaking changes done along with it.

@tomkerkhove
Member

> This will be a breaking change, but I think it is beneficial and we should do this change now. It will help us in the future with multi-tenant story etc.

Sure but it's breaking :) So we can add them on the operator with the new names, but we cannot remove them on the metric server.

@tomkerkhove
Member

> Considering the next release is going to be a breaking one anyway with the HPA API versions, might as well get any other breaking changes done along with it.

I tend to disagree because this is not pushed on us by Kubernetes

@v-shenoy
Contributor

v-shenoy commented Nov 29, 2022

> > This will be a breaking change, but I think it is beneficial and we should do this change now. It will help us in the future with multi-tenant story etc.
>
> Sure but it's breaking :) So we can add them on the operator with the new names, but we cannot remove them on the metric server.

Wouldn't renaming the ones on the operator (except the ones not released) also be breaking? Also, this is obviously a maintainer's call on what breaking changes should / shouldn't be done. So, I leave that to all of you.

@tomkerkhove
Member

tomkerkhove commented Nov 29, 2022

Based on this:

> My suggestion is to move all these metrics to Operator and slightly rename them:

It looks like metrics in a new place so we can do anything we want :) But the ones we have today must be kept around.

@zroubalik
Member Author

Honestly, it would be hard to keep those on the Metrics Server in a sensible way, especially if we try to tackle the semi-multitenancy.

@v-shenoy
Contributor

v-shenoy commented Nov 29, 2022

> Based on this:
>
> > My suggestion is to move all these metrics to Operator and slightly rename them:
>
> It looks like metrics in a new place so we can do anything we want :) But the ones we have today must be kept around.

Right, also I forgot the operator currently doesn't expose anything other than the default OperatorSDK stuff.

@zroubalik
Member Author

But this #3861 is changing the way we are getting metrics (and exposing them as Prom metrics). It is no longer happening on Metrics Server, it is on Operator.

@JorTurFer
Member

The problem is that we are currently going to break the metrics server metrics, because we are moving the logic that generates them (the metric querying) to the operator. I think that we could reduce the impact if we add the scraping of the operator as part of the Helm chart, but the change is there and we need it.

@zroubalik
Member Author

IMHO we are not doing a breaking change per se. We are still exposing the metrics, but only from a different endpoint with a slightly different name. I am okay with doing this, if this change is properly documented.

@v-shenoy
Contributor

> But this #3861 is changing the way we are getting metrics (and exposing them as Prom metrics). It is no longer happening on Metrics Server, it is on Operator.

So, if I understand this right, basically the operator will be doing all of the heavy lifting going forward. Not just reconciling the CRDs, but also fetching the metrics for the HPA, and caching them. And the metrics server will just query the operator?

@JorTurFer
Member

I think this release, as it is already "disruptive" due to the support removal, is a good release to do it in, if we document it clearly.

@zroubalik
Member Author

zroubalik commented Nov 29, 2022

> > But this #3861 is changing the way we are getting metrics (and exposing them as Prom metrics). It is no longer happening on Metrics Server, it is on Operator.
>
> So, if I understand this right, basically the operator will be doing all of the heavy lifting going forward. Not just reconciling the CRDs, but also fetching the metrics for the HPA, and caching them. And the metrics server will just query the operator?

Yes, we are also reducing the number of opened connections from KEDA.
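
For context, a rough, hypothetical sketch of the split being described; the types and names below are illustrative, not the actual API introduced by #3861. The Operator owns the scaler connections and a cache of the latest values, and the metrics server only queries the Operator when the HPA asks for an external metric:

```go
package sketch

import (
	"context"
	"fmt"
	"sync"
)

// OperatorMetricsCache lives in the Operator: the scaling loop queries the
// scalers, stores the latest values here, and serves them to the metrics server.
type OperatorMetricsCache struct {
	mu     sync.RWMutex
	values map[string]float64 // key: "<namespace>/<scaledObject>/<metric>"
}

// Set is called by the Operator's scaling loop after it polls a scaler.
func (c *OperatorMetricsCache) Set(namespace, scaledObject, metric string, v float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.values == nil {
		c.values = map[string]float64{}
	}
	c.values[fmt.Sprintf("%s/%s/%s", namespace, scaledObject, metric)] = v
}

// GetMetric is what the metrics server would call (instead of opening its own
// connections to every scaler) when the HPA requests an external metric.
func (c *OperatorMetricsCache) GetMetric(ctx context.Context, namespace, scaledObject, metric string) (float64, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.values[fmt.Sprintf("%s/%s/%s", namespace, scaledObject, metric)]
	if !ok {
		return 0, fmt.Errorf("no cached value for %s/%s/%s", namespace, scaledObject, metric)
	}
	return v, nil
}
```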

@tomkerkhove
Member

tomkerkhove commented Nov 29, 2022

> I think this release, as it is already "disruptive" due to the support removal, is a good release to do it in, if we document it clearly.

That is a different kind of breaking change, given that one is because of Kubernetes. Moving things to another endpoint and renaming them is going to break KEDA's operational story, which is bad.

This is also not in line with the deprecation policy that we have discussed on kedacore/governance#70.

Active KEDA users will have to manually change to the new endpoint and use the new endpoints which means this is a breaking change.

@JorTurFer
Member

JorTurFer commented Nov 29, 2022

> Active KEDA users will have to manually change to the new endpoint and use the new endpoints which means this is a breaking change.

We could cover this by updating the chart, reducing the impact as much as possible. I think that Helm is the most used tool for deploying KEDA, so if we make the needed changes there, we can make this change transparent to end users.

@tomkerkhove
Member

And yet, not everyone uses Helm so it's a breaking change.

@zroubalik
Member Author

Well, technically the deprecation policy is not yet in effect :)

I really think this change is in users' favour; consolidating the metrics in one place is a good thing, and really, if we want to have some sort of multitenancy, we would have to do this change anyway. So my call is to do the change now, together with the other breaking changes that we are introducing in 2.9.

@JorTurFer
Member

JorTurFer commented Nov 29, 2022

Technically, we are not introducing breaking changes in the operator, as those metrics aren't released yet; they are only in main. WDYT if we add an endpoint in the metrics server that requests the metrics from the operator?
I know this won't scale and won't support multitenancy, but in that case we can enforce the change, as this metrics-server endpoint would be deprecated in this version and deprecated features don't receive updates.
I also know that it's ugly from a design POV, but this could avoid the breaking change, allowing us to make the move...

@JorTurFer
Member

WRT metric names, we can expose both in the operator, the old and new metrics, deprecating the old metric names.
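
As a minimal sketch of that idea (again assuming prometheus/client_golang; everything except the two metric names is an illustrative assumption), the Operator could register both series and increment them together, so the old name keeps reporting identical values until it is removed:

```go
package operatormetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// New name, exposed by the Operator going forward: keda_scaler_errors.
	scalerErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "keda", Name: "scaler_errors",
			Help: "Number of scaler errors"},
		[]string{"namespace", "scaledObject", "scaler", "metric"},
	)
	// Old name, kept only for the deprecation window:
	// keda_metrics_adapter_scaler_errors.
	scalerErrorsDeprecated = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "keda_metrics_adapter", Name: "scaler_errors",
			Help: "DEPRECATED: use keda_scaler_errors instead"},
		[]string{"namespace", "scaledObject", "scaler", "metric"},
	)
)

func init() {
	prometheus.MustRegister(scalerErrors, scalerErrorsDeprecated)
}

// recordScalerError keeps both series in sync while the old name is deprecated.
func recordScalerError(namespace, scaledObject, scaler, metric string) {
	scalerErrors.WithLabelValues(namespace, scaledObject, scaler, metric).Inc()
	scalerErrorsDeprecated.WithLabelValues(namespace, scaledObject, scaler, metric).Inc()
}
```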

@zroubalik
Member Author

Are you up for implementing this? :) I was thinking about this originally, but dropped the idea because it felt like a messy solution.

@zroubalik
Member Author

And it will block us from introducing the multi-tenant solution until we fully drop the metrics from the metrics server. According to the governance policy in preparation, that would mean a year.

@JorTurFer
Member

JorTurFer commented Nov 29, 2022

Yes and no. We need to keep those metrics available for a year, true. But as a deprecated feature, we could say:
Do you want multitenancy? Metrics exposed by the metrics server are deprecated and won't support it; migrate to the new ones to have them working with multitenancy.

So basically, we would only expose metrics from the same namespace as the metrics server from the metrics server endpoint.

@JorTurFer
Member

JorTurFer commented Nov 29, 2022

I'm not proposing to have a multi-tenant solution exposing the metrics from the metrics server at all. My proposal is just: go ahead with the new metrics approach and "proxy" the old metric endpoint to the operator in the same namespace, without multi-tenant support.

@tomkerkhove
Member

For me this is fairly clear on what we should do:

  1. Introduce the approach suggested above by @zroubalik as new "ideal" solution
  2. Introduce feature flag on metric server to still serve our current metrics (*)
  3. Ship with feature flag on by default but officially deprecating it
  4. Next KEDA v2.11 has the flag turned off by default

Existing end-users are not broken but are aware things are moving, and they can start migrating. In 2 releases we still give them the option to do so, but give them a nudge to move because it's off by default.

Am I missing something?

(*) Scrape metrics from operator and serve in metric server, we can add a small memcache here to reduce load on operator.
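
A minimal sketch of the (*) step, assuming the metrics server can reach the Operator's Prometheus endpoint over plain HTTP; the URL, cache duration, and names are illustrative, not an actual KEDA component:

```go
package metricsproxy

import (
	"io"
	"net/http"
	"sync"
	"time"
)

// cachedMetricsHandler serves the Operator's /metrics output from the metrics
// server, caching the payload briefly so repeated scrapes don't all hit the
// Operator. This would sit behind the proposed feature flag.
type cachedMetricsHandler struct {
	operatorURL string        // the Operator service's metrics endpoint (illustrative)
	ttl         time.Duration // how long a scraped payload is reused

	mu      sync.Mutex
	body    []byte
	fetched time.Time
}

func (h *cachedMetricsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	defer h.mu.Unlock()

	// Refresh the cached payload only when it is missing or older than the TTL.
	if h.body == nil || time.Since(h.fetched) > h.ttl {
		resp, err := http.Get(h.operatorURL)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		h.body, h.fetched = b, time.Now()
	}

	// Prometheus text exposition format.
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	_, _ = w.Write(h.body)
}
```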

@v-shenoy
Contributor

> For me this is fairly clear on what we should do:
>
> 1. Introduce the approach suggested above by @zroubalik as new "ideal" solution
> 2. Introduce feature flag on metric server to still serve our current metrics (*)
> 3. Ship with feature flag on by default but officially deprecating it
> 4. Next KEDA v2.11 has the flag turned off by default
>
> Existing end-users are not broken but are aware things are moving, and they can start migrating. In 2 releases we still give them the option to do so, but give them a nudge to move because it's off by default.
>
> Am I missing something?
>
> (*) Scrape metrics from operator and serve in metric server, we can add a small memcache here to reduce load on operator.

I am not sure if this is a dumb question. But why specifically v2.11?

@JorTurFer
Member

> I am not sure if this is a dumb question. But why specifically v2.11?

The in-progress deprecation policy says that deprecated features are removed after 4 releases (1 year), so 2.11 is just half the deprecation time.

@tomkerkhove
Member

I went with current + 2, which is 2.11. This should be OK to do as it's still around, but we might want to clarify feature flags in the process.

@tomkerkhove
Member

> For me this is fairly clear on what we should do:
>
> 1. Introduce the approach suggested above by @zroubalik as new "ideal" solution
> 2. Introduce feature flag on metric server to still serve our current metrics (*)
> 3. Ship with feature flag on by default but officially deprecating it
> 4. Next KEDA v2.11 has the flag turned off by default
>
> Existing end-users are not broken but are aware things are moving, and they can start migrating. In 2 releases we still give them the option to do so, but give them a nudge to move because it's off by default.
>
> Am I missing something?
>
> (*) Scrape metrics from operator and serve in metric server, we can add a small memcache here to reduce load on operator.

Proposal also added to policy: https://github.com/kedacore/governance/pull/70/files#r1034591287

@zroubalik zroubalik changed the title Consolidate exposed Prometheus Metrics in one place Consolidate all exposed Prometheus Metrics in KEDA Operator Nov 29, 2022
@zroubalik
Member Author

I have updated the issue description. We will introduce the new metrics and consolidate the old ones in the KEDA Operator, while keeping the old metrics in the Metrics Server, which is a deprecated option and will be removed in future KEDA releases.

Repository owner moved this from In Progress to Ready To Ship in Roadmap - KEDA Core Nov 29, 2022
@tomkerkhove
Member

Re-opening as we need to announce deprecation as well

@tomkerkhove tomkerkhove reopened this Nov 30, 2022
Repository owner moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Nov 30, 2022
@tomkerkhove
Member

@zroubalik Can you please announce the deprecation as per https://github.com/kedacore/governance/blob/main/DEPRECATIONS.md?

@tomkerkhove
Member

Deprecation issue is #3972 with conversation on #3973

Repository owner moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Dec 8, 2022