
kube-state-metrics breaking release aka 2.0 #569

Closed · 12 tasks done
brancz opened this issue Oct 24, 2018 · 48 comments
Labels: lifecycle/stale, v2

@brancz (Member) commented Oct 24, 2018

We have accumulated a number of deprecated metrics and odd behaviors that I believe may justify a 2.0 release. I'd like to use this issue to discuss whether people think this is a good idea and to collect what we would potentially like to break, should we do a breaking release.

Off the top of my head, the breaking changes I would like to make: [task list elided; tracked as the 12 tasks in the issue header]

I would see a breaking release at least 3 months out, as I would like to validate the performance optimizations independently first. Further thoughts?

@andyxning @zouyee @mxinden

@andyxning (Member)

I agree with doing a breaking release to clean up kube-state-metrics. Actually, the performance optimizations are a feature; I am fine with a 2.1 release adding them.

rename black-/whitelist to allow/deny-list

I am fine with this change, but the black-/whitelist names also seem OK. What is the motivation for moving to allow/deny-list? And this is apparently an additional breaking change.

use same ports in all cases (currently the flag defaults to 80/81, but the dockerfile specifies 8080 and 8081)

Fine with this. I also think that if we do this, we need to deprecate --telemetry-port and --telemetry-host as well. But the reason we added the telemetry port and host was mainly to split the metrics for kube-state-metrics itself from the metrics for Kubernetes; merging them onto one port does not seem optimal.

@brancz (Member Author) commented Oct 25, 2018

I would prefer to release the performance improvements in a 1.x release, and once we're comfortable everything works correctly, we go ahead and make these breaking changes.

Renaming black-/whitelist is for ethical reasons. I know it has been a commonly used term, but it doesn't even describe well what it does, and it seems the rename isn't contentious :).

I think keeping the telemetry host and port is fine, just the default port should be something other than 81.

@andyxning (Member)

Renaming black-/whitelist is for ethical reasons. I know it has been a commonly used term, but it doesn't even describe well what it does, and it seems the rename isn't contentious :).

Seems reasonable to me. allow/deny is more accurate.

I think keeping the telemetry host and port is fine, just the default port should be something other than 81.

Any suggestions for candidates? 82, or some other value? IMHO, 81 is good enough for this job. :)

@andyxning (Member)

I would prefer to release the performance improvements in a 1.x release, and once we're comfortable everything works correctly, we go ahead and make these breaking changes.

Agreed.

@brancz (Member Author) commented Oct 26, 2018

Sorry, I should have been clearer about why I think we should change the ports. Binding to anything lower than 1024 requires root on Linux (or at least the CAP_NET_BIND_SERVICE capability). That's why the default ports of kube-state-metrics should be higher than that. Beyond that, whatever we use should be consistent between the plain binary and the container.
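
As a sketch (flag names as used by kube-state-metrics; the 8080/8081 values are the ones the Dockerfile already uses), consistent unprivileged defaults would look roughly like this:

# --port serves the Kubernetes object metrics, --telemetry-port serves
# kube-state-metrics' own metrics; both are >= 1024, so neither root nor
# extra capabilities are needed.
kube-state-metrics --port=8080 --telemetry-port=8081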

@andyxning (Member)

Understood clearly. @brancz

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 26, 2019
@brancz (Member Author) commented Jan 26, 2019

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 26, 2019
@tariq1890 (Contributor)

I would also like to propose that we update the k8s dependencies to 1.13 at the very least. Given that it has the potential of introducing backwards-incompatible/breaking changes, it would be a good candidate for the 2.0 release.

@brancz (Member Author) commented Feb 21, 2019

Do we have known incompatibilities? So far we only have the policy of supporting the past 4 versions of Kubernetes; as long as we stick with that, it's not so much a "breaking" release.

@tariq1890 (Contributor)

Nothing off the top of my head. The client-go version support matrix is unfortunately not very clear about what is and is not supported. I can experiment with the latest k8s client-go dependency and see how that turns out against the previous releases.

By "past 4 versions of Kubernetes", do you mean with respect to the latest Kubernetes version out there, or the latest Kubernetes version vendored into kube-state-metrics?

@brancz (Member Author) commented Feb 21, 2019

By "past 4 versions of Kubernetes", do you mean with respect to the latest Kubernetes version out there, or the latest Kubernetes version vendored into kube-state-metrics?

The latter. Forward compatibility should work, but we can't guarantee it due to client-go.

@zuzzas (Contributor) commented Feb 25, 2019

@brancz, you've raised a motion to remove kube_[object]_owner metrics with <none> labels for Kubernetes objects that lack any owners.

I'd like to object!

We've got a use case for grouping Pods and other objects that lack any controller. This logic only works at ingestion time with recording rules if we have a metric that clearly indicates the lack of owners on an object. We can't solve this riddle with absent(), since it requires explicitly specifying an instant vector with the required labels.

Please, do not think ill of me after looking at a screenshot of this abomination:

[screenshot of the recording rule]
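
To make the use case concrete, a minimal sketch of the kind of query the <none> owner values enable (metric and label names as exposed by kube-state-metrics; the actual recording rule from the screenshot is not reproduced here):

# Count pods without an owning controller, per namespace. This only works
# because kube_pod_owner is emitted with owner_kind="<none>" for ownerless
# pods; absent() cannot produce such per-namespace series.
count by (namespace) (kube_pod_owner{owner_kind="<none>"})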

@dohnto commented Feb 25, 2019

I might sound silly, but maybe we could unify the singular/plural form of the --namespace/--collectors options to be more consistent?

@brancz (Member Author) commented Feb 26, 2019

@zuzzas

Please, do not think ill of me after looking at a screenshot of this abomination

Haha, if anything I have more respect! That's a pretty nice recording rule; would you mind contributing it to the kubernetes-mixin?

I think you have a fair point; let's keep it. In theory this could be done with a join as well, but that feels too difficult to get right. Thanks for bringing this up!

@dohnto:

I might sound silly, but maybe we could unify the singular/plural form of the --namespace/--collectors options to be more consistent?

Not at all silly. The --namespace flag is internally already called "namespaces"; it is only singular for backward compatibility. I'll add making it plural to the list. Thanks for the suggestion, keep them coming! 🙂

zuzzas added a commit to zuzzas/kube-state-metrics that referenced this issue Feb 27, 2019
…empty"

This reverts commit 15b93c4.

Based on Brancz's decision here:
kubernetes#569 (comment)

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
@brancz (Member Author) commented Apr 4, 2019

@sylr brought to my attention that a number of pod metrics carry the node label. For consistency, this label should not be on those metrics directly, but instead be joined onto the metrics at query time. Example:

// from the pod metrics generator: the node name is baked into the series as a label value
LabelValues: []string{c.Name, p.Spec.NodeName, sanitizeLabelName(string(resourceName)), string(constant.UnitCore)},
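
As a sketch of the query-time alternative (metric and label names as exposed by kube-state-metrics; kube_pod_info already carries the node label and always has the value 1, so multiplying leaves the left-hand values unchanged):

# Attach the node label via a many-to-one join, once the pod metric itself
# no longer carries node directly.
kube_pod_container_resource_requests{resource="cpu"} * on (namespace, pod) group_left (node) kube_pod_info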

@brancz (Member Author) commented May 20, 2019

Added "consider renaming"

@andyxning (Member)

consider renaming kube-state-metrics to kubernetes-exporter

LGTM.

@brancz (Member Author) commented Jun 5, 2019

Added

kube_secret_metadata_resource_version, kube_configmap_metadata_resource_version and kube_ingress_metadata_resource_version expose the resource version as a string in their label sets. This value can change often and therefore creates huge cardinality. It should be a number, or not exist at all.

Thanks to @xieyanker for noticing here: #777 (comment)
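
To make the cardinality issue concrete, this is roughly what the exposed series look like (label values are illustrative):

# current form: the resource version is a label, so every change creates a new series
kube_configmap_metadata_resource_version{namespace="default",configmap="my-config",resource_version="54321"} 1
# proposed form: the resource version is the sample value, so the series stays stable
kube_configmap_metadata_resource_version{namespace="default",configmap="my-config"} 54321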

@lilic (Member) commented Jun 20, 2019

I would add to the list renaming all the leftover user-facing occurrences of "collectors" to "resources", as we recently removed the collectors package. That would also mean renaming collector to resource in the options. Overall, a --resources=pods flag would be more self-descriptive.

@brancz (Member Author) commented Jun 20, 2019

Great suggestion! Added.

@tariq1890 (Contributor)

Let's add the Sharding feature to the list as well.

@brancz (Member Author) commented Jun 22, 2019

Sharding isn't breaking, so I feel it can be added in a backward-compatible way in 1.x or 2.x. It's fairly close to being ready, I would say, so I'd like to see it go into a 1.x release.

@bboreham (Contributor)

Came here to +1 removal of high-cardinality kube_configmap_metadata_resource_version.
This one metric occupies 3% of all the data in our service.

I'm intrigued: what does anyone use this metric for, in its current form?

@lilic (Member) commented Jul 31, 2019

@bboreham you could also just blacklist those metrics for now, until they are removed; --metric-blacklist="kube_configmap_metadata_resource_version", for example, should work. :)

@brancz (Member Author) commented Oct 16, 2019

Doesn’t a new major release show that there are breaking changes? Of course we need to properly document these changes.

@lilic added the v2 label on Dec 4, 2019
@lilic (Member) commented Dec 4, 2019

All corresponding issues from above were created under the v2 label.

Here is the Google doc for the v2 release that we discussed during KubeCon: https://docs.google.com/document/d/1lCvbvOAVFai7ciP_heZrJ_QXLOEaFu8ZBmwVVeCwf54/edit?usp=sharing

@brancz (Member Author) commented Dec 4, 2019

One more thing that came up during KubeCon: before we do the v2 release, we probably want to do another round of scalability tests. I believe Google volunteered to do this.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 21, 2020
@olivierlemasle (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Apr 21, 2020
@lilic (Member) commented Apr 22, 2020

@brancz is there anyone to ping regarding the scalability tests? Or should we maybe bring it up in the SIG call? @tariq1890

@brancz (Member Author) commented Apr 22, 2020

SIG call sounds good

@brancz (Member Author) commented Sep 10, 2020

Looks like we've done everything on the list; let's get a pre-release started then! :)

@QuentinBisson (Contributor)

Shouldn't the kube-state-metrics module path become module k8s.io/kube-state-metrics/v2 in go.mod as well for v2?
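
For context, Go modules require a /vN suffix in the module path for major versions 2 and above, so this is essentially a one-line change to go.mod plus updating import paths; a rough sketch (the go directive version is illustrative only):

// go.mod
module k8s.io/kube-state-metrics/v2

go 1.15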

@brancz (Member Author) commented Sep 21, 2020

You're right, we forgot about that. @omegas27 do you want to take care of that?

@QuentinBisson (Contributor)

Sure 👍

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 20, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 19, 2021
@lilic (Member) commented Jan 19, 2021

/remove-lifecycle rotten

We are almost there 🎉

@k8s-ci-robot removed the lifecycle/rotten label on Jan 19, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 19, 2021
@lilic (Member) commented Apr 19, 2021

We finished and the release is cut 🎉 Thank you all!!

@lilic closed this as completed on Apr 19, 2021