Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exec credential provider: prep for 1.21 (first pass at metrics design, PRR updates) #2275

Merged
merged 6 commits into from
Feb 8, 2021

Conversation

ankeesler
Copy link
Contributor

Context

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 15, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @ankeesler. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 15, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Jan 15, 2021
[]string{},
)

execPluginFailedCalls = k8smetrics.NewCounterVec(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway we could collapse these rest_client_exec_plugin_calls and rest_client_exec_plugin_failed_calls metrics into a single metric with an "exitCode" label to indicate whether this was a successful call to the exec plugin?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan might be able to give guidance here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman - do you have any guidance to give on if we could collapse rest_client_exec_plugin_calls and rest_client_exec_plugin_failed_calls into a single metric with an "exitCode" label? I feel like this would just clean things up a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's feasible, although I think the standard label would be code. See e.g. https://github.com/kubernetes/apiserver/blob/4cca99e7fbc563a06b3c505af12bfb90b79d3bcf/pkg/endpoints/metrics/metrics.go#L76-L87 for some inspiration - @logicalhan will probably have opinions if he gets a chance to look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ehashman - really appreciate your feedback here. I gave this simplification a go in 191094e.


```golang
var (
execPluginCertTTL = k8smetrics.NewGaugeFunc(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put a "command" label on these exec plugin metrics to partition them by the underlying exec binary/command? How will a user of a single client-go process with multiple exec plugins differentiate between metrics about one exec plugin from another?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless we include all the inputs (envvars, args, etc), which we wouldn't do for security and cardinality reasons, it seems likely we wouldn't be able to reliably distinguish different invocations.

primary metrics used by this feature set.

```golang
var (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the duration of the exec plugin call a useful metric? I could see it being helpful in debugging long-running API calls, but perhaps it is more effort than it is worth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman I'd also be interested in your comments here, if you have cycles. Have you seen this type of metric being useful to operators in other scenarios?

@ankeesler
Copy link
Contributor Author

/assign @liggitt
/assign @enj

assigning some folks that i know have some sort of temporal locality on this topic, but we can certainly assign elsewhere

@ankeesler
Copy link
Contributor Author

@logicalhan - hi there! I am 3 months late to the game here, but could I take you up on your offer to look at the proposed metrics for our feature? Do the structure of these metrics make sense?

Context

@enj
Copy link
Member

enj commented Jan 18, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 18, 2021
@ankeesler
Copy link
Contributor Author

hey @liggitt - do you have any cycles to give to this pr this week? i am cognizant that kep freeze is right around the corner and i am sure we will have some back and forth on this design.

@liggitt
Copy link
Member

liggitt commented Jan 25, 2021

cc @kubernetes/sig-instrumentation-api-reviews
/cc @logicalhan

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jan 25, 2021
@fejta-bot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.


```golang
var (
execPluginCertTTL = k8smetrics.NewGaugeFunc(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a note, rest_client_exec_plugin_ttl_seconds / rest_client_exec_plugin_certificate_rotation_age are metrics we already shipped and this is just updating the KEP

[]string{},
)

execPluginFailedCalls = k8smetrics.NewCounterVec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan might be able to give guidance here


```golang
var (
execPluginCertTTL = k8smetrics.NewGaugeFunc(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless we include all the inputs (envvars, args, etc), which we wouldn't do for security and cardinality reasons, it seems likely we wouldn't be able to reliably distinguish different invocations.

Comment on lines 836 to 840
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- This feature set operates on the client-side.
- `rest_client_exec_plugin_ttl_seconds`: the expected lifetime of client-side certificates, in seconds
- `rest_client_exec_plugin_certificate_rotation_age`: the expected lifetime of client-side certificates, in seconds
- `rest_client_exec_plugin_calls`: 1 per the lifetime of the credential returned by the exec plugin
- `rest_client_exec_plugin_failed_calls`: 0, or a very low number compared to `rest_client_exec_plugin_calls`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics aren't SLOs, they're SLIs.

SLOs = what you expect as normal quality of service
SLIs = how you measure that

So, for example here, you might say "we target 0.01% unsuccessful calls in a moving 24h window", and then measure that using rest_client_exec_plugin_failed_calls / rest_client_exec_plugin_calls.

Personally I think this is a bit backwards in the template: you should first think about what you consider to be good quality of service, and then figure out how to measure it with metrics.

Copy link
Contributor Author

@ankeesler ankeesler Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I do see that I flipped that around. I made another pass at this section (and verbatim used your example, since it seems like a reasonable failure rate to me). I am trying to look at this like you said: start with the operations outcome, and then answer how to achieve that outcome with specific metrics. I still feel a little shaky on my first SLO, but I figured it would be better for me to get something down on paper for us to discuss sooner rather that later.

a5fa85a


execPluginCertRotation = k8smetrics.NewHistogram(
&k8smetrics.HistogramOpts{
Name: "rest_client_exec_plugin_certificate_rotation_age",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read through this KEP in detail so apologies if I'm missing something obvious but I don't really understand the bucketing on this metric. As an operator, I don't really care when a cert was last rotated, but I do care how close it is to expiry, which can't be inferred from rotation time. A histogram distribution of when certs were last rotated on the cluster isn't particularly useful.

Copy link
Contributor Author

@ankeesler ankeesler Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for the feedback @ehashman. Appreciate you taking the time to look at this. This was another metric that has already shipped with this feature. I don't have a lot of context on why it was added originally, but can take a guess at the reason why it was added.

care how close it is to expiry

Do you think that an operator can get this from the rest_client_exec_plugin_ttl_seconds metric? I.e., if the rest_client_exec_plugin_ttl_seconds is really low (i.e., single digit seconds), then the operator knows that the cert was really close to expiry. The information from rest_client_exec_plugin_certificate_rotation_age could then help the operator answer the question "is the reason why my cert is so close to expiry when rotated because the exec authenticator is rotating certs too slowly?"

I feel like I see 2 paths forward here: 1) update the bucketing for this metric (can we do that in a backwards-compatible way?), 2) add another metric that indicates to the operator how close certs are to expiry when they are rotated. I feel like rest_client_exec_plugin_ttl_seconds gets at 2, so I would lean towards 1 (updating the metric somehow in a backwards compatible way, if that is even possible), or simply leaving this metric alone.

EDIT: I got these metrics totally wrong...see comment 2 below this one...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing came to mind here - if the question we are trying to answer is "how close is my cert to expiry", should we bucket a metric by "percentage of lifetime at which a cert is rotated"? I don't think we could do this to this metric in a backwards compatible way, but maybe we could add another metric if need-be? If we use a percentage of the lifetime, then I think the metric will be more helpful to an operator that may have some certs that are longer lifetime (1 year) vs shorter (1 week).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I realized now I got these metrics totally wrong. Slow moving this morning. Let me try to reply to your comment again.

care how close it is to expiry

OK, so an operator wants to figure out how close the certificate was to expiry. I'm going to assume that the operator knows the TTL of their certs, e.g., 1 week (604800 seconds). The way the code is written right now, the cert will only be rotated 1) when we make a request to the API with this credential and we get a 401 back or 2) we are about to make a request with the creds and we see that the current time is past the expiration time of the creds (per the exec plugin). That says to me that the operator should expect that all cert rotations happen at around 604800 seconds, i.e., certs are very close to expiry when they are rotated.

OK. I am going to stop typing now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do care how close it is to expiry

That's what the rest_client_exec_plugin_ttl_seconds metric is for. This metric is for detecting when rotations are happening much more frequently or less frequently than expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both!

@ankeesler ankeesler changed the title exec credential provider: first pass at metrics design exec credential provider: prep for 1.21 (first pass at metrics design, PRR updates) Jan 29, 2021
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
@@ -0,0 +1,3 @@
kep-number: 541
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annajung - continuing conversation[0] about updating this kep for 1.21, I added this keps/prod-readiness/... file per your note. Let me know if this is not what we need.

[0] https://kubernetes.slack.com/archives/C0EN96KUY/p1611762222015600?thread_ts=1611606235.003600&cid=C0EN96KUY

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, this is exactly what's needed as noted in the PRR requirement, thanks!

@@ -22,4 +22,4 @@ latest-milestone: "v1.20"
milestone:
alpha: "v1.10"
beta: "v1.11"
stable: "v1.21"
stable: "v1.22"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annajung - continuing conversation[0] about updating this kep for 1.21, I updating this stable target per your note, let me know if this is not what we need.

[0] https://kubernetes.slack.com/archives/C0EN96KUY/p1611762222015600?thread_ts=1611606235.003600&cid=C0EN96KUY

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. In addition to this, I would ask to change the latest-milestone value to v.1.21.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! thanks for taking the time to review. I gave this a go here: 9a8390f

Signed-off-by: Andrew Keesler <akeesler@vmware.com>
@ankeesler
Copy link
Contributor Author

@ehashman - thank you for your help thus far on this PR. From your perspective, are these metrics "good enough" to move forward with? I'm not trying to push you in any direction, I'm just looking for any more feedback with the hopes of merging the KEP before the 1.21 freeze. :)

We would also benefit from any feedback you or someone from sig-instrumentation is willing to give on a couple of specific metrics comments: https://github.com/kubernetes/enhancements/pull/2275/files#r566825710, https://github.com/kubernetes/enhancements/pull/2275/files#r566826217.

Signed-off-by: Andrew Keesler <akeesler@vmware.com>
@ankeesler
Copy link
Contributor Author

@liggitt - how you feeling about that metrics shape? Think this is "good enough" to move forward with?

@ehashman
Copy link
Member

ehashman commented Feb 3, 2021

@ankeesler I think we need a PRR reviewer. If @logicalhan has a chance to TAL for instrumentation as well that'd be great. I have more experience with using/consuming the metrics than writing them.

@ankeesler
Copy link
Contributor Author

ankeesler commented Feb 4, 2021

@ehashman - sounds good, thanks for that direction.

In the past, @deads2k has reviewed this KEP from the PRR perspective. I was going to ask if he would be willing to give it another pass (especially because we need an approver for keps/sig-auth/OWNERS to meet the new PRR KEP requirements).

@deads2k - would you be willing to review the keps/sig-auth/OWNERS file and updated PRR section of this KEP as you have looked at it before?

@logicalhan - would love your feedback on this metrics shape if you have time before KEP freeze. :)

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nits, but on the instrumentation side it looks okay to me.


execPluginCertRotation = k8smetrics.NewHistogram(
&k8smetrics.HistogramOpts{
Name: "rest_client_exec_plugin_certificate_rotation_age",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is very oddly named. I prefer rest_client_exec_plugin_certificate_lifespan_duration or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it is a little weird.

What is our API contract with respect to metric names? I think this metric has already been running in clusters since v1.18.0: kubernetes/kubernetes@e3b1cd1. Are we allowed to change metric names between releases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it came in in kubernetes/kubernetes#84382 (review)

I'd prefer we keep the current name unless it is broken.


execPluginCalls = k8smetrics.NewCounterVec(
&k8smetrics.CounterOpts{
Name: "rest_client_exec_plugin_calls",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counters are conventionally suffixed with rest_client_exec_plugin_call_total.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. a764f28.

@deads2k
Copy link
Contributor

deads2k commented Feb 4, 2021

PRR lgtm, but I don't want to tag and accidentally approve a KEP. Get me on slack once the rest is lgtm/approved.

…t_exec_plugin_call_total

Signed-off-by: Andrew Keesler <akeesler@vmware.com>
@liggitt
Copy link
Member

liggitt commented Feb 8, 2021

feedback on new metrics from sig-instrumentation is incorporated

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2021
@deads2k
Copy link
Contributor

deads2k commented Feb 8, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankeesler, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit 4b5e2f9 into kubernetes:master Feb 8, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants