exec credential provider: prep for 1.21 (first pass at metrics design, PRR updates) #2275
Conversation
Hi @ankeesler. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
	[]string{},
)

execPluginFailedCalls = k8smetrics.NewCounterVec(
Is there any way we could collapse these `rest_client_exec_plugin_calls` and `rest_client_exec_plugin_failed_calls` metrics into a single metric with an `"exitCode"` label to indicate whether this was a successful call to the exec plugin?
@logicalhan might be able to give guidance here
@ehashman - do you have any guidance on whether we could collapse `rest_client_exec_plugin_calls` and `rest_client_exec_plugin_failed_calls` into a single metric with an `"exitCode"` label? I feel like this would clean things up a bit.
Yeah that's feasible, although I think the standard label would be `code`. See e.g. https://github.com/kubernetes/apiserver/blob/4cca99e7fbc563a06b3c505af12bfb90b79d3bcf/pkg/endpoints/metrics/metrics.go#L76-L87 for some inspiration - @logicalhan will probably have opinions if he gets a chance to look.
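For concreteness, a minimal sketch of what the collapsed metric could look like with a `code` label (hypothetical shape only; assumes `k8smetrics` is `k8s.io/component-base/metrics`):

```golang
// Hypothetical collapsed metric: one counter vec, partitioned by exit code.
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_calls",
		Help: "Number of calls to the exec credential plugin, partitioned by exit code.",
	},
	[]string{"code"}, // e.g. "0" on success, the plugin's exit code on failure
)
```

A successful call and a failed call would then differ only in the `code` label value, so the failure ratio could be computed from a single metric.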
```golang
var (
	execPluginCertTTL = k8smetrics.NewGaugeFunc(
```
Should we put a `"command"` label on these exec plugin metrics to partition them by the underlying exec binary/command? How will a user of a single client-go process with multiple exec plugins differentiate between metrics about one exec plugin and another?
Unless we include all the inputs (env vars, args, etc.), which we wouldn't do for security and cardinality reasons, it seems likely we wouldn't be able to reliably distinguish different invocations.
primary metrics used by this feature set.

```golang
var (
```
Is the duration of the exec plugin call a useful metric? I could see it being helpful in debugging long-running API calls, but perhaps it is more effort than it is worth.
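If we did add it, one hypothetical shape (name and buckets are illustrative only; nothing here has shipped):

```golang
execPluginCallDuration = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name:    "rest_client_exec_plugin_call_duration_seconds",
		Help:    "Duration of exec credential plugin invocations, in seconds.",
		Buckets: k8smetrics.DefBuckets, // default Prometheus buckets, 5ms..10s
	},
)
```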
@ehashman I'd also be interested in your comments here, if you have cycles. Have you seen this type of metric being useful to operators in other scenarios?
@logicalhan - hi there! I am 3 months late to the game here, but could I take you up on your offer to look at the proposed metrics for our feature? Does the structure of these metrics make sense?
/ok-to-test

Force-pushed from e45e352 to 1846138.
hey @liggitt - do you have any cycles to give to this pr this week? i am cognizant that kep freeze is right around the corner and i am sure we will have some back and forth on this design.

cc @kubernetes/sig-instrumentation-api-reviews

This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
```golang
var (
	execPluginCertTTL = k8smetrics.NewGaugeFunc(
```
as a note, `rest_client_exec_plugin_ttl_seconds` and `rest_client_exec_plugin_certificate_rotation_age` are metrics we already shipped; this is just updating the KEP
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  - This feature set operates on the client-side.
  - `rest_client_exec_plugin_ttl_seconds`: the expected lifetime of client-side certificates, in seconds
  - `rest_client_exec_plugin_certificate_rotation_age`: the expected lifetime of client-side certificates, in seconds
  - `rest_client_exec_plugin_calls`: 1 per the lifetime of the credential returned by the exec plugin
  - `rest_client_exec_plugin_failed_calls`: 0, or a very low number compared to `rest_client_exec_plugin_calls`
Metrics aren't SLOs, they're SLIs.

- SLOs = what you expect as normal quality of service
- SLIs = how you measure that

So, for example here, you might say "we target 0.01% unsuccessful calls in a moving 24h window", and then measure that using `rest_client_exec_plugin_failed_calls / rest_client_exec_plugin_calls`.

Personally I think this is a bit backwards in the template: you should first think about what you consider to be good quality of service, and then figure out how to measure it with metrics.
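For illustration only, here is roughly how that example target maps onto the two counters (a hypothetical helper, not part of the KEP; the inputs would come from the counters sampled over the window):

```golang
// meetsFailureSLO reports whether the observed failure ratio over some
// window stays within the example target of 0.01% unsuccessful calls.
func meetsFailureSLO(failedCalls, totalCalls float64) bool {
	const target = 0.0001 // 0.01%
	if totalCalls == 0 {
		return true // no calls in the window, nothing to violate
	}
	return failedCalls/totalCalls <= target
}
```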
Ah, yes, I do see that I flipped that around. I made another pass at this section (and used your example verbatim, since it seems like a reasonable failure rate to me). I am trying to look at this the way you said: start with the operational outcome, and then answer how to achieve that outcome with specific metrics. I still feel a little shaky on my first SLO, but I figured it would be better to get something down on paper for us to discuss sooner rather than later.
execPluginCertRotation = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name: "rest_client_exec_plugin_certificate_rotation_age",
I haven't read through this KEP in detail, so apologies if I'm missing something obvious, but I don't really understand the bucketing on this metric. As an operator, I don't really care when a cert was last rotated, but I do care how close it is to expiry, which can't be inferred from rotation time. A histogram distribution of when certs were last rotated on the cluster isn't particularly useful.
OK, thanks for the feedback @ehashman. Appreciate you taking the time to look at this. This was another metric that has already shipped with this feature. I don't have a lot of context on why it was added originally, but I can take a guess.

> care how close it is to expiry

Do you think an operator can get this from the `rest_client_exec_plugin_ttl_seconds` metric? I.e., if `rest_client_exec_plugin_ttl_seconds` is really low (single-digit seconds), then the operator knows the cert was really close to expiry. The information from `rest_client_exec_plugin_certificate_rotation_age` could then help the operator answer the question "is my cert so close to expiry when rotated because the exec authenticator is rotating certs too slowly?"

I see 2 paths forward here: 1) update the bucketing for this metric (can we do that in a backwards-compatible way?), or 2) add another metric that indicates how close certs are to expiry when they are rotated. I feel like `rest_client_exec_plugin_ttl_seconds` already gets at 2, so I would lean towards 1 (updating the metric somehow in a backwards-compatible way, if that is even possible), or simply leaving this metric alone.

EDIT: I got these metrics totally wrong...see comment 2 below this one...
Another thing came to mind here - if the question we are trying to answer is "how close is my cert to expiry?", should we bucket a metric by the percentage of its lifetime at which a cert is rotated? I don't think we could do this to this metric in a backwards-compatible way, but maybe we could add another metric if need be? If we use a percentage of the lifetime, then I think the metric will be more helpful to an operator who has some certs with longer lifetimes (1 year) and some with shorter (1 week). See the sketch below.
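To make that concrete, a hypothetical sketch of the computation behind such a metric (illustrative only; not an existing metric):

```golang
import "time"

// rotationLifetimeFraction returns how far through its lifetime a cert was
// when it was rotated: 0.0 = rotated right after issuance, 1.0 = at expiry.
func rotationLifetimeFraction(notBefore, notAfter, rotatedAt time.Time) float64 {
	lifetime := notAfter.Sub(notBefore)
	if lifetime <= 0 {
		return 0 // malformed cert validity window
	}
	return rotatedAt.Sub(notBefore).Seconds() / lifetime.Seconds()
}
```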
OK, I realized now that I got these metrics totally wrong. Slow moving this morning. Let me try to reply to your comment again.

> care how close it is to expiry

OK, so an operator wants to figure out how close the certificate was to expiry. I'm going to assume that the operator knows the TTL of their certs, e.g., 1 week (604800 seconds). The way the code is written right now, the cert will only be rotated 1) when we make a request to the API with this credential and get a 401 back, or 2) when we are about to make a request with the creds and see that the current time is past their expiration time (per the exec plugin). That says to me that the operator should expect all cert rotations to happen at around 604800 seconds, i.e., certs are very close to expiry when they are rotated.
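In code terms, a hypothetical sketch of those two conditions (types simplified; not the actual client-go implementation):

```golang
import "time"

// credential is a simplified stand-in for the status the exec plugin returns.
type credential struct {
	expirationTimestamp *time.Time
}

// needsRefresh mirrors the two rotation triggers described above: a 401 from
// the API server, or the current time passing the credential's expiration.
func needsRefresh(cred credential, now time.Time, got401 bool) bool {
	if got401 {
		return true
	}
	return cred.expirationTimestamp != nil && now.After(*cred.expirationTimestamp)
}
```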
OK. I am going to stop typing now.
> I do care how close it is to expiry

That's what the `rest_client_exec_plugin_ttl_seconds` metric is for. This metric is for detecting when rotations are happening much more frequently or less frequently than expected.
Thanks both!
Force-pushed from a5fa85a to 36fac3d.
@@ -0,0 +1,3 @@
kep-number: 541
@annajung - continuing the conversation[0] about updating this KEP for 1.21, I added this `keps/prod-readiness/...` file per your note. Let me know if this is not what we need.
perfect, this is exactly what's needed as noted in the PRR requirement, thanks!
@@ -22,4 +22,4 @@ latest-milestone: "v1.20"
 milestone:
   alpha: "v1.10"
   beta: "v1.11"
-  stable: "v1.21"
+  stable: "v1.22"
@annajung - continuing the conversation[0] about updating this KEP for 1.21, I updated this stable target per your note; let me know if this is not what we need.
This looks good. In addition to this, I would ask you to change the `latest-milestone` value to `v1.21`.
sounds good! thanks for taking the time to review. I gave this a go here: 9a8390f
@ehashman - thank you for your help thus far on this PR. From your perspective, are these metrics "good enough" to move forward with? I'm not trying to push you in any direction, I'm just looking for any more feedback with the hopes of merging the KEP before the 1.21 freeze. :) We would also benefit from any feedback you or someone from sig-instrumentation is willing to give on a couple of specific metrics comments: https://github.com/kubernetes/enhancements/pull/2275/files#r566825710, https://github.com/kubernetes/enhancements/pull/2275/files#r566826217.
@liggitt - how are you feeling about that metrics shape? Think this is "good enough" to move forward with?
@ankeesler I think we need a PRR reviewer. If @logicalhan has a chance to TAL for instrumentation as well, that'd be great. I have more experience with using/consuming the metrics than writing them.
@ehashman - sounds good, thanks for that direction. In the past, @deads2k has reviewed this KEP from the PRR perspective. I was going to ask if he would be willing to give it another pass (especially because we need an approver for the PRR).

@deads2k - would you be willing to review the PRR updates here?

@logicalhan - would love your feedback on this metrics shape if you have time before KEP freeze. :)
minor nits, but on the instrumentation side it looks okay to me.
execPluginCertRotation = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name: "rest_client_exec_plugin_certificate_rotation_age",
This metric is very oddly named. I prefer `rest_client_exec_plugin_certificate_lifespan_duration` or something.
Yeah, I agree it is a little weird.

What is our API contract with respect to metric names? I think this metric has already been running in clusters since `v1.18.0`: kubernetes/kubernetes@e3b1cd1. Are we allowed to change metric names between releases?
Looks like it came in in kubernetes/kubernetes#84382 (review). I'd prefer we keep the current name unless it is broken.
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_calls",
Counters are conventionally given a `_total` suffix, e.g. `rest_client_exec_plugin_call_total`.
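So, assuming the shape above, the renamed metric would look something like this (label set hypothetical):

```golang
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_call_total",
		Help: "Number of calls to the exec credential plugin, partitioned by exit code.",
	},
	[]string{"code"},
)
```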
Sounds good to me. a764f28.
PRR lgtm, but I don't want to tag and accidentally approve a KEP. Get me on slack once the rest is lgtm/approved.
feedback on new metrics from sig-instrumentation is incorporated

/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankeesler, deads2k

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.