exec credential provider: prep for 1.21 (first pass at metrics design, PRR updates) #2275
Conversation
Hi @ankeesler. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
	[]string{},
)

execPluginFailedCalls = k8smetrics.NewCounterVec(
Is there any way we could collapse these `rest_client_exec_plugin_calls` and `rest_client_exec_plugin_failed_calls` metrics into a single metric with an `"exitCode"` label to indicate whether this was a successful call to the exec plugin?
@logicalhan might be able to give guidance here
@ehashman - do you have any guidance on whether we could collapse `rest_client_exec_plugin_calls` and `rest_client_exec_plugin_failed_calls` into a single metric with an `"exitCode"` label? I feel like this would clean things up a bit.
Yeah that's feasible, although I think the standard label would be `code`. See e.g. https://github.com/kubernetes/apiserver/blob/4cca99e7fbc563a06b3c505af12bfb90b79d3bcf/pkg/endpoints/metrics/metrics.go#L76-L87 for some inspiration - @logicalhan will probably have opinions if he gets a chance to look.
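For concreteness, a minimal sketch of what the collapsed metric could look like with a `code` label (hypothetical shape only; assumes `k8smetrics` is `k8s.io/component-base/metrics`):

```golang
// Hypothetical collapsed metric: one counter vec, partitioned by exit code.
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_calls",
		Help: "Number of calls to the exec credential plugin, partitioned by exit code.",
	},
	[]string{"code"}, // e.g. "0" on success, the plugin's exit code on failure
)
```

A successful call and a failed call would then differ only in the `code` label value, so the failure ratio could be computed from a single metric.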
```golang
var (
	execPluginCertTTL = k8smetrics.NewGaugeFunc(
```
Should we put a `"command"` label on these exec plugin metrics to partition them by the underlying exec binary/command? How will a user of a single client-go process with multiple exec plugins differentiate between metrics about one exec plugin and another?
Unless we include all the inputs (env vars, args, etc.), which we wouldn't do for security and cardinality reasons, it seems likely we wouldn't be able to reliably distinguish different invocations.
primary metrics used by this feature set.

```golang
var (
```
Is the duration of the exec plugin call a useful metric? I could see it being helpful in debugging long-running API calls, but perhaps it is more effort than it is worth.
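If we did add it, one hypothetical shape (name and buckets are illustrative only; nothing here has shipped):

```golang
execPluginCallDuration = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name:    "rest_client_exec_plugin_call_duration_seconds",
		Help:    "Duration of exec credential plugin invocations, in seconds.",
		Buckets: k8smetrics.DefBuckets, // default Prometheus buckets, 5ms..10s
	},
)
```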
@ehashman I'd also be interested in your comments here, if you have cycles. Have you seen this type of metric being useful to operators in other scenarios?
@logicalhan - hi there! I am 3 months late to the game here, but could I take you up on your offer to look at the proposed metrics for our feature? Does the structure of these metrics make sense?
/ok-to-test

Force-pushed from e45e352 to 1846138.
hey @liggitt - do you have any cycles to give to this pr this week? i am cognizant that kep freeze is right around the corner and i am sure we will have some back and forth on this design.

cc @kubernetes/sig-instrumentation-api-reviews

This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
```golang
var (
	execPluginCertTTL = k8smetrics.NewGaugeFunc(
```
as a note, `rest_client_exec_plugin_ttl_seconds` and `rest_client_exec_plugin_certificate_rotation_age` are metrics we already shipped; this is just updating the KEP
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  - This feature set operates on the client-side.
  - `rest_client_exec_plugin_ttl_seconds`: the expected lifetime of client-side certificates, in seconds
  - `rest_client_exec_plugin_certificate_rotation_age`: the expected lifetime of client-side certificates, in seconds
  - `rest_client_exec_plugin_calls`: 1 per the lifetime of the credential returned by the exec plugin
  - `rest_client_exec_plugin_failed_calls`: 0, or a very low number compared to `rest_client_exec_plugin_calls`
Metrics aren't SLOs, they're SLIs.

- SLOs = what you expect as normal quality of service
- SLIs = how you measure that

So, for example here, you might say "we target 0.01% unsuccessful calls in a moving 24h window", and then measure that using `rest_client_exec_plugin_failed_calls / rest_client_exec_plugin_calls`.

Personally I think this is a bit backwards in the template: you should first think about what you consider to be good quality of service, and then figure out how to measure it with metrics.
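For illustration only, here is roughly how that example target maps onto the two counters (a hypothetical helper, not part of the KEP; the inputs would come from the counters sampled over the window):

```golang
// meetsFailureSLO reports whether the observed failure ratio over some
// window stays within the example target of 0.01% unsuccessful calls.
func meetsFailureSLO(failedCalls, totalCalls float64) bool {
	const target = 0.0001 // 0.01%
	if totalCalls == 0 {
		return true // no calls in the window, nothing to violate
	}
	return failedCalls/totalCalls <= target
}
```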
Ah, yes, I do see that I flipped that around. I made another pass at this section (and used your example verbatim, since it seems like a reasonable failure rate to me). I am trying to look at this the way you said: start with the operational outcome, and then answer how to achieve that outcome with specific metrics. I still feel a little shaky on my first SLO, but I figured it would be better to get something down on paper for us to discuss sooner rather than later.
execPluginCertRotation = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name: "rest_client_exec_plugin_certificate_rotation_age",
I haven't read through this KEP in detail, so apologies if I'm missing something obvious, but I don't really understand the bucketing on this metric. As an operator, I don't really care when a cert was last rotated, but I do care how close it is to expiry, which can't be inferred from rotation time. A histogram distribution of when certs were last rotated on the cluster isn't particularly useful.
OK, thanks for the feedback @ehashman. Appreciate you taking the time to look at this. This was another metric that has already shipped with this feature. I don't have a lot of context on why it was added originally, but I can take a guess.

> care how close it is to expiry

Do you think an operator can get this from the `rest_client_exec_plugin_ttl_seconds` metric? I.e., if `rest_client_exec_plugin_ttl_seconds` is really low (single-digit seconds), then the operator knows the cert was really close to expiry. The information from `rest_client_exec_plugin_certificate_rotation_age` could then help the operator answer the question "is my cert so close to expiry when rotated because the exec authenticator is rotating certs too slowly?"

I see 2 paths forward here: 1) update the bucketing for this metric (can we do that in a backwards-compatible way?), or 2) add another metric that indicates how close certs are to expiry when they are rotated. I feel like `rest_client_exec_plugin_ttl_seconds` already gets at 2, so I would lean towards 1 (updating the metric somehow in a backwards-compatible way, if that is even possible), or simply leaving this metric alone.

EDIT: I got these metrics totally wrong...see comment 2 below this one...
Another thing came to mind here - if the question we are trying to answer is "how close is my cert to expiry?", should we bucket a metric by the percentage of its lifetime at which a cert is rotated? I don't think we could do this to this metric in a backwards-compatible way, but maybe we could add another metric if need be? If we use a percentage of the lifetime, then I think the metric will be more helpful to an operator who has some certs with longer lifetimes (1 year) and some with shorter (1 week). See the sketch below.
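To make that concrete, a hypothetical sketch of the computation behind such a metric (illustrative only; not an existing metric):

```golang
import "time"

// rotationLifetimeFraction returns how far through its lifetime a cert was
// when it was rotated: 0.0 = rotated right after issuance, 1.0 = at expiry.
func rotationLifetimeFraction(notBefore, notAfter, rotatedAt time.Time) float64 {
	lifetime := notAfter.Sub(notBefore)
	if lifetime <= 0 {
		return 0 // malformed cert validity window
	}
	return rotatedAt.Sub(notBefore).Seconds() / lifetime.Seconds()
}
```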
OK, I realized now that I got these metrics totally wrong. Slow moving this morning. Let me try to reply to your comment again.

> care how close it is to expiry

OK, so an operator wants to figure out how close the certificate was to expiry. I'm going to assume that the operator knows the TTL of their certs, e.g., 1 week (604800 seconds). The way the code is written right now, the cert will only be rotated 1) when we make a request to the API with this credential and get a 401 back, or 2) when we are about to make a request with the creds and see that the current time is past their expiration time (per the exec plugin). That says to me that the operator should expect all cert rotations to happen at around 604800 seconds, i.e., certs are very close to expiry when they are rotated.
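In code terms, a hypothetical sketch of those two conditions (types simplified; not the actual client-go implementation):

```golang
import "time"

// credential is a simplified stand-in for the status the exec plugin returns.
type credential struct {
	expirationTimestamp *time.Time
}

// needsRefresh mirrors the two rotation triggers described above: a 401 from
// the API server, or the current time passing the credential's expiration.
func needsRefresh(cred credential, now time.Time, got401 bool) bool {
	if got401 {
		return true
	}
	return cred.expirationTimestamp != nil && now.After(*cred.expirationTimestamp)
}
```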
OK. I am going to stop typing now.
> I do care how close it is to expiry

That's what the `rest_client_exec_plugin_ttl_seconds` metric is for. This metric is for detecting when rotations are happening much more frequently or less frequently than expected.
Thanks both!
Force-pushed from a5fa85a to 36fac3d.
@@ -0,0 +1,3 @@
kep-number: 541
@annajung - continuing the conversation[0] about updating this KEP for 1.21, I added this `keps/prod-readiness/...` file per your note. Let me know if this is not what we need.
perfect, this is exactly what's needed as noted in the PRR requirement, thanks!
@@ -22,4 +22,4 @@ latest-milestone: "v1.20"
 milestone:
   alpha: "v1.10"
   beta: "v1.11"
-  stable: "v1.21"
+  stable: "v1.22"
@annajung - continuing the conversation[0] about updating this KEP for 1.21, I updated this stable target per your note; let me know if this is not what we need.
This looks good. In addition to this, I would ask you to change the `latest-milestone` value to `v1.21`.
sounds good! thanks for taking the time to review. I gave this a go here: 9a8390f
@ehashman - thank you for your help thus far on this PR. From your perspective, are these metrics "good enough" to move forward with? I'm not trying to push you in any direction, I'm just looking for any more feedback with the hopes of merging the KEP before the 1.21 freeze. :) We would also benefit from any feedback you or someone from sig-instrumentation is willing to give on a couple of specific metrics comments: https://github.com/kubernetes/enhancements/pull/2275/files#r566825710, https://github.com/kubernetes/enhancements/pull/2275/files#r566826217.
@liggitt - how are you feeling about that metrics shape? Think this is "good enough" to move forward with?
@ankeesler I think we need a PRR reviewer. If @logicalhan has a chance to TAL for instrumentation as well, that'd be great. I have more experience with using/consuming the metrics than writing them.
@ehashman - sounds good, thanks for that direction. In the past, @deads2k has reviewed this KEP from the PRR perspective. I was going to ask if he would be willing to give it another pass (especially because we need an approver for the PRR).

@deads2k - would you be willing to review the PRR updates here?

@logicalhan - would love your feedback on this metrics shape if you have time before KEP freeze. :)
minor nits, but on the instrumentation side it looks okay to me.
execPluginCertRotation = k8smetrics.NewHistogram(
	&k8smetrics.HistogramOpts{
		Name: "rest_client_exec_plugin_certificate_rotation_age",
This metric is very oddly named. I prefer `rest_client_exec_plugin_certificate_lifespan_duration` or something.
Yeah, I agree it is a little weird.

What is our API contract with respect to metric names? I think this metric has already been running in clusters since `v1.18.0`: kubernetes/kubernetes@e3b1cd1. Are we allowed to change metric names between releases?
Looks like it came in in kubernetes/kubernetes#84382 (review). I'd prefer we keep the current name unless it is broken.
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_calls",
Counters are conventionally given a `_total` suffix, e.g. `rest_client_exec_plugin_call_total`.
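So, assuming the shape above, the renamed metric would look something like this (label set hypothetical):

```golang
execPluginCalls = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Name: "rest_client_exec_plugin_call_total",
		Help: "Number of calls to the exec credential plugin, partitioned by exit code.",
	},
	[]string{"code"},
)
```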
Sounds good to me. a764f28.
PRR lgtm, but I don't want to tag and accidentally approve a KEP. Get me on slack once the rest is lgtm/approved.
feedback on new metrics from sig-instrumentation is incorporated

/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankeesler, deads2k

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.