introduce WatchListLatencyPrometheus measurement #2315
Conversation
Force-pushed from b9e3f4f to c61f100 (compare)
/hold we should wait for kubernetes/kubernetes#120490
You perhaps need to enable the feature gate somewhere, no?
watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_cache_watch_list_duration_seconds{}[%v])) by (group, resource, scope, le))"
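The two placeholders are typically filled in with fmt.Sprintf before the query is sent to Prometheus. A minimal sketch of that substitution, assuming the constant above; the buildQuery helper and the "5m" window are illustrative, not names from the PR:

```go
package main

import "fmt"

// watchListLatencyQuery placeholders: (1) quantile, (2) query window size.
const watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_cache_watch_list_duration_seconds{}[%v])) by (group, resource, scope, le))"

// buildQuery is a hypothetical helper showing how the measurement could
// substitute a quantile and a Prometheus range-vector window into the query.
func buildQuery(quantile float64, window string) string {
	return fmt.Sprintf(watchListLatencyQuery, quantile, window)
}

func main() {
	// The measurement gathers the 50th, 90th, and 99th percentiles.
	for _, q := range []float64{0.50, 0.90, 0.99} {
		fmt.Println(buildQuery(q, "5m"))
	}
}
```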
let's not forget to add the version label here
yes, thanks!
https://github.com/kubernetes/test-infra/blob/a4926a4d0269828418698299e9643274ffd1b49a/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L20-L21 seems like the right place to add the feature gate

I don't think so. I'm planning to use this measurement in the watchlist perf tests (#2316), which already set up the cluster to speak the streaming API.

Isn't #2316 just a way to have the measurement displayed on the perf dashboard? Maybe it is already set up in test-infra, but I would have expected some code there to enable API streaming.

Ah I see, seems like you did the work to set up watchlist already: kubernetes/test-infra#29604
// watchListLatencyGatherer gathers 50th, 90th and 99th duration quantiles
// for watch list requests broken down by group, resource, scope.
Force-pushed from c61f100 to bfc261b (compare)
/hold cancel, this PR is ready for review
watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
For consistency with https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go can you suffix it with Simple? Basically this is a simplified version of the SLO and we should reflect that.
// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
)
Don't we need "_bucket" at the end of the metric name? We're using it here:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L58
I would really prefer consistency between those two.
Actually - I started to wonder more generically - why can't we just reuse that other measurement that I linked?
I think we're effectively reimplementing the exact same logic and the only differences that we have are:
(1) we're using a different metric name
(2) the verb is always LIST
I think it should be possible to slightly refactor that other measurement and simply register two measurements there:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L81C3-L82C1
The ApiResponsivenessGatherer differs in a few places. First of all, it has two different modes for getting the latency metrics (simple and extended). In addition to gathering the latency, it collects two additional metrics: count and countFast. The internal data structures hold all three metrics. Once the metrics are collected, it supports reading a custom threshold from the config, which is used for further validation.

I think that the refactoring would boil down to creating "generic simple latency metrics," which could potentially be reused by both implementations. Given that the internal data structures differ, the existing implementation would have to incorporate the generic latency metric and extend it. Is this what you had in mind?
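The "incorporate and extend" shape described above can be sketched with struct embedding. This is only an illustration of the proposed refactor, not the real clusterloader2 types; all names here are hypothetical:

```go
package main

import "fmt"

// simpleLatencyMetric is a hypothetical "generic simple latency metrics"
// type holding only the duration quantiles, which both the watchlist
// gatherer and the ApiResponsivenessGatherer could reuse.
type simpleLatencyMetric struct {
	Perc50, Perc90, Perc99 float64 // seconds
}

// apiCallMetric sketches how the existing implementation could embed the
// generic latency metric and extend it with its extra counters
// (count and countFast in the discussion above).
type apiCallMetric struct {
	simpleLatencyMetric
	Count     int
	CountFast int
}

func main() {
	m := apiCallMetric{
		simpleLatencyMetric: simpleLatencyMetric{Perc50: 0.1, Perc90: 0.4, Perc99: 1.2},
		Count:               42,
	}
	// Embedded fields are promoted, so code reading m.Perc99 would keep
	// working unchanged after such a refactor.
	fmt.Println(m.Perc99, m.Count)
}
```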
Don't we need "_bucket" at the end of metric name?
Yes, you should use the buckets with histogram_quantile.
@p0lyn0mial - I'm a bit lost in your comment, so let me try to explain a bit deeper what I had in mind:

1. Yes, there are two modes (simple and "normal"), but the difference between these two is only how we're sampling the metrics. To be more specific, this is the only difference between the two modes: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L176-L204
2. From the e2e user's perspective, if I want to list my objects, it doesn't really matter whether the server underneath is using the list method or the new watchlist protocol. I care about the latency of getting the result.
3. Because of (2), we generally don't want to introduce a separate measurement; it should actually be part of exactly the same measurement (same config, same threshold, ...). Although, initially we may want to split that a bit for debuggability reasons.
4. So the way I think about what we should do is effectively:
   - don't introduce new measurements at all - we will just slightly modify the existing one
   - introduce queries that will provide us a way to get samples
   - in the places where we gather them now, also gather the ones from watchlist: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L176-L204 [I'm fine with starting only with the simple query to show how it will work] - for now, just set "verb=watchlist" for those samples
   - add the appropriate count queries too
   - let the existing logic handle all of that as is [just separately for the watchlist verb for now]

Once we prove that, we should actually merge the samples for list & watchlist together, but let's do that as a follow-up and just start by treating them as separate things.
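The plan above - gathering watchlist samples in the same place as the existing ones and tagging them with a synthetic verb - can be sketched roughly as below. All names here are illustrative stand-ins, not the real clusterloader2 API:

```go
package main

import "fmt"

// sample is a stand-in for the latency samples the measurement gathers;
// only the fields relevant to the sketch are shown.
type sample struct {
	verb     string
	resource string
	perc99   float64 // seconds
}

// gatherLatencySamples sketches the proposed change: where the existing
// measurement gathers its list/CRUD samples, also run the watchlist
// queries and tag those samples with verb="watchlist", so the existing
// threshold and validation logic handles them unchanged.
func gatherLatencySamples(listSamples []sample) []sample {
	watchListSamples := []sample{
		// In real code these would come from executing the watchlist
		// PromQL queries; hard-coded here for illustration only.
		{verb: "watchlist", resource: "pods", perc99: 0.8},
	}
	return append(listSamples, watchListSamples...)
}

func main() {
	all := gatherLatencySamples([]sample{{verb: "LIST", resource: "pods", perc99: 1.1}})
	for _, s := range all {
		fmt.Println(s.verb, s.resource, s.perc99)
	}
}
```

Merging the list and watchlist samples into one series, as suggested, would then be a small follow-up change inside the same function.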
OK, I think I understand: the new metric (watchlist) will end up being reported as part of LoadResponsiveness_PrometheusSimple for all jobs! I like it. Thanks.
created #2764
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@p0lyn0mial: Reopened this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull request has been approved by: p0lyn0mial. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/reopen

@p0lyn0mial: Reopened this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The new PR is much better. Closing in favor of #2764.
What type of PR is this?

/kind feature

What this PR does / why we need it:

The WatchListLatencyPrometheus measurement gathers 50th, 90th and 99th duration quantiles for watch list requests broken down by group, resource, scope. The new metric (kubernetes/kubernetes#120490) allows for comparing watch-list requests with standard list requests and for measuring the performance of the new requests in general.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

xref: kubernetes/enhancements#3157