introduce WatchListLatencyPrometheus measurement #2315
Conversation
Force-pushed from b9e3f4f to c61f100 (compare)
/hold we should wait for kubernetes/kubernetes#120490
You perhaps need to enable the feature gate somewhere, no?
watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_cache_watch_list_duration_seconds{}[%v])) by (group, resource, scope, le))"
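The two placeholders are typically filled in with fmt.Sprintf before the query is sent to Prometheus. A minimal sketch of that substitution, assuming the constant above; the buildQuery helper and the "5m" window are illustrative, not names from the PR:

```go
package main

import "fmt"

// watchListLatencyQuery placeholders: (1) quantile, (2) query window size.
const watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_cache_watch_list_duration_seconds{}[%v])) by (group, resource, scope, le))"

// buildQuery is a hypothetical helper showing how the measurement could
// substitute a quantile and a Prometheus range-vector window into the query.
func buildQuery(quantile float64, window string) string {
	return fmt.Sprintf(watchListLatencyQuery, quantile, window)
}

func main() {
	// The measurement gathers the 50th, 90th, and 99th percentiles.
	for _, q := range []float64{0.50, 0.90, 0.99} {
		fmt.Println(buildQuery(q, "5m"))
	}
}
```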
let's not forget to add the version label here
yes, thanks!
https://github.com/kubernetes/test-infra/blob/a4926a4d0269828418698299e9643274ffd1b49a/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L20-L21 seems like the right place to add the feature gate

I don't think so. I'm planning to use this measurement in the watchlist perf tests (#2316), which already set up the cluster to speak the streaming API.

Isn't #2316 just a way to have the measurement displayed on the perf dashboard? Maybe it is already set up in test-infra, but I would have expected some code there to enable API streaming.

Ah I see, seems like you did the work to set up watchlist already: kubernetes/test-infra#29604
// watchListLatencyGatherer gathers 50th, 90th and 99th duration quantiles
// for watch list requests broken down by group, resource, scope.
Force-pushed from c61f100 to bfc261b (compare)
/hold cancel, this PR is ready for review
watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
For consistency with https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go can you suffix it with Simple? Basically this is a simplified version of the SLO and we should reflect that.
// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
)
Don't we need "_bucket" at the end of the metric name? We're using it here:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L58
I would really prefer consistency between those two.
Actually - I started to wonder more generically - why can't we just reuse that other measurement that I linked?
I think we're effectively reimplementing the exact same logic and the only differences that we have are:
(1) we're using a different metric name
(2) the verb is always LIST
I think it should be possible to slightly refactor that other measurement and simply register two measurements there:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L81C3-L82C1
The ApiResponsivenessGatherer differs in a few places. First of all, it has two different modes for getting the latency metrics (simple and extended). In addition to gathering the latency, it collects two additional metrics: count and countFast. The internal data structures hold all three metrics. Once the metrics are collected, it supports reading a custom threshold from the config, which is used for further validation.

I think that the refactoring would boil down to creating "generic simple latency metrics," which could potentially be reused by both implementations. Given that the internal data structures differ, the existing implementation would have to incorporate the generic latency metric and extend it. Is this what you had in mind?
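The "incorporate and extend" shape described above can be sketched with struct embedding. This is only an illustration of the proposed refactor, not the real clusterloader2 types; all names here are hypothetical:

```go
package main

import "fmt"

// simpleLatencyMetric is a hypothetical "generic simple latency metrics"
// type holding only the duration quantiles, which both the watchlist
// gatherer and the ApiResponsivenessGatherer could reuse.
type simpleLatencyMetric struct {
	Perc50, Perc90, Perc99 float64 // seconds
}

// apiCallMetric sketches how the existing implementation could embed the
// generic latency metric and extend it with its extra counters
// (count and countFast in the discussion above).
type apiCallMetric struct {
	simpleLatencyMetric
	Count     int
	CountFast int
}

func main() {
	m := apiCallMetric{
		simpleLatencyMetric: simpleLatencyMetric{Perc50: 0.1, Perc90: 0.4, Perc99: 1.2},
		Count:               42,
	}
	// Embedded fields are promoted, so code reading m.Perc99 would keep
	// working unchanged after such a refactor.
	fmt.Println(m.Perc99, m.Count)
}
```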
Don't we need "_bucket" at the end of metric name?
Yes, you should use the buckets with histogram_quantile.
@p0lyn0mial - I'm a bit lost in your comment, so let me try to explain a bit deeper what I had in mind:

1. Yes, there are two modes (simple and "normal"), but the difference between these two is only how we're sampling the metrics. To be more specific, this is the only difference between the two modes: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L176-L204
2. From the e2e user's perspective, if I want to list my objects, it doesn't really matter whether the server underneath is using the list method or the new watchlist protocol. I care about the latency of getting the result.
3. Because of (2), we generally don't want to introduce a separate measurement; it should actually be part of exactly the same measurement (same config, same threshold, ...). Although, initially we may want to split that a bit for debuggability reasons.
4. So the way I think about what we should do is effectively:
   - don't introduce new measurements at all - we will just slightly modify the existing one
   - introduce queries that will provide us a way to get samples
   - in the places where we gather them now, also gather the ones from watchlist: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L176-L204 [I'm fine with starting only with the simple query to show how it will work] - for now, just set "verb=watchlist" for those samples
   - add the appropriate count queries too
   - let the existing logic handle all of that as is [just separately for the watchlist verb for now]

Once we prove that, we should actually merge the samples for list & watchlist together, but let's do that as a follow-up and just start by treating them as separate things.
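The plan above - gathering watchlist samples in the same place as the existing ones and tagging them with a synthetic verb - can be sketched roughly as below. All names here are illustrative stand-ins, not the real clusterloader2 API:

```go
package main

import "fmt"

// sample is a stand-in for the latency samples the measurement gathers;
// only the fields relevant to the sketch are shown.
type sample struct {
	verb     string
	resource string
	perc99   float64 // seconds
}

// gatherLatencySamples sketches the proposed change: where the existing
// measurement gathers its list/CRUD samples, also run the watchlist
// queries and tag those samples with verb="watchlist", so the existing
// threshold and validation logic handles them unchanged.
func gatherLatencySamples(listSamples []sample) []sample {
	watchListSamples := []sample{
		// In real code these would come from executing the watchlist
		// PromQL queries; hard-coded here for illustration only.
		{verb: "watchlist", resource: "pods", perc99: 0.8},
	}
	return append(listSamples, watchListSamples...)
}

func main() {
	all := gatherLatencySamples([]sample{{verb: "LIST", resource: "pods", perc99: 1.1}})
	for _, s := range all {
		fmt.Println(s.verb, s.resource, s.perc99)
	}
}
```

Merging the list and watchlist samples into one series, as suggested, would then be a small follow-up change inside the same function.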
OK, I think I understand: the new metric (watchlist) will end up being reported as part of LoadResponsiveness_PrometheusSimple for all jobs! I like it. Thanks.
created #2764
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@p0lyn0mial: Reopened this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull request has been approved by: p0lyn0mial. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/reopen

@p0lyn0mial: Reopened this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The new PR is much better. Closing in favor of #2764.
What type of PR is this?

/kind feature

What this PR does / why we need it:

The WatchListLatencyPrometheus measurement gathers 50th, 90th and 99th duration quantiles for watch list requests broken down by group, resource, scope. The new metric (kubernetes/kubernetes#120490) allows for comparing watch-list requests with standard list requests and for measuring the performance of the new requests in general.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

xref: kubernetes/enhancements#3157