[KEP-3022] Write the production readiness requirements to graduate to beta #3338

sanposhiho · 2022-06-05T09:13:45Z

One-line PR description: Write the production readiness requirements to graduate to beta

Issue link: Min domains in PodTopologySpread #3022

Other comments:

sanposhiho · 2022-06-05T09:16:29Z

/cc @alculquicondor @Huang-Wei @wojtek-t

wojtek-t · 2022-06-06T05:53:50Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->

+It shouldn't impact already running workloads. It's an opt-in feature,


Will review tomorrow - in the meantime, if you're targeting beta, please update the kep.yaml and corresponding prr file.

Sure, will update.

There are new requirements for the Test Plan. Please check the updated template.

alculquicondor · 2022-06-06T18:10:56Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

+A spike on metric `schedule_attempts_total{result="error|unschedulable"}` when pods using this feature are added.


I would also include a spike on scheduling_algorithm_duration_seconds, which would suggest that the feature is too slow.

👍
will add a mention of plugin_execution_duration_seconds as well.
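
The rollback signal discussed in this thread can be turned into a rough check. The following is a minimal sketch, not part of the KEP: the helper names, the `5m` window, and the 2x spike factor are all assumptions made for illustration. The metric name is copied from the thread as written; the name exported by a real cluster may differ. Note that a PromQL matcher over several `result` values needs the regex operator `=~` rather than `=`.

```python
def failed_attempts_query(window: str = "5m") -> str:
    """Build a PromQL expression for the rate of failed scheduling attempts.

    The metric name is taken verbatim from the review thread (assumption:
    the exported name in a given cluster may carry a different prefix).
    """
    return 'sum(rate(schedule_attempts_total{result=~"error|unschedulable"}[%s]))' % window


def is_spike(baseline: float, current: float, factor: float = 2.0) -> bool:
    """Return True when `current` is at least `factor` times `baseline`.

    `factor` is a hypothetical threshold chosen only for this example.
    """
    if baseline <= 0:
        # No baseline failures: any failure rate at all counts as a spike.
        return current > 0
    return current / baseline >= factor
```

An operator would evaluate the query before and after pods using the feature are added, then pass the two values to `is_spike`.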

alculquicondor · 2022-06-06T18:11:23Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+Yes. The behavior is changed as expected.


Describe the manual testing that was done :)

I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.

So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.

Okay, I'll do another manual test.

alculquicondor · 2022-06-06T18:15:41Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->

+The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set.


I wonder if we could extend kubernetes/kubernetes#107556 to produce a metric.

But I wouldn't block beta graduation on this.

We definitely need kubernetes/kubernetes#107556...
I have not had enough time to work on this, but will make it a priority...

> to produce a metric.

So, what kind of metric do you have in mind?

can you create an issue so we can discuss there?

But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.

Sure, will create that.

Opened: kubernetes/kubernetes#110643
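
Until such a metric exists, the query proposed in the diff above (pods with `pod.spec.topologySpreadConstraints.minDomains` set) can be sketched as a filter over pod objects. This is an illustrative sketch only: the function name and the sample data are hypothetical, and the input is assumed to be the `items` list from `kubectl get pods --all-namespaces -o json`.

```python
from typing import Any, Dict, List


def pods_using_min_domains(pods: List[Dict[str, Any]]) -> List[str]:
    """Return "namespace/name" for each pod with minDomains set on any constraint."""
    matches = []
    for pod in pods:
        # The field may be absent or null on pods without spread constraints.
        constraints = pod.get("spec", {}).get("topologySpreadConstraints") or []
        if any(c.get("minDomains") is not None for c in constraints):
            meta = pod.get("metadata", {})
            matches.append(f'{meta.get("namespace", "default")}/{meta.get("name", "")}')
    return matches


# Hypothetical sample data, shaped like the Pod API objects.
sample_pods = [
    {
        "metadata": {"namespace": "web", "name": "a"},
        "spec": {
            "topologySpreadConstraints": [
                {
                    "maxSkew": 1,
                    "topologyKey": "topology.kubernetes.io/zone",
                    "whenUnsatisfiable": "DoNotSchedule",
                    "minDomains": 3,
                }
            ]
        },
    },
    {
        "metadata": {"namespace": "web", "name": "b"},
        "spec": {
            "topologySpreadConstraints": [
                {
                    "maxSkew": 1,
                    "topologyKey": "kubernetes.io/hostname",
                    "whenUnsatisfiable": "ScheduleAnyway",
                }
            ]
        },
    },
    {"metadata": {"namespace": "db", "name": "c"}, "spec": {}},
]

print(pods_using_min_domains(sample_pods))
```

Only the first sample pod sets `minDomains`, so it is the only one reported.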

alculquicondor · 2022-06-06T18:17:56Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -363,18 +378,19 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->

+- 99% of `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` are within x milliseconds.


you have to provide an actual value instead of x
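
Once a concrete threshold is chosen, an SLO of the form "99% of observations are within a threshold" can be checked directly from cumulative histogram buckets. The sketch below assumes Prometheus-style buckets (each `le` bound maps to a cumulative count). The function names, the bucket bounds, and the counts are hypothetical and are not the values the KEP settled on.

```python
import math
from typing import Dict


def fraction_within(buckets: Dict[float, int], threshold_s: float) -> float:
    """Fraction of observations at or below `threshold_s` seconds.

    `buckets` maps each upper bound (`le`) to its cumulative count and must
    contain math.inf (the total). `threshold_s` must be one of the bounds.
    """
    total = buckets[math.inf]
    if total == 0:
        return 1.0  # no observations, so nothing violated the target
    return buckets[threshold_s] / total


def meets_slo(buckets: Dict[float, int], threshold_s: float, target: float = 0.99) -> bool:
    """True when at least `target` of observations fall within the threshold."""
    return fraction_within(buckets, threshold_s) >= target


# Hypothetical cumulative counts for plugin_execution_duration_seconds.
sample_buckets = {0.001: 950, 0.01: 995, math.inf: 1000}
```

With these sample counts, 99.5% of observations fall within 10 ms and 95% within 1 ms, so the target is met for the first threshold but not the second.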

alculquicondor · 2022-06-06T18:18:46Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -383,6 +399,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->

+No.


Yes, as described in the "How can an operator determine if the feature is in use by workloads?" question.

alculquicondor · 2022-06-06T18:19:09Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -466,8 +486,13 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd


Suggested change

-The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd
+The feature isn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd

alculquicondor · 2022-06-06T18:20:53Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -483,6 +508,9 @@ For each of them, fill in the following information by copying the below templat

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+- Check `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` to see if latency increased. 
+- Check `schedule_attempts_total{result="error|unschedulable"}` to see if the number of attempts increased.


What should I do if I see problems in either of the metrics?

sanposhiho · 2022-06-14T13:04:00Z

@alculquicondor @wojtek-t
Updated. Please take another look 🙏

alculquicondor · 2022-06-14T14:53:33Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/kep.yaml

-stage: alpha
-latest-milestone: "v1.24"
+stage: beta
+latest-milestone: "v1.25"
 milestone:
  alpha: "v1.24"
  beta: "v1.25"


can you remove the stable line?

We don't know yet if it will be done in 1.26 or not, but most likely 1.27+ (after 2 beta releases).

alculquicondor · 2022-06-14T14:54:15Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->

+It shouldn't impact already running workloads. It's an opt-in feature,


There are new requirements for the Test Plan. Please check the updated template.

alculquicondor · 2022-06-14T15:00:08Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+Yes. The behavior is changed as expected.


I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.

So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.

alculquicondor · 2022-06-14T15:01:31Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->

+The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set.


can you create an issue so we can discuss there?

But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.

sanposhiho · 2022-06-17T14:30:56Z

@alculquicondor
Thanks for reviewing. I updated the PR per your suggestions
(except #3338 (comment)).

Please take another look. 🙏

alculquicondor · 2022-06-17T15:06:58Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

+extending the production code to implement this enhancement.
+-->
+
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2020-06-17` - `86%`


also mention the pod validation packages

alculquicondor · 2022-06-17T15:08:05Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

+We expect no non-infra related flakes in the last month as a GA graduation criteria.
+-->
+
+N/A


explain why:

A possible explanation is that there are no new API endpoints and that the feature doesn't interact with other components, so E2E tests don't add extra value over integration tests.

alculquicondor · 2022-06-17T15:09:59Z

keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md

@@ -383,6 +469,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->

+Yes. It would be useful if we could see more details related to scheduler's decisions in metrics.


Link to the issue.

Maybe be more specific about the decisions. In this case, we would like to know which Filters had an impact on the scheduling of the pod.

sanposhiho · 2022-06-17T16:06:39Z

@alculquicondor

Applied your new suggestions.

alculquicondor

/lgtm
/approve

sanposhiho · 2022-06-17T18:47:59Z

@alculquicondor

I just updated the upgrade/rollback manual test section.
#3338 (comment)

sanposhiho · 2022-06-17T18:48:03Z

@wojtek-t Could you take a look at this PR as well? 🙏

johnbelamaric

I'm satisfied with the PRR answers, awaiting SIG approval.

johnbelamaric · 2022-06-21T00:41:25Z

keps/prod-readiness/sig-scheduling/3022.yaml

@@ -4,3 +4,5 @@
 kep-number: 3022
 alpha:
  approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"


Wojtek is out, please change to me.

Sure, thanks for taking over! 🙏

johnbelamaric · 2022-06-21T00:50:56Z

I'm satisfied with the PRR answers, awaiting SIG approval.

Oh, I see SIG approval. Just change the PRR approver to me, so if all hell breaks loose Wojtek doesn't get blamed :-P

sanposhiho · 2022-06-21T11:51:47Z

@johnbelamaric Updated.

johnbelamaric · 2022-06-21T21:23:41Z

/approve

k8s-ci-robot · 2022-06-21T21:24:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, johnbelamaric, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
~~keps/sig-scheduling/OWNERS~~ [alculquicondor,johnbelamaric]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sanposhiho · 2022-06-22T01:31:27Z

@johnbelamaric @alculquicondor
Would you mind adding /lgtm? It's needed for merging.

alculquicondor · 2022-06-22T15:10:02Z

/lgtm

Thanks!

k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels Jun 5, 2022

k8s-ci-robot requested a review from ahg-g June 5, 2022 09:14

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jun 5, 2022

k8s-ci-robot requested a review from Huang-Wei June 5, 2022 09:14

k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jun 5, 2022

sanposhiho force-pushed the beta branch from 2f12d95 to 419bd56 Compare June 5, 2022 09:16

k8s-ci-robot requested review from alculquicondor and wojtek-t June 5, 2022 09:16

wojtek-t self-assigned this Jun 6, 2022

wojtek-t reviewed Jun 6, 2022

alculquicondor reviewed Jun 6, 2022

alculquicondor reviewed Jun 14, 2022

jasonbraganza mentioned this pull request Jun 16, 2022

Min domains in PodTopologySpread #3022

Closed

12 tasks

sanposhiho added 9 commits June 17, 2022 19:42

Add the production readiness requirements to graduate to beta

9a20a85

update to beta

ad9b01c

describe the scenario for update/rollback test

c2e1a16

add metrics for latency

7e6a2ed

set actual value for SLOs

74166ed

add mention for useful metrics that allows us to see more detail scheduling result

266197d

add step to solve

817fdb2

fix typo

603325a

address review

1a9786c

sanposhiho force-pushed the beta branch from ef835f7 to 1a9786c Compare June 17, 2022 14:11

k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label and removed the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Jun 17, 2022

update toc

ede06eb

alculquicondor reviewed Jun 17, 2022

sanposhiho mentioned this pull request Jun 17, 2022

metrics to see which plugins affect to scheduler's decisions in Filter/Score phase kubernetes/kubernetes#110643

Closed

sanposhiho added 3 commits June 18, 2022 01:04

write more specific about metrics

a56d6cf

write why we don't need e2e

5343564

add another packages tests

a0f4072

alculquicondor reviewed Jun 17, 2022

k8s-ci-robot assigned alculquicondor Jun 17, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2022

Update the manual test for update/rollback

763ae6d

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2022

sanposhiho mentioned this pull request Jun 20, 2022

Graduate MinDomains in Pod Topology Spread to beta kubernetes/kubernetes#110388

Merged

johnbelamaric reviewed Jun 21, 2022

Change PRR approver to johnbelamaric

2c50304

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 22, 2022

k8s-ci-robot merged commit 9b475dc into kubernetes:master Jun 22, 2022

k8s-ci-robot added this to the v1.25 milestone Jun 22, 2022
