Add podGroup completed phase #2667

waiterQ · 2023-02-07T09:11:39Z

Modification Motivation

Volcano can schedule normal pods, but under some conditions the podGroup ends up in the wrong phase. When scheduling a k8s Job, the podGroup is still in the Running phase after its pods have reached the Completed phase; changing a Deployment's request leaves the old podGroup in the Inqueue phase.
So I picked up and fixed #2589 (Feature/add replicaset gc pg), and added a Completed phase for the podGroups of normal pods to enhance podGroup handling.

Test Result

During a Deployment rolling update, two podgroups exist.

When the Deployment rolling update is over, one podgroup is left.

When the k8s Job is completed:

The podgroup is Completed.

CI e2e result

Signed-off-by: Gaizhi <donghouze@minimac.com>

Signed-off-by: shaoqiu <516595344@qq.com>

jiangkaihua · 2023-02-09T11:56:35Z

pkg/controllers/podgroup/pg_controller_handler.go

@@ -61,6 +63,26 @@ func (pg *pgcontroller) addPod(obj interface{}) {
 	pg.queue.Add(req)
 }

+func (pg *pgcontroller) addReplicaSet(obj interface{}) {


@waiterQ As far as I know, a replicaset is always created with 0 replicas, and after creation it is scaled up to the defined replica count. So why delete the podgroup in both addReplicaSet and updateReplicaSet, rather than only in updateReplicaSet?

waiterQ · Feb 9, 2023

Yes, you're right. In the normal flow within a single version, Volcano only needs updateReplicaSet. But when upgrading from one version to another, without addReplicaSet nothing would clean up the stock podgroups already in the cluster. addReplicaSet works on already-existing podgroups; updateReplicaSet works on upcoming podgroups.
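The division of labour described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `replicaSet`, `podGroups`, `gcIfScaledToZero`, `onAdd` and `onUpdate` are hypothetical stand-ins, the in-memory map replaces the real PodGroup API, and the nil guard on `Replicas` is an addition of the sketch. It assumes the handlers are registered on an informer, so that ReplicaSets already present when the controller starts arrive as add events.

```go
package main

import "fmt"

// replicaSet is a hypothetical, stripped-down stand-in for appsv1.ReplicaSet.
type replicaSet struct {
	UID      string
	Replicas *int32 // mirrors rs.Spec.Replicas
}

// podGroups is a hypothetical in-memory stand-in for the PodGroup API,
// mapping the UID of the owning ReplicaSet to the PodGroup name.
type podGroups map[string]string

// gcIfScaledToZero deletes the PodGroup owned by rs once rs has zero replicas.
// It reports whether a PodGroup was deleted.
func gcIfScaledToZero(pgs podGroups, rs replicaSet) bool {
	// The nil guard is an addition of this sketch.
	if rs.Replicas == nil || *rs.Replicas != 0 {
		return false
	}
	if _, ok := pgs[rs.UID]; !ok {
		return false
	}
	delete(pgs, rs.UID)
	return true
}

// onAdd models the add handler: it covers ReplicaSets that already exist
// (for example, old revisions left at zero replicas before an upgrade).
func onAdd(pgs podGroups, rs replicaSet) bool {
	return gcIfScaledToZero(pgs, rs)
}

// onUpdate models the update handler: it covers ReplicaSets that are
// scaled down to zero while the controller is running.
func onUpdate(pgs podGroups, _, newRS replicaSet) bool {
	return gcIfScaledToZero(pgs, newRS)
}

func int32Ptr(v int32) *int32 { return &v }

func main() {
	pgs := podGroups{"rs-old": "pg-old", "rs-new": "pg-new"}

	// Controller start: the old, already scaled-down ReplicaSet is seen as an add.
	fmt.Println(onAdd(pgs, replicaSet{UID: "rs-old", Replicas: int32Ptr(0)}))

	// A live ReplicaSet scaling from 2 to 3 replicas keeps its PodGroup.
	fmt.Println(onUpdate(pgs,
		replicaSet{UID: "rs-new", Replicas: int32Ptr(2)},
		replicaSet{UID: "rs-new", Replicas: int32Ptr(3)}))

	fmt.Println(len(pgs))
}
```

Routing both handlers through one check keeps the cleanup rule in a single place; only the trigger differs.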

elinx · 2023-02-10T06:19:15Z

pkg/controllers/podgroup/pg_controller_handler.go

+		return
+	}
+
+	if *rs.Spec.Replicas == 0 {


There is a chance that two replicasets have non-zero replicas during a rolling upgrade, which means two podgroups exist. Does this matter?

waiterQ · Feb 10, 2023

I think this comes from the Deployment's rollingUpdate strategy: while pods are being rolled, two kinds of pods definitely exist. I think that's normal, not a problem.

william-wang · 2023-02-11T02:23:20Z

Please add the test results on the PR, thanks.

waiterQ · 2023-02-13T02:07:54Z

> Please add the test results on the PR, thanks.

ok, done.

william-wang

/lgtm

william-wang · 2023-02-11T02:13:47Z

test/e2e/util/deployment.go

@@ -0,0 +1,189 @@
+/*
+Copyright 2021 The Volcano Authors.


The copyright is not correct.

william-wang · 2023-02-11T02:18:17Z

test/e2e/schedulingbase/job_scheduling.go

+		Expect(len(pgs.Items)).To(Equal(1), "only one podGroup should be exists")
+	})
+
+	It("k8s Job", func() {


Please use a formal and complete description for the func.

william-wang · 2023-02-11T02:19:44Z

test/e2e/schedulingbase/job_scheduling.go

@@ -650,4 +650,62 @@ var _ = Describe("Job E2E Test", func() {
 		Expect(q2ScheduledPod).Should(BeNumerically("<=", expectPod/2+1),
 			fmt.Sprintf("expectPod %d, q1ScheduledPod %d, q2ScheduledPod %d", expectPod, q1ScheduledPod, q2ScheduledPod))
 	})
+
+	It("changeable Deployment's PodGroup", func() {


Please give a formal and complete description for the func.

william-wang · 2023-02-11T02:21:21Z

test/e2e/util/deployment.go

@@ -0,0 +1,189 @@
+/*


This is a util package; why name the file deployment.go?

volcano-sh-bot · 2023-02-13T02:13:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [william-wang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zhoushuke · 2023-12-05T11:01:44Z

pkg/scheduler/framework/session.go

@@ -272,6 +272,10 @@ func jobStatus(ssn *Session, jobInfo *api.JobInfo) scheduling.PodGroupStatus {
 		// If there're enough allocated resource, it's running
 		if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
 			status.Phase = scheduling.PodGroupRunning
+			// If all allocated tasks is succeeded, it's completed
+			if len(jobInfo.TaskStatusIndex[api.Succeeded]) == allocated {


For a native batchv1 Job that uses .spec.completions and .spec.parallelism: suppose 10 pods have succeeded and, at the same time, the queue is full, so the other 10 pods are pending. Then len(jobInfo.TaskStatusIndex[api.Succeeded]) == allocated will be true, and the pg status becomes Completed although the Job has not finished. Could this happen?
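The scenario in this question can be reproduced in isolation with a small sketch. It is not the scheduler's code: `taskCounts` and `phase` are hypothetical, the phase names are plain strings, and the sketch assumes that `allocated` counts succeeded tasks plus tasks currently holding resources, while pending tasks are not counted. The `minMember` value of 1 is also an assumption. Whether the real scheduler behaves this way depends on how `allocated` is actually computed.

```go
package main

import "fmt"

// taskCounts is a hypothetical summary of jobInfo.TaskStatusIndex.
type taskCounts struct {
	Succeeded int // tasks that finished successfully
	Running   int // tasks holding resources (assumed to count as allocated)
	Pending   int // tasks waiting for resources (assumed NOT to count as allocated)
}

// phase mirrors only the branch shown in the diff above.
// "NotRunning" is a simplification for every phase below MinMember.
func phase(c taskCounts, minMember int32) string {
	allocated := c.Succeeded + c.Running // assumption, see lead-in
	if int32(allocated) < minMember {
		return "NotRunning"
	}
	// The check added by the PR: every allocated task has succeeded.
	if c.Succeeded == allocated {
		return "Completed"
	}
	return "Running"
}

func main() {
	// completions=20, parallelism=10: first batch done, second batch stuck pending.
	fmt.Println(phase(taskCounts{Succeeded: 10, Running: 0, Pending: 10}, 1))

	// Second batch partially scheduled: the check no longer matches.
	fmt.Println(phase(taskCounts{Succeeded: 10, Running: 5, Pending: 5}, 1))

	// All 20 pods succeeded.
	fmt.Println(phase(taskCounts{Succeeded: 20, Running: 0, Pending: 0}, 1))
}
```

Under these assumptions the first case reports Completed while 10 pods are still pending, which is the situation the question describes; a check that also required zero pending tasks would distinguish it from the third case.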

volcano-sh-bot requested review from hwdef, merryzhou and Thor-wl February 7, 2023 09:11

volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 7, 2023

waiterQ force-pushed the add-pg-completed branch 2 times, most recently from 9d31199 to 3a6c9e4 Compare February 7, 2023 09:19

waiterQ changed the title ~~Add pg completed~~ Add podGroup completed phase Feb 7, 2023

Gaizhi and others added 2 commits February 9, 2023 11:18

feature: add replicaset gc for podgroup

7a41c81

Signed-off-by: Gaizhi <donghouze@minimac.com>

add podGroup phase completed;

cf17b60

Signed-off-by: shaoqiu <516595344@qq.com>

waiterQ force-pushed the add-pg-completed branch from 3a6c9e4 to cf17b60 Compare February 9, 2023 03:19

jiangkaihua reviewed Feb 9, 2023

elinx reviewed Feb 10, 2023

william-wang approved these changes Feb 13, 2023

volcano-sh-bot assigned william-wang Feb 13, 2023

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 13, 2023

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2023

volcano-sh-bot merged commit 0254501 into volcano-sh:master Feb 13, 2023

wangyang0616 mentioned this pull request Mar 24, 2023

Add Event Handler for RS to GC Podgroup. New Solution to Fix Issue 2143 #2585

Closed

waiterQ deleted the add-pg-completed branch March 24, 2023 07:40

zhoushuke reviewed Dec 5, 2023

Monokaix mentioned this pull request Jan 18, 2024

Will PodGroup ownerReference enhancement be implemented? #3299

Open

bood mentioned this pull request Nov 5, 2024

Podgroup state changed from running to inqueue after pod deleted #2208

Closed
