Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase waiting time to verify workload completion. #2463

Merged
merged 1 commit into from
Jul 8, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion test/e2e/singlecluster/pod_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -541,7 +541,7 @@ var _ = ginkgo.Describe("Pod groups", func() {
var p corev1.Pod
g.Expect(k8sClient.Get(ctx, replKey, &p)).To(gomega.Succeed())
g.Expect(p.Status.Phase).To(gomega.Equal(corev1.PodSucceeded))
}, util.Timeout, util.Interval).Should(gomega.Succeed())
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering, why it's take so much time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood, time increase in here is just to ensure our testing infra had no "hiccup", because it happened in the past that it was overwhelmed.
Also it doesn't affect the happy path.

Copy link
Contributor

@mbobrovskyi mbobrovskyi Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but why we have "hiccup"? How much time is needed in a successful case?

Copy link
Contributor Author

@mszadkow mszadkow Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked on daily that it is most probably resource exhaustion or something with test infra, but I dug deeper...

Successful takes something around 1-2 seconds
e.g.

 �[1mSTEP:�[0m Verify the replacement pods of the default priority workload complete �[38;5;243m@ 06/21/24 17:12:58.032�[0m
 �[1mSTEP:�[0m Verify the default priority workload is finished �[38;5;243m@ 06/21/24 17:13:00.065�[0m

And all events (of Kueue) came between the eventually expired:

2024-06-21T17:13:00.029650262Z stderr F 2024-06-21T17:13:00.02943541Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished", "prevStatus": "admitted", "clusterQueue": "cq"}
2024-06-21T17:13:00.029797574Z stderr F 2024-06-21T17:13:00.029638912Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}}
2024-06-21T17:13:00.029807114Z stderr F 2024-06-21T17:13:00.029726863Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "kind": "/, Kind="}
2024-06-21T17:13:00.057307688Z stderr F 2024-06-21T17:13:00.057121706Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished", 

However for failed run:
We didn't get the events at all (eventually expired - 15:28:01.191), errors below.
The workload was admitted only after expected time, so no chance to finish as teardown kicked in.

2024-06-17T15:28:01.509816691Z stderr F 2024-06-17T15:28:01.509626579Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.509941372Z stderr F 2024-06-17T15:28:01.509698379Z	LEVEL(-2)	multikueue/workload.go:160	Reconcile Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509961772Z stderr F 2024-06-17T15:28:01.50977886Z	LEVEL(-2)	multikueue/workload.go:185	Skip Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509967423Z stderr F 2024-06-17T15:28:01.509812271Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a6331098-1635-47f2-a35a-718f73eda864", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510008473Z stderr F 2024-06-17T15:28:01.509816931Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510028353Z stderr F 2024-06-17T15:28:01.509859591Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.52873076Z stderr F 2024-06-17T15:28:01.528530457Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "76daee36-48c8-41ea-a75f-e2c88f3ef61d", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.528989573Z stderr F 2024-06-17T15:28:01.528844161Z	LEVEL(-2)	multikueue/workload.go:160	Reconcile Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.528998833Z stderr F 2024-06-17T15:28:01.528854941Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.529002803Z stderr F 2024-06-17T15:28:01.528905302Z	LEVEL(-2)	multikueue/workload.go:185	Skip Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.529200485Z stderr F 2024-06-17T15:28:01.529041383Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529208905Z stderr F 2024-06-17T15:28:01.529069784Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a55e8f9f-87c7-40ee-beab-b0f25a018d86", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529212665Z stderr F 2024-06-17T15:28:01.529100014Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.548056673Z stderr F 2024-06-17T15:28:01.547834571Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.548095684Z stderr F 2024-06-17T15:28:01.547926642Z	LEVEL(-3)	jobframework/reconciler.go:390	update reclaimable counts if implemented by the job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.56416916Z stderr F 2024-06-17T15:28:01.563885097Z	ERROR	jobframework/reconciler.go:400	Updating reclaimable pods	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.56419449Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.56419795Z stderr F 	/workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.56420105Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.56420359Z stderr F 	/workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.56420603Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.56420877Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.56421138Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.5642136Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.564216191Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564218841Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564221141Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564223411Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.564226091Z stderr F 2024-06-17T15:28:01.564014328Z	ERROR	controller/controller.go:329	Reconciler error	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.564228601Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.564230811Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.564255761Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564258331Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564260651Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564262871Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.569613573Z stderr F 2024-06-17T15:28:01.56942685Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.569639643Z stderr F 2024-06-17T15:28:01.569526852Z	LEVEL(-3)	jobframework/reconciler.go:390	update reclaimable counts if implemented by the job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.57802327Z stderr F 2024-06-17T15:28:01.577808728Z	LEVEL(-3)	pod/expectations.go:59	Observed UID	{"pod": {"name":"replacement-for-default-priority-group-1","namespace":"pod-e2e-hm8rn"}, "store": "finalizedPods", "key": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "uid": "97a58d62-5b4b-4cdc-b17f-c6a8d49640b2"}
2024-06-17T15:28:01.587391209Z stderr F 2024-06-17T15:28:01.587178517Z	ERROR	jobframework/reconciler.go:400	Updating reclaimable pods	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58742253Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.58742704Z stderr F 	/workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.58743087Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.58743446Z stderr F 	/workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.58743766Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.58744038Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.5874432Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58744607Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.58744881Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.5874517Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58745479Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874577Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.58746151Z stderr F 2024-06-17T15:28:01.587283298Z	ERROR	controller/controller.go:329	Reconciler error	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58746463Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58746712Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.58746939Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.58747168Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58747391Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874763Z stderr F 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it seem to you @mbobrovskyi that extended time could save us here?

}
})

Expand Down