-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase waiting time to verify workload completion. #2463
Conversation
The time could be increased due to the fact that is out of Kueue's control. As the time is controlled by the Kubernetes.
Hi @mszadkow. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
/ok-to-test |
@@ -541,7 +541,7 @@ var _ = ginkgo.Describe("Pod groups", func() { | |||
var p corev1.Pod | |||
g.Expect(k8sClient.Get(ctx, replKey, &p)).To(gomega.Succeed()) | |||
g.Expect(p.Status.Phase).To(gomega.Equal(corev1.PodSucceeded)) | |||
}, util.Timeout, util.Interval).Should(gomega.Succeed()) | |||
}, util.LongTimeout, util.Interval).Should(gomega.Succeed()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering, why it's take so much time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understood, time increase in here is just to ensure our testing infra had no "hiccup", because it happened in the past that it was overwhelmed.
Also it doesn't affect the happy path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but why we have "hiccup"? How much time is needed in a successful case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked on daily that it is most probably resource exhaustion or something with test infra, but I dug deeper...
Successful takes something around 1-2 seconds
e.g.
�[1mSTEP:�[0m Verify the replacement pods of the default priority workload complete �[38;5;243m@ 06/21/24 17:12:58.032�[0m
�[1mSTEP:�[0m Verify the default priority workload is finished �[38;5;243m@ 06/21/24 17:13:00.065�[0m
And all events (of Kueue) came between the eventually expired:
2024-06-21T17:13:00.029650262Z stderr F 2024-06-21T17:13:00.02943541Z LEVEL(-2) workload-reconciler core/workload_controller.go:649 Workload update event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished", "prevStatus": "admitted", "clusterQueue": "cq"}
2024-06-21T17:13:00.029797574Z stderr F 2024-06-21T17:13:00.029638912Z LEVEL(-3) localqueue-reconciler core/localqueue_controller.go:199 Got Workload event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}}
2024-06-21T17:13:00.029807114Z stderr F 2024-06-21T17:13:00.029726863Z LEVEL(-2) cluster-queue-reconciler core/clusterqueue_controller.go:330 Got generic event {"obj": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "kind": "/, Kind="}
2024-06-21T17:13:00.057307688Z stderr F 2024-06-21T17:13:00.057121706Z LEVEL(-2) workload-reconciler core/workload_controller.go:649 Workload update event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished",
However for failed run:
We didn't get the events at all (eventually expired - 15:28:01.191
), errors below.
The workload was admitted only after expected time, so no chance to finish as teardown kicked in.
2024-06-17T15:28:01.509816691Z stderr F 2024-06-17T15:28:01.509626579Z LEVEL(-2) workload-reconciler core/workload_controller.go:649 Workload update event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.509941372Z stderr F 2024-06-17T15:28:01.509698379Z LEVEL(-2) multikueue/workload.go:160 Reconcile Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509961772Z stderr F 2024-06-17T15:28:01.50977886Z LEVEL(-2) multikueue/workload.go:185 Skip Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509967423Z stderr F 2024-06-17T15:28:01.509812271Z LEVEL(-2) core/workload_controller.go:144 Reconciling Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a6331098-1635-47f2-a35a-718f73eda864", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510008473Z stderr F 2024-06-17T15:28:01.509816931Z LEVEL(-3) localqueue-reconciler core/localqueue_controller.go:199 Got Workload event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510028353Z stderr F 2024-06-17T15:28:01.509859591Z LEVEL(-2) cluster-queue-reconciler core/clusterqueue_controller.go:330 Got generic event {"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.52873076Z stderr F 2024-06-17T15:28:01.528530457Z LEVEL(-2) jobframework/reconciler.go:308 Reconciling Job {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "76daee36-48c8-41ea-a75f-e2c88f3ef61d", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.528989573Z stderr F 2024-06-17T15:28:01.528844161Z LEVEL(-2) multikueue/workload.go:160 Reconcile Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.528998833Z stderr F 2024-06-17T15:28:01.528854941Z LEVEL(-2) workload-reconciler core/workload_controller.go:649 Workload update event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.529002803Z stderr F 2024-06-17T15:28:01.528905302Z LEVEL(-2) multikueue/workload.go:185 Skip Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.529200485Z stderr F 2024-06-17T15:28:01.529041383Z LEVEL(-3) localqueue-reconciler core/localqueue_controller.go:199 Got Workload event {"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529208905Z stderr F 2024-06-17T15:28:01.529069784Z LEVEL(-2) core/workload_controller.go:144 Reconciling Workload {"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a55e8f9f-87c7-40ee-beab-b0f25a018d86", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529212665Z stderr F 2024-06-17T15:28:01.529100014Z LEVEL(-2) cluster-queue-reconciler core/clusterqueue_controller.go:330 Got generic event {"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.548056673Z stderr F 2024-06-17T15:28:01.547834571Z LEVEL(-2) jobframework/reconciler.go:308 Reconciling Job {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.548095684Z stderr F 2024-06-17T15:28:01.547926642Z LEVEL(-3) jobframework/reconciler.go:390 update reclaimable counts if implemented by the job {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.56416916Z stderr F 2024-06-17T15:28:01.563885097Z ERROR jobframework/reconciler.go:400 Updating reclaimable pods {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.56419449Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.56419795Z stderr F /workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.56420105Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.56420359Z stderr F /workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.56420603Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.56420877Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.56421138Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.5642136Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.564216191Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564218841Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564221141Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564223411Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.564226091Z stderr F 2024-06-17T15:28:01.564014328Z ERROR controller/controller.go:329 Reconciler error {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.564228601Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.564230811Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.564255761Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564258331Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564260651Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564262871Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.569613573Z stderr F 2024-06-17T15:28:01.56942685Z LEVEL(-2) jobframework/reconciler.go:308 Reconciling Job {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.569639643Z stderr F 2024-06-17T15:28:01.569526852Z LEVEL(-3) jobframework/reconciler.go:390 update reclaimable counts if implemented by the job {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.57802327Z stderr F 2024-06-17T15:28:01.577808728Z LEVEL(-3) pod/expectations.go:59 Observed UID {"pod": {"name":"replacement-for-default-priority-group-1","namespace":"pod-e2e-hm8rn"}, "store": "finalizedPods", "key": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "uid": "97a58d62-5b4b-4cdc-b17f-c6a8d49640b2"}
2024-06-17T15:28:01.587391209Z stderr F 2024-06-17T15:28:01.587178517Z ERROR jobframework/reconciler.go:400 Updating reclaimable pods {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58742253Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.58742704Z stderr F /workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.58743087Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.58743446Z stderr F /workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.58743766Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.58744038Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.5874432Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58744607Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.58744881Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.5874517Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58745479Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874577Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.58746151Z stderr F 2024-06-17T15:28:01.587283298Z ERROR controller/controller.go:329 Reconciler error {"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58746463Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58746712Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.58746939Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.58747168Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58747391Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874763Z stderr F /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it seem to you @mbobrovskyi that extended time could save us here?
/test pull-kueue-test-scheduling-perf-main |
/cc |
We generally use /lgtm |
LGTM label has been added. Git tree hash: aab9fa18db0c616965505740dcc494f9626a6c2d
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, mszadkow The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The time could be increased due to the fact that is out of Kueue's control.
As the process is controlled by the Kubernetes.
The change does not affect happy path.
What type of PR is this?
/kind flake
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #2429
Special notes for your reviewer:
Does this PR introduce a user-facing change?