
[multikueue] Cluster connection monitoring and reconnect. #1806

Merged: 2 commits merged into kubernetes-sigs:main on Mar 6, 2024

Conversation

@trasc (Contributor) commented Mar 5, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

[multikueue] Cluster connection monitoring and reconnect. Try to reconnect to the worker cluster when any of its watch loops ends.

Which issue(s) this PR fixes:

Fix #1787
Relates to #693

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Add worker connection monitoring and reconnect
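For context, the mechanism works roughly as follows: each worker cluster has one watch loop per watched kind, and when any loop ends (for example because the API server closes it after --min-request-timeout) a reconcile is queued so the cluster connection is rebuilt and the watches restarted. Below is a minimal sketch of that pattern in Go, with illustrative names (watchOnce, queueReconnect) rather than the actual code in this PR.

package main

import (
	"context"
	"log"
	"time"
)

// watchOnce stands in for a real client-go watch loop: it returns when the
// watch channel is closed by the server or the context is cancelled. In a
// real deployment the server closes watches after --min-request-timeout
// (~30 minutes); serverTimeout is shortened here only so the sketch exits.
func watchOnce(ctx context.Context, clusterName string, serverTimeout time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(serverTimeout):
		return nil
	}
}

// startWatch blocks until the watch ends and then asks the cluster
// reconciler to reconnect, unless the whole connection is being torn down.
func startWatch(ctx context.Context, clusterName string, serverTimeout time.Duration, queueReconnect func(cluster string)) {
	err := watchOnce(ctx, clusterName, serverTimeout)
	log.Printf("watch ended for %s: %v", clusterName, err)
	if ctx.Err() != nil {
		return // cluster removed or manager stopping: do not reconnect
	}
	queueReconnect(clusterName) // the reconciler rebuilds the client and restarts all watches
}

func main() {
	startWatch(context.Background(), "worker1", 2*time.Second, func(cluster string) {
		log.Printf("queue reconcile for reconnect: %s", cluster)
	})
}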

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 5, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2024
netlify bot commented Mar 5, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 5a6d05b
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65e87f511e86ee00089052a7

@mimowo (Contributor) commented Mar 6, 2024

I tested that this PR fixes #1787; please note that in the PR description.

The watches are closed after ~30 min, as determined by the API server's --min-request-timeout parameter. They are restarted immediately, and the workloads are admitted:

{"level":"Level(-2)","ts":"2024-03-06T07:38:22.991836644Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T07:38:22.996334577Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.494177757Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.49427571Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","cluster":"multikueue-test-worker2"}
{"level":"error","ts":"2024-03-06T07:49:18.495236114Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.495294072Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.499614002Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.503074664Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.353744664Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.353846304Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","cluster":"multikueue-test-worker1"}
{"level":"error","ts":"2024-03-06T08:09:46.354781998Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io","jobKind":"/, Kind=","error":"not a workload","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.354827672Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.360420409Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.36434089Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.585347169Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.585450161Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","cluster":"multikueue-test-worker2"}
{"level":"error","ts":"2024-03-06T08:21:44.586427722Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.58647663Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.592139962Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.59552363Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}

@mimowo (Contributor) commented Mar 6, 2024

I'm a little bit curious about this error line logged every ~30min during regular operation:

{"level":"error","ts":"2024-03-06T07:49:18.495236114Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}

Is this desired, or can we do something about it?

EDIT: if we can avoid logging this as an error, I think that is a win, but I'm happy to leave it for a follow-up if there are any complications.
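One possible direction, purely as a sketch and not necessarily what should be done here: if the error comes from the watcher receiving a final event whose object is not of the expected kind while the watch is shutting down (an assumption, not confirmed in this thread), such events could be tolerated at a low log level instead of reported as errors. Pod and ConfigMap below are stand-ins for Workload/JobSet.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// consume drains a watch channel, ignoring events whose object is not of the
// expected kind instead of reporting them at error level.
func consume(w watch.Interface) {
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			fmt.Printf("ignoring %s event with unexpected object %T\n", ev.Type, ev.Object)
			continue
		}
		fmt.Println("observed pod:", pod.Name)
	}
	fmt.Println("watch ended")
}

func main() {
	fw := watch.NewFake()
	go func() {
		fw.Add(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "sample"}})
		fw.Add(&corev1.ConfigMap{}) // wrong kind: tolerated, not an error
		fw.Stop()
	}()
	consume(fw)
}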

@mimowo (Contributor) left a comment

Overall lgtm. Big plus for the e2e test.

My main ask would be to try to delegate the backoff to the built-in mechanisms, if feasible. If not feasible, please describe why.


// retryAfter returns an exponentially increasing interval between
// retryIncrement and 2^retryMaxSteps * retryIncrement
func retryAfter(failedAttempts uint) time.Duration {
Contributor:

Ideally, I would leave the backoff calculations to the built-in mechanisms, if feasible.

trasc (PR author):

Not being able to connect should not be seen as a reconcile error in my opinion, as it is not related to k8s state.
Also, with this we maintain control over the retry timing.

@mimowo (Contributor) commented Mar 6, 2024

Not being able to connect should not be seen as a reconcile error in my opinion, as it is not related to k8s state.

In most cases, when Kueue sends a request from a node to the kube API server and the API server drops the request, we handle the failure as a reconcile error.

However, that is an "internal" (within-cluster) connect error; for external connect errors a longer baseDelay may indeed be preferred.

Also, with this we maintain control over the retry timing.

I see; I just have a preference for the KISS principle. We could introduce our own timing mechanism later, when it is proven to be needed.

However, I'm on the fence here, because for communication with an external cluster a higher baseDelay may indeed be preferred. WDYT @alculquicondor?

In case we want to control the timings, is it much of a complication to use the standard rate-limiting queue, like for example here? Then we could pass the baseDelay and maxDelay. However, if this is a big complication, I'm fine as is.
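To make the rate-limiting-queue option concrete, here is a standalone sketch of the client-go helper referred to above; the baseDelay/maxDelay values are illustrative, not a proposal for this PR.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The item-exponential rate limiter doubles the delay per failure of the
	// same item, starting at baseDelay and capped at maxDelay.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second /* baseDelay */, 5*time.Minute /* maxDelay */)

	cluster := "multikueue-test-worker1"
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: requeue %s after %v\n", attempt, cluster, limiter.When(cluster))
	}
	limiter.Forget(cluster) // reset the backoff once a connection succeeds
}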

Contributor:

You could also use the Backoff struct from k8s.io/apimachinery/pkg/util/wait.

But this is on the nit side.
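Roughly what using that helper could look like; the Duration/Factor/Steps/Cap values below are illustrative and not taken from this PR.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// wait.Backoff keeps its own step counter: each Step() call returns the
	// next delay, multiplying by Factor (with optional Jitter) until Cap.
	backoff := wait.Backoff{
		Duration: 5 * time.Second,  // initial retry delay (illustrative)
		Factor:   2.0,              // double after every failed attempt
		Jitter:   0.1,              // up to 10% random jitter
		Steps:    7,                // growth steps before the delay stops increasing
		Cap:      10 * time.Minute, // hard upper bound on the delay
	}
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: retry after %v\n", attempt, backoff.Step())
	}
}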

Contributor:

@trasc if you prefer to keep the custom timings, I'm fine; just do a quick review of whether we can simplify the code by using the rate limiter or the package suggested by Aldo, so that we avoid reinventing the wheel. If you find this is the simplest approach, I'm ok, but please review the options.

trasc (PR author):

I did look at Backoff in k8s.io/apimachinery/pkg/util/wait, but it's a bit of an overkill for what we are doing here.
Another thing I was thinking of was to just double the time since the cluster was declared inactive: if it failed 5 min ago, we try now, and if it fails again we retry in 5 min. The plus side of this is that we don't need to keep internal state, but the behavior is harder to predict.
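For reference, a function matching the contract quoted above (an interval between retryIncrement and 2^retryMaxSteps * retryIncrement) could look roughly like this; the constants and the body are illustrative, not the code merged in this PR.

package main

import (
	"fmt"
	"time"
)

// Illustrative constants; the actual values in the PR may differ.
const (
	retryIncrement = 5 * time.Second
	retryMaxSteps  = 7
)

// retryAfter returns an exponentially increasing interval between
// retryIncrement and 2^retryMaxSteps * retryIncrement (sketch only).
func retryAfter(failedAttempts uint) time.Duration {
	if failedAttempts == 0 {
		return 0
	}
	steps := failedAttempts - 1
	if steps > retryMaxSteps {
		steps = retryMaxSteps // cap the growth so the interval never exceeds 2^retryMaxSteps * retryIncrement
	}
	return retryIncrement * time.Duration(1<<steps)
}

func main() {
	for attempts := uint(1); attempts <= 10; attempts++ {
		fmt.Printf("after %d failed attempts: retry in %v\n", attempts, retryAfter(attempts))
	}
}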

@alculquicondor (Contributor) left a comment

/approve

Leaving the LGTM to @mimowo

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2024
@alculquicondor (Contributor) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 6d592aaa04fcc7d278bc685363066bb8e7677935

@k8s-ci-robot k8s-ci-robot merged commit 5a2e716 into kubernetes-sigs:main Mar 6, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Mar 6, 2024
@alculquicondor (Contributor)

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@alculquicondor: new pull request created: #1809

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y (Member)

/release-note-edit

Added MultiKueue worker connection monitoring and reconnect. 

vsoch pushed a commit to researchapps/kueue that referenced this pull request on Apr 18, 2024:

[multikueue] Cluster connection monitoring and reconnect. (kubernetes-sigs#1806)

* [multikueue] Cluster connection monitoring and reconnect.
* Review Remarks
@alculquicondor (Contributor)

/release-note-edit

MultiKueue: Add worker connection monitoring and reconnect

Labels: approved, cncf-cla: yes, kind/feature, lgtm, release-note, size/L

Successfully merging this pull request may close these issues: [multikueue] Jobs created on manager cluster remain "stuck" suspended.

6 participants