
[multikueue] Cluster connection monitoring and reconnect. #1806

Merged: 2 commits merged into kubernetes-sigs:main on Mar 6, 2024

Conversation

@trasc (Contributor) commented Mar 5, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

[multikueue] Cluster connection monitoring and reconnect. Try to reconnect to the worker cluster when any of its watch loops ends.

Which issue(s) this PR fixes:

Fix #1787
Relates to #693

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Add worker connection monitoring and reconnect
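For context, the mechanism works roughly as follows: each worker cluster has one watch loop per watched kind, and when any loop ends (for example because the API server closes it after --min-request-timeout) a reconcile is queued so the cluster connection is rebuilt and the watches restarted. Below is a minimal sketch of that pattern in Go, with illustrative names (watchOnce, queueReconnect) rather than the actual code in this PR.

package main

import (
	"context"
	"log"
	"time"
)

// watchOnce stands in for a real client-go watch loop: it returns when the
// watch channel is closed by the server or the context is cancelled. In a
// real deployment the server closes watches after --min-request-timeout
// (~30 minutes); serverTimeout is shortened here only so the sketch exits.
func watchOnce(ctx context.Context, clusterName string, serverTimeout time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(serverTimeout):
		return nil
	}
}

// startWatch blocks until the watch ends and then asks the cluster
// reconciler to reconnect, unless the whole connection is being torn down.
func startWatch(ctx context.Context, clusterName string, serverTimeout time.Duration, queueReconnect func(cluster string)) {
	err := watchOnce(ctx, clusterName, serverTimeout)
	log.Printf("watch ended for %s: %v", clusterName, err)
	if ctx.Err() != nil {
		return // cluster removed or manager stopping: do not reconnect
	}
	queueReconnect(clusterName) // the reconciler rebuilds the client and restarts all watches
}

func main() {
	startWatch(context.Background(), "worker1", 2*time.Second, func(cluster string) {
		log.Printf("queue reconcile for reconnect: %s", cluster)
	})
}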

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 5, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2024
netlify bot commented Mar 5, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 5a6d05b
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65e87f511e86ee00089052a7

@mimowo (Contributor) commented Mar 6, 2024

I tested that this PR fixes #1787; please note that in the PR description.

The watches are closed after ~30 min, as determined by the API server's --min-request-timeout parameter. They are restarted immediately, and the workloads are admitted:

{"level":"Level(-2)","ts":"2024-03-06T07:38:22.991836644Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T07:38:22.996334577Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.494177757Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.49427571Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","cluster":"multikueue-test-worker2"}
{"level":"error","ts":"2024-03-06T07:49:18.495236114Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.495294072Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.499614002Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T07:49:18.503074664Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.353744664Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.353846304Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","cluster":"multikueue-test-worker1"}
{"level":"error","ts":"2024-03-06T08:09:46.354781998Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io","jobKind":"/, Kind=","error":"not a workload","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.354827672Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.360420409Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T08:09:46.36434089Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.585347169Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","ctxErr":null}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.585450161Z","caller":"multikueue/multikueuecluster.go:209","msg":"Queue reconcile for reconnect","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io","cluster":"multikueue-test-worker2"}
{"level":"error","ts":"2024-03-06T08:21:44.586427722Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.58647663Z","caller":"multikueue/multikueuecluster.go:205","msg":"Watch ended","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","ctxErr":"context canceled"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.592139962Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"Workload.kueue.x-k8s.io"}
{"level":"Level(-2)","ts":"2024-03-06T08:21:44.59552363Z","caller":"multikueue/multikueuecluster.go:196","msg":"Starting watch","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet"}

@mimowo (Contributor) commented Mar 6, 2024

I'm a little bit curious about this error line logged every ~30min during regular operation:

{"level":"error","ts":"2024-03-06T07:49:18.495236114Z","caller":"multikueue/multikueuecluster.go:200","msg":"Cannot get workload key","clusterName":"multikueue-test-worker2","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","jobKind":"/, Kind=","error":"not a jobset","stacktrace":"sigs.k8s.io/kueue/pkg/controller/admissionchecks/multikueue.(*remoteClient).startWatcher.func1\n\t/workspace/pkg/controller/admissionchecks/multikueue/multikueuecluster.go:200"}

Is this desired, or can we do something about it?

EDIT: if we can avoid logging this as an error, I think that is a win, but I'm happy to leave it for a follow-up if there are any complications.
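One possible direction, purely as a sketch and not necessarily what should be done here: if the error comes from the watcher receiving a final event whose object is not of the expected kind while the watch is shutting down (an assumption, not confirmed in this thread), such events could be tolerated at a low log level instead of reported as errors. Pod and ConfigMap below are stand-ins for Workload/JobSet.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// consume drains a watch channel, ignoring events whose object is not of the
// expected kind instead of reporting them at error level.
func consume(w watch.Interface) {
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			fmt.Printf("ignoring %s event with unexpected object %T\n", ev.Type, ev.Object)
			continue
		}
		fmt.Println("observed pod:", pod.Name)
	}
	fmt.Println("watch ended")
}

func main() {
	fw := watch.NewFake()
	go func() {
		fw.Add(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "sample"}})
		fw.Add(&corev1.ConfigMap{}) // wrong kind: tolerated, not an error
		fw.Stop()
	}()
	consume(fw)
}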

@mimowo (Contributor) left a comment

Overall lgtm. Big plus for the e2e test.

My main ask would be to try to delegate the backoff to the built-in mechanisms, if feasible. If not feasible, please describe why.


// retryAfter returns an exponentially increasing interval between
// retryIncrement and 2^retryMaxSteps * retryIncrement
func retryAfter(failedAttempts uint) time.Duration {
Contributor:

Ideally, I would leave the backoff calculations to the built-in mechanisms, if feasible.

trasc (PR author):

Not being able to connect should not be seen as a reconcile error in my opinion, as it is not related to k8s state.
Also, with this we maintain control over the retry timing.

@mimowo (Contributor) commented Mar 6, 2024

Not being able to connect should not be seen as a reconcile error in my opinion, as it is not related to k8s state.

In most cases, when Kueue sends a request from a node to the kube API server and the API server drops the request, we handle the failure as a reconcile error.

However, that is an "internal" (within-cluster) connect error; for external connect errors a longer baseDelay may indeed be preferred.

Also, with this we maintain control over the retry timing.

I see; I just have a preference for the KISS principle. We could introduce our own timing mechanism later, when it is proven to be needed.

However, I'm on the fence here, because for communication with an external cluster a higher baseDelay may indeed be preferred. WDYT @alculquicondor?

In case we want to control the timings, is it much of a complication to use the standard rate-limiting queue, like for example here? Then we could pass the baseDelay and maxDelay. However, if this is a big complication, I'm fine as is.
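To make the rate-limiting-queue option concrete, here is a standalone sketch of the client-go helper referred to above; the baseDelay/maxDelay values are illustrative, not a proposal for this PR.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The item-exponential rate limiter doubles the delay per failure of the
	// same item, starting at baseDelay and capped at maxDelay.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second /* baseDelay */, 5*time.Minute /* maxDelay */)

	cluster := "multikueue-test-worker1"
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: requeue %s after %v\n", attempt, cluster, limiter.When(cluster))
	}
	limiter.Forget(cluster) // reset the backoff once a connection succeeds
}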

Contributor:

You could also use the Backoff struct from k8s.io/apimachinery/pkg/util/wait.

But this is on the nit side.
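Roughly what using that helper could look like; the Duration/Factor/Steps/Cap values below are illustrative and not taken from this PR.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// wait.Backoff keeps its own step counter: each Step() call returns the
	// next delay, multiplying by Factor (with optional Jitter) until Cap.
	backoff := wait.Backoff{
		Duration: 5 * time.Second,  // initial retry delay (illustrative)
		Factor:   2.0,              // double after every failed attempt
		Jitter:   0.1,              // up to 10% random jitter
		Steps:    7,                // growth steps before the delay stops increasing
		Cap:      10 * time.Minute, // hard upper bound on the delay
	}
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: retry after %v\n", attempt, backoff.Step())
	}
}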

Contributor:

@trasc if you prefer to keep the custom timings, I'm fine; just do a quick review of whether we can simplify the code by using the rate limiter or the package suggested by Aldo, so that we avoid reinventing the wheel. If you find this is the simplest approach, I'm ok, but please review the options.

trasc (PR author):

I did look at Backoff in k8s.io/apimachinery/pkg/util/wait, but it's a bit of an overkill for what we are doing here.
Another thing I was thinking of was to just double the time since the cluster was declared inactive: if it failed 5 min ago, we try now, and if it fails again we retry in 5 min. The plus side of this is that we don't need to keep internal state, but the behavior is harder to predict.
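For reference, a function matching the contract quoted above (an interval between retryIncrement and 2^retryMaxSteps * retryIncrement) could look roughly like this; the constants and the body are illustrative, not the code merged in this PR.

package main

import (
	"fmt"
	"time"
)

// Illustrative constants; the actual values in the PR may differ.
const (
	retryIncrement = 5 * time.Second
	retryMaxSteps  = 7
)

// retryAfter returns an exponentially increasing interval between
// retryIncrement and 2^retryMaxSteps * retryIncrement (sketch only).
func retryAfter(failedAttempts uint) time.Duration {
	if failedAttempts == 0 {
		return 0
	}
	steps := failedAttempts - 1
	if steps > retryMaxSteps {
		steps = retryMaxSteps // cap the growth so the interval never exceeds 2^retryMaxSteps * retryIncrement
	}
	return retryIncrement * time.Duration(1<<steps)
}

func main() {
	for attempts := uint(1); attempts <= 10; attempts++ {
		fmt.Printf("after %d failed attempts: retry in %v\n", attempts, retryAfter(attempts))
	}
}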

@alculquicondor (Contributor) left a comment

/approve

Leaving the LGTM to @mimowo

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2024
@alculquicondor (Contributor) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 6d592aaa04fcc7d278bc685363066bb8e7677935

@k8s-ci-robot k8s-ci-robot merged commit 5a2e716 into kubernetes-sigs:main Mar 6, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Mar 6, 2024
@alculquicondor (Contributor)

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@alculquicondor: new pull request created: #1809

In response to this:

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y (Member)

/release-note-edit

Added MultiKueue worker connection monitoring and reconnect. 

vsoch pushed a commit to researchapps/kueue that referenced this pull request on Apr 18, 2024:

[multikueue] Cluster connection monitoring and reconnect. (kubernetes-sigs#1806)

* [multikueue] Cluster connection monitoring and reconnect.
* Review Remarks
@alculquicondor (Contributor)

/release-note-edit

MultiKueue: Add worker connection monitoring and reconnect

Labels: approved, cncf-cla: yes, kind/feature, lgtm, release-note, size/L

Successfully merging this pull request may close these issues: [multikueue] Jobs created on manager cluster remain "stuck" suspended.

6 participants