[multikueue] Cluster connection monitoring and reconnect. #1806
Ideally, I would leave the backoff calculations to the built-in mechanisms, if feasible.
Not being able to connect should not be treated as a reconcile error, in my opinion, as it is not related to k8s state.
Also, this way we maintain control over the retry timing.
In most cases, when Kueue sends a request from a node to the kube API server and the API server drops the request, we handle the failure as a reconcile error.
However, that is an "internal" (within-cluster) connect error; for external connect errors a longer `baseDelay` may indeed be preferred.
I see. I just have a preference for the KISS principle; we could introduce our own timing mechanism later, once it's proven to be needed.
However, I'm on the fence here, because for communication with an external cluster a higher `baseDelay` may indeed be preferred. WDYT @alculquicondor?

In case we want to control the timings, is it much of a complication to use the standard rate-limiting queue, like for example here? Then we could pass the `baseDelay` and `maxDelay`. However, if this is a big complication, I'm fine as is.
You could also use the `Backoff` class from `k8s.io/apimachinery/pkg/util/wait`.
But on the nit side.
@trasc if you prefer to keep the custom timings, I'm fine; just do a quick review of whether we can simplify the code by using the rate limiter or the package suggested by Aldo, so that we avoid reinventing the wheel. If you find this is the simplest approach, I'm OK, but please review the options.
I did look at `Backoff` in `k8s.io/apimachinery/pkg/util/wait`, but it's a bit overkill for what we are doing here.

Another thing I was thinking of was to just double the time since the cluster was declared inactive: if it failed 5 min ago, we try now and, if we fail again, retry in 5 min. The plus side of this is that we don't need to keep internal state, but the behavior is harder to predict.