🐛 avoid cluster auto approve failed occasionally #388
Conversation
/assign @qiujian16
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #388 +/- ##
==========================================
+ Coverage 61.48% 61.54% +0.06%
==========================================
Files 133 133
Lines 14087 14078 -9
==========================================
+ Hits 8662 8665 +3
+ Misses 4672 4664 -8
+ Partials 753 749 -4
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
How can I test this fix?
@@ -129,24 +137,25 @@ func (b *csrBootstrapReconciler) Reconcile(ctx context.Context, csr csrInfo, app
	return reconcileContinue, nil
}

-	err := b.accpetCluster(ctx, clusterName)
+	err := b.acceptCluster(ctx, clusterName)
Mixing typo fixes with a real change makes it harder to review and harder to maintain the code later. If you revert this fix because it caused a regression, we will also lose the typo fixes.
Good suggestion, I will drop the typo fixes in this PR.
	return reconcileStop, nil
	// Current spoke cluster not found, it may not be created yet, requeue it until we cannot find its related CSR
	b.eventRecorder.Warningf("ManagedClusterNotApproved", "spoke cluster %q is not found", clusterName)
	return reconcileRequeue, nil
So if we have a CSR but we cannot find the managed cluster, we queue the request and check again after some delay?
I expect that if this reconciler depends on a CSR and ManagedCluster, it will watch both resources. If the ManagedCluster is missing, you can do nothing, and once the ManagedCluster appears, you can approve the CSR.
I think we should have a separate controller to accept the cluster, one that watches both the CSR and the cluster. Requeueing here may cause an infinite loop and may block the normal renewal process.
I will use a separate controller to handle the cluster and make this CSR controller handle only the CSR; this will be clearer.
	// Current spoke cluster not found, could have been deleted, do nothing.
	return reconcileStop, nil
	// Current spoke cluster not found, it may not be created yet, requeue it until we cannot find its related CSR
	b.eventRecorder.Warningf("ManagedClusterNotApproved", "spoke cluster %q is not found", clusterName)
This warning will be useful!
}

	b.eventRecorder.Eventf("ManagedClusterAutoApproved", "spoke cluster %q is auto approved.", clusterName)
	return reconcileStop, nil
}

-func (b *csrBootstrapReconciler) accpetCluster(ctx context.Context, managedClusterName string) error {
+func (b *csrBootstrapReconciler) acceptCluster(ctx context.Context, managedClusterName string) error {
Another typo that should not be part of this fix.
@@ -78,15 +85,15 @@ func (r *csrRenewalReconciler) Reconcile(ctx context.Context, csr csrInfo, appro
	// Authorize whether the current spoke agent has been authorized to renew its csr.
	allowed, err := authorize(ctx, r.kubeClient, csr)
	if err != nil {
-		return reconcileContinue, err
+		return reconcileRequeue, err
Why do we need to requeue here?
Force-pushed from 1116678 to 29a3eeb
If you want to test with this PR, you can use my PR's repo to build your own images, then use
If I understand this change correctly, the only task the managedCluster controller needs to do is to set hubAcceptsClient to true once?
The CSR approval is not related to this and will work correctly even if the CSR is found before the managed cluster is created?
@@ -103,6 +110,15 @@ func (c *managedClusterController) sync(ctx context.Context, syncCtx factory.Syn
	}

	if !managedCluster.Spec.HubAcceptsClient {
		// If the ManagedClusterAutoApproval feature is enabled, we automatically accept this cluster and
		// add an annotation to mark its initial join as accepted.
The comment is not clear about the purpose of the annotation.
Based on the new test, we want to approve the cluster only once, the first time we discover it, so the cluster admin can deny the cluster. This should be explained in the comment.
I will update this to make it clearer.
}

func (c *managedClusterController) acceptCluster(ctx context.Context, managedClusterName string) error {
	// TODO support patching both annotations and spec simultaneously in the patcher
	patch := fmt.Sprintf("{\"metadata\":{\"annotations\":{\"%s\":\"\"}},\"spec\":{\"hubAcceptsClient\":true}}", initialJoinAcceptedAnnotationKey)
This works but is hard to read and error prone. Maybe use a raw string literal to avoid the extra " escapes?
patch := fmt.Sprintf(`{"metadata":{"annotations":{"%s":""}},"spec":{"hubAcceptsClient":true}}`, initialJoinAcceptedAnnotationKey)
We don't have a value for the annotation, so we use an empty string. Maybe we can add a timestamp as the value to have more context? Or the name of the controller adding this annotation? This can make it easier to debug issues.
Agree, I think a timestamp may be more helpful.
		return err
	}

	if _, ok := cluster.Annotations["open-cluster-management.io/initial-join-accepted"]; !ok {
Based on this test, it seems that the purpose of the annotation is to approve the cluster automatically only once, so if the cluster is denied manually it will not be approved automatically again. So the annotation could be renamed to "automatically-approved".
Maybe use this annotation:
open-cluster-management.io/automatically-accepted-by-managedclusterctrl=<the accept time in RFC3339>
	// If the ManagedClusterAutoApproval feature is enabled, we automatically accept this cluster and
	// add an annotation to mark its initial join as accepted.
	if features.HubMutableFeatureGate.Enabled(ocmfeature.ManagedClusterAutoApproval) {
		// Only accept this cluster upon its initial join.
This comment is not needed, the comment above can explain the cases we want to handle.
removed
I pushed images with fix[1] for issue[2] to my private repo. [1] open-cluster-management-io/ocm#388 [2] open-cluster-management-io/ocm#386 Signed-off-by: Nir Soffer <nsoffer@redhat.com>
You are correct, the managedCluster controller only sets hubAcceptsClient to true once.
Yes, with this, the CSR approval process and the managed cluster accept process will be independent.
Force-pushed from b3a2bc3 to 967fecc
Looks good, but the annotation can be nicer, see comment.
I tested the previous version, the original issue is fixed.
I can do more testing when the PR is ready for merge.
	"open-cluster-management.io/ocm/pkg/registration/helpers"
	"open-cluster-management.io/ocm/pkg/registration/hub/manifests"
)

const clusterAcceptedAnnotationKey = "open-cluster-management.io/automatically-accepted-by-managedclusterctrl"
I'm not sure "by-managedclusterctrl" is helpful, since I don't know which pod is hosting the managed cluster controller. Maybe this is better done with an "open-cluster-management.io/automatically-accepted-by" annotation using a JSON value:
{"namespace":"open-cluster-management-hub","deploy":"cluster-manager-registration-controller","time":"2024-04-02 19:32:25+03:00"}
But I'm not sure about this JSON value, it is pretty ugly. Maybe it is better to use more annotations instead:
$ kubectl get managedcluster/dr1 -n dr1 --context hub -o yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
annotations:
open-cluster-management.io/automatically-accepted-on: "2024-04-02 19:32:25+03:00"
open-cluster-management.io/controlled-by: "cluster-manager-registration-controller"
creationTimestamp: "2024-04-02T16:30:29Z"
finalizers:
- open-cluster-management.io/managedclusterrole
- managedclusterinfo.finalizers.open-cluster-management.io
- cluster.open-cluster-management.io/api-resource-cleanup
generation: 4
labels:
cluster.open-cluster-management.io/clusterset: default
feature.open-cluster-management.io/addon-application-manager: available
feature.open-cluster-management.io/addon-config-policy-controller: available
feature.open-cluster-management.io/addon-governance-policy-framework: available
feature.open-cluster-management.io/addon-work-manager: available
name: dr1
resourceVersion: "1539"
uid: 4c0e2a55-7907-4118-b67b-a3189285984b
...
Anyway, for this PR I think we can go with "automatically-accepted-on" and a timestamp, and think more later about how to make it easier to find the relevant controller for resources.
Yeah, let's go with "automatically-accepted-on" and a timestamp first.
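The accept-once behavior the thread converges on can be sketched as a small predicate. The function name and plain-argument signature are illustrative; the real controller reads these fields from the ManagedCluster object and the feature gate:

```go
package main

import "fmt"

const clusterAcceptedAnnotationKey = "open-cluster-management.io/automatically-accepted-on"

// shouldAutoAccept reports whether the managed cluster controller should
// auto-accept a cluster: only when the auto-approval feature gate is on,
// the cluster is not yet accepted, and it was never auto-accepted before,
// so a manual denial by the cluster admin is not overridden.
func shouldAutoAccept(featureEnabled, hubAcceptsClient bool, annotations map[string]string) bool {
	if !featureEnabled || hubAcceptsClient {
		return false
	}
	_, alreadyAccepted := annotations[clusterAcceptedAnnotationKey]
	return !alreadyAccepted
}

func main() {
	// First join: accept and stamp the annotation.
	fmt.Println(shouldAutoAccept(true, false, map[string]string{}))
	// Admin denied the cluster after the initial auto-accept: leave it denied.
	fmt.Println(shouldAutoAccept(true, false, map[string]string{
		clusterAcceptedAnnotationKey: "2024-04-02T16:30:29Z",
	}))
}
```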
@nirs: changing LGTM is restricted to collaborators. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nirs, skeeey The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
	"open-cluster-management.io/ocm/pkg/registration/helpers"
	"open-cluster-management.io/ocm/pkg/registration/hub/manifests"
)

const clusterAcceptedAnnotationKey = "open-cluster-management.io/automatically-accepted-on"
nit: add a comment indicating this is an internal annotation; it is not expected to be removed.
added
I tested the latest version, 250 deploys succeeded. During this run, I found another clusteradm join issue: open-cluster-management-io/clusteradm#410
Signed-off-by: Wei Liu <liuweixa@redhat.com>
/lgtm
Merged fd2d0eb into open-cluster-management-io:main
Summary
Remove the accept-cluster logic from the CSR controller and put it in the managed cluster controller.
Related issue(s)
Fixes #386