
🐛 avoid cluster auto approve failed occasionally #388

Merged · 1 commit · Apr 7, 2024

Conversation

skeeey
Member

@skeeey skeeey commented Mar 29, 2024

Summary

Remove the accept-cluster logic from the CSR controller and move it into the managed cluster controller.

Related issue(s)

Fixes #386
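The resulting split can be sketched roughly as follows. This is an illustrative sketch only; the type and function names are hypothetical, not the actual OCM APIs:

```go
package main

import "fmt"

// Illustrative sketch of the controller split: the CSR controller only
// approves CSRs, while the managed cluster controller accepts the cluster.
// Types and function names are hypothetical, not the actual OCM code.

type managedCluster struct {
	name             string
	hubAcceptsClient bool
}

// approveCSR handles only the CSR; it no longer touches the cluster.
func approveCSR(approved *bool) {
	*approved = true
}

// acceptCluster is handled independently by the managed cluster controller.
func acceptCluster(mc *managedCluster, autoApprovalEnabled bool) {
	if autoApprovalEnabled && !mc.hubAcceptsClient {
		mc.hubAcceptsClient = true
	}
}

func main() {
	approved := false
	mc := &managedCluster{name: "cluster1"}
	approveCSR(&approved)   // CSR path works even if the cluster does not exist yet
	acceptCluster(mc, true) // cluster path accepts the cluster on its own schedule
	fmt.Println(approved, mc.hubAcceptsClient)
}
```

Because the two paths no longer depend on each other, a CSR arriving before the ManagedCluster exists can no longer make auto-approval fail.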

@openshift-ci openshift-ci bot requested review from elgnay and ldpliu March 29, 2024 05:49
@skeeey
Member Author

skeeey commented Mar 29, 2024

/assign @qiujian16


codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.54%. Comparing base (22501d8) to head (5c823d4).
Report is 1 commit behind head on main.

Files Patch % Lines
pkg/registration/hub/managedcluster/controller.go 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
+ Coverage   61.48%   61.54%   +0.06%     
==========================================
  Files         133      133              
  Lines       14087    14078       -9     
==========================================
+ Hits         8662     8665       +3     
+ Misses       4672     4664       -8     
+ Partials      753      749       -4     
Flag Coverage Δ
unit 61.54% <93.33%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@nirs nirs left a comment


How can I test this fix?

@@ -129,24 +137,25 @@ func (b *csrBootstrapReconciler) Reconcile(ctx context.Context, csr csrInfo, app
return reconcileContinue, nil
}

err := b.accpetCluster(ctx, clusterName)
err := b.acceptCluster(ctx, clusterName)
Contributor


Mixing typo fixes with a real change makes it harder to review and harder to maintain the code later. If you revert this fix because it caused a regression, we will also lose the typo fixes.

Member Author


Good suggestion, I will drop the typo fixes in this PR.

return reconcileStop, nil
// Current spoke cluster not found, it may not be created yet, requeue it until we cannot find its related CSR
b.eventRecorder.Warningf("ManagedClusterNotApproved", "spoke cluster %q is not found", clusterName)
return reconcileRequeue, nil
Contributor


So if we have a CSR but we cannot find the managed cluster we queue the request and will check again after some delay?

I expect that if this reconciler depends on a CSR and ManagedCluster, it will watch both resources. If the ManagedCluster is missing, you can do nothing, and once the ManagedCluster appears, you can approve the CSR.
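The watch-based approach described above could be sketched like this. This is a simplified illustration with in-memory maps standing in for informers, not the actual OCM controller code:

```go
package main

import "fmt"

// Simplified sketch: instead of requeuing when the ManagedCluster is missing,
// the controller is keyed by events from both the CSR and the ManagedCluster
// watches, so sync simply returns and is re-triggered when the cluster appears.

type store struct {
	clusters map[string]bool // cluster name -> exists
	csrs     map[string]bool // cluster name -> pending CSR
	accepted map[string]bool // cluster name -> accepted
}

func (s *store) sync(clusterName string) {
	if !s.clusters[clusterName] {
		// Cluster not created yet: do nothing. The ManagedCluster
		// watch will trigger sync again once it is created.
		return
	}
	if s.csrs[clusterName] {
		s.accepted[clusterName] = true
	}
}

func main() {
	s := &store{
		clusters: map[string]bool{},
		csrs:     map[string]bool{"cluster1": true},
		accepted: map[string]bool{},
	}
	s.sync("cluster1") // CSR event arrives first: nothing happens
	s.clusters["cluster1"] = true
	s.sync("cluster1") // ManagedCluster event re-triggers sync: accepted now
	fmt.Println(s.accepted["cluster1"])
}
```

This avoids the polling delay of a requeue: the second sync runs as soon as the ManagedCluster create event arrives.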

Member


I think we should have a separate controller to accept the cluster, i.e. add a separate controller that watches both the CSR and the cluster. Requeuing here may cause an infinite loop and may block the normal renewal process.

Member Author


I will use a separate controller to handle the cluster and make this CSR controller handle only the CSR; this will be clearer.

// Current spoke cluster not found, could have been deleted, do nothing.
return reconcileStop, nil
// Current spoke cluster not found, it may not be created yet, requeue it until we cannot find its related CSR
b.eventRecorder.Warningf("ManagedClusterNotApproved", "spoke cluster %q is not found", clusterName)
Contributor


This warning will be useful!

}

b.eventRecorder.Eventf("ManagedClusterAutoApproved", "spoke cluster %q is auto approved.", clusterName)
return reconcileStop, nil
}

func (b *csrBootstrapReconciler) accpetCluster(ctx context.Context, managedClusterName string) error {
func (b *csrBootstrapReconciler) acceptCluster(ctx context.Context, managedClusterName string) error {
Contributor


Another typo that should not be part of this fix.

@@ -78,15 +85,15 @@ func (r *csrRenewalReconciler) Reconcile(ctx context.Context, csr csrInfo, appro
// Authorize whether the current spoke agent has been authorized to renew its csr.
allowed, err := authorize(ctx, r.kubeClient, csr)
if err != nil {
return reconcileContinue, err
return reconcileRequeue, err
Member


Why do we need to requeue here?

return reconcileStop, nil
// Current spoke cluster not found, it may not be created yet, requeue it until we cannot find its related CSR
b.eventRecorder.Warningf("ManagedClusterNotApproved", "spoke cluster %q is not found", clusterName)
return reconcileRequeue, nil
Member


I think we should have a separate controller to accept the cluster, i.e. add a separate controller that watches both the CSR and the cluster. Requeuing here may cause an infinite loop and may block the normal renewal process.

@skeeey skeeey force-pushed the faa branch 5 times, most recently from 1116678 to 29a3eeb Compare April 1, 2024 11:06
@skeeey
Member Author

skeeey commented Apr 1, 2024

How can I test this fix?

If you want to test with this PR, you can use my PR's repo to build your own images, then use clusteradm init --image-registry to specify the images.

Contributor

@nirs nirs left a comment


If I understand this change correctly, the only task the managedCluster controller needs to do is to set hubAcceptsClient to true once?

The CSR approval is not related to this and will work correctly even if the CSR is found before the managed cluster is created?

@@ -103,6 +110,15 @@ func (c *managedClusterController) sync(ctx context.Context, syncCtx factory.Syn
}

if !managedCluster.Spec.HubAcceptsClient {
// If the ManagedClusterAutoApproval feature is enabled, we automatically accept this cluster and
// add an annotation to mark its initial join as accepted.
Contributor


The comment is not clear about the purpose of the annotation.

Based on the new test, we want to approve the cluster only the first time we discover it, so the cluster admin can still deny the cluster. This should be explained in the comment.

Member Author


I will update this to make it clearer.


func (c *managedClusterController) acceptCluster(ctx context.Context, managedClusterName string) error {
// TODO support patching both annotations and spec simultaneously in the patcher
patch := fmt.Sprintf("{\"metadata\":{\"annotations\":{\"%s\":\"\"}},\"spec\":{\"hubAcceptsClient\":true}}", initialJoinAcceptedAnnotationKey)
Contributor


This works but is hard to read and error-prone. Maybe use a raw string literal to avoid the extra " escapes?

patch := fmt.Sprintf(`{"metadata":{"annotations":{"%s":""}},"spec":{"hubAcceptsClient":true}}`, initialJoinAcceptedAnnotationKey)

We don't have a value for the annotation so we use an empty string - maybe we can add a timestamp as the value to have more context? or the name of the controller adding this annotation? This can make it easier to debug issues.

Member Author


Agree, I think a timestamp may be more helpful.

return err
}

if _, ok := cluster.Annotations["open-cluster-management.io/initial-join-accepted"]; !ok {
Contributor


Based on this test, it seems that the purpose of the annotation is to approve the cluster automatically only once, so if the cluster is denied manually it will not be approved automatically again. So the annotation can be renamed to "automatically-approved".

Member Author

@skeeey skeeey Apr 2, 2024


Maybe use this annotation:
open-cluster-management.io/automatically-accepted-by-managedclusterctrl=<the accept time in RFC3339>

// If the ManagedClusterAutoApproval feature is enabled, we automatically accept this cluster and
// add an annotation to mark its initial join as accepted.
if features.HubMutableFeatureGate.Enabled(ocmfeature.ManagedClusterAutoApproval) {
// Only accept this cluster upon its initial join.
Contributor


This comment is not needed, the comment above can explain the cases we want to handle.

Member Author


removed

nirs added a commit to nirs/ramen that referenced this pull request Apr 1, 2024
I pushed images with fix[1] for issue[2] to my private repo.

[1] open-cluster-management-io/ocm#388
[2] open-cluster-management-io/ocm#386

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@skeeey
Member Author

skeeey commented Apr 2, 2024

If I understand this change correctly, the only task the managedCluster controller need to do is to set hubAcceptsClient to true once?

You are correct, the managedCluster controller sets hubAcceptsClient to true only once.

The CSR approval is not related to this and will work correctly even if the CSR is found before the managed cluster is created?

Yes; with this, the CSR approval process and the managed cluster accept process will be independent.

@skeeey skeeey force-pushed the faa branch 2 times, most recently from b3a2bc3 to 967fecc Compare April 2, 2024 05:09
Contributor

@nirs nirs left a comment


Looks good, but the annotation can be nicer, see comment.

I tested the previous version, the original issue is fixed.

I can do more testing when the PR is ready for merge.

"open-cluster-management.io/ocm/pkg/registration/helpers"
"open-cluster-management.io/ocm/pkg/registration/hub/manifests"
)

const clusterAcceptedAnnotationKey = "open-cluster-management.io/automatically-accepted-by-managedclusterctrl"
Contributor


I'm not sure "by-managedclusterctrl" is helpful, since I don't know which pod is hosting the managed cluster controller. Maybe this is better done by using an "open-cluster-management.io/automatically-accepted-by" annotation with a JSON value:

{"namespace":"open-cluster-management-hub","deploy":"cluster-manager-registration-controller","time":"2024-04-02 19:32:25+03:00"}

But I'm not sure about this JSON value; it is pretty ugly. Maybe it is better to use more annotations instead:

$ kubectl get managedcluster/dr1 -n dr1 --context hub -o yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  annotations:
    open-cluster-management.io/automatically-accepted-on: "2024-04-02 19:32:25+03:00"
    open-cluster-management.io/controlled-by: "cluster-manager-registration-controller"
  creationTimestamp: "2024-04-02T16:30:29Z"
  finalizers:
  - open-cluster-management.io/managedclusterrole
  - managedclusterinfo.finalizers.open-cluster-management.io
  - cluster.open-cluster-management.io/api-resource-cleanup
  generation: 4
  labels:
    cluster.open-cluster-management.io/clusterset: default
    feature.open-cluster-management.io/addon-application-manager: available
    feature.open-cluster-management.io/addon-config-policy-controller: available
    feature.open-cluster-management.io/addon-governance-policy-framework: available
    feature.open-cluster-management.io/addon-work-manager: available
    name: dr1
  name: dr1
  resourceVersion: "1539"
  uid: 4c0e2a55-7907-4118-b67b-a3189285984b
  ...

Anyway, for this PR I think we can go with "automatically-accepted-on" and a timestamp, and think more later about how to make it easier to find the relevant controller for resources.

Member Author


Yeah, let's go with "automatically-accepted-on" and a timestamp first.

Contributor

openshift-ci bot commented Apr 3, 2024

@nirs: changing LGTM is restricted to collaborators


Contributor

openshift-ci bot commented Apr 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nirs, skeeey


"open-cluster-management.io/ocm/pkg/registration/helpers"
"open-cluster-management.io/ocm/pkg/registration/hub/manifests"
)

const clusterAcceptedAnnotationKey = "open-cluster-management.io/automatically-accepted-on"
Member


nit: add a comment indicating this is an internal annotation; it is not expected to be removed.

Member Author


added

@nirs
Contributor

nirs commented Apr 4, 2024

I tested the latest version, 250 deploys succeeded.

During this run, found another clusteradm join issue: open-cluster-management-io/clusteradm#410

@qiujian16
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Apr 7, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit fd2d0eb into open-cluster-management-io:main Apr 7, 2024
13 checks passed
qiujian16 pushed a commit to qiujian16/ocm that referenced this pull request Apr 7, 2024
qiujian16 pushed a commit to qiujian16/ocm that referenced this pull request Apr 8, 2024
openshift-merge-bot bot pushed a commit that referenced this pull request Apr 8, 2024
Signed-off-by: Wei Liu <liuweixa@redhat.com>
Co-authored-by: Wei Liu <liuweixa@redhat.com>
@skeeey skeeey deleted the faa branch April 26, 2024 02:55

Successfully merging this pull request may close these issues.

cluster auto approve could fail occasionally
3 participants