Delay in managedCluster status update #117

Closed
josecastillolema opened this issue Dec 4, 2022 · 6 comments

josecastillolema commented Dec 4, 2022

I have a managed cluster configured with a heartbeat of 60 seconds:

% oc get -o yaml managedcluster cluster1 | grep lease
  leaseDurationSeconds: 60

However, when I scale down the klusterlet-registration-agent in the managed cluster, it takes almost 5 minutes for the cluster to be reported with Unknown availability:

% k scale deployment.apps/klusterlet-registration-agent --replicas 0
deployment.apps/klusterlet-registration-agent scaled
% time cat
cat  0.00s user 0.00s system 0% cpu 4:42.50 total

Conversely, it takes only 5-6 seconds for the cluster to be reported as Available again when the agent is scaled back up:

% k scale deployment.apps/klusterlet-registration-agent --replicas 1
deployment.apps/klusterlet-registration-agent scaled
% time cat                                                          
cat  0.00s user 0.00s system 0% cpu 6.344 total

I would have expected a smaller delay, since the heartbeat is configured to 60 seconds.

Notes:

  • Both environments are kind clusters configured as per the Quick Start guide
  • Setting leaseDurationSeconds to a lower value (e.g. 1 or 5 seconds) does not change the outcome
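
For reference, the availability flip can be timed more directly than with the manual time cat trick above. The following is a rough sketch added for illustration and is not part of the original report: a small Go program that polls the ManagedCluster on the hub and prints how long the ManagedClusterConditionAvailable condition takes to change, assuming the default kubeconfig points at the hub and the cluster is named cluster1 as above.

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

// GVR of the ManagedCluster API (cluster-scoped).
var managedClusterGVR = schema.GroupVersionResource{
    Group:    "cluster.open-cluster-management.io",
    Version:  "v1",
    Resource: "managedclusters",
}

// availableStatus returns the status ("True"/"False"/"Unknown") of the
// ManagedClusterConditionAvailable condition on the named cluster.
func availableStatus(ctx context.Context, client dynamic.Interface, name string) (string, error) {
    mc, err := client.Resource(managedClusterGVR).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return "", err
    }
    conditions, _, err := unstructured.NestedSlice(mc.Object, "status", "conditions")
    if err != nil {
        return "", err
    }
    for _, c := range conditions {
        cond, ok := c.(map[string]interface{})
        if ok && cond["type"] == "ManagedClusterConditionAvailable" {
            return fmt.Sprintf("%v", cond["status"]), nil
        }
    }
    return "", nil
}

func main() {
    // Assumes the default kubeconfig (~/.kube/config) points at the hub.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    ctx := context.Background()
    last, _ := availableStatus(ctx, client, "cluster1")
    start := time.Now()
    for { // poll once per second until interrupted (Ctrl-C)
        time.Sleep(time.Second)
        cur, err := availableStatus(ctx, client, "cluster1")
        if err != nil || cur == last {
            continue
        }
        fmt.Printf("Available condition changed %q -> %q after %s\n", last, cur, time.Since(start))
        last, start = cur, time.Now()
    }
}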

josecastillolema (Author) commented:

Logs from the hub cluster-manager-registration-controller:

I1204 12:27:21.129691       1 event.go:285] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"open-cluster-management-hub", Name:"open-cluster-management-hub", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ManagedClusterAvailableConditionUpdated' update managed cluster "cluster1" available condition to unknown, due to its lease is not updated constantly
I1204 12:27:21.185928       1 event.go:285] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"open-cluster-management-hub", Name:"open-cluster-management-hub", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ManagedClusterConditionAvailableUpdated' Update the original taints to the [{Key:cluster.open-cluster-management.io/unreachable Value: Effect:NoSelect TimeAdded:0001-01-01 00:00:00 +0000 UTC}]
I1204 12:30:29.673784       1 event.go:285] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"open-cluster-management-hub", Name:"open-cluster-management-hub", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ManagedClusterConditionAvailableUpdated' Update the original taints to the []


mikeshng commented Dec 5, 2022

I am guessing the lease is being refreshed every 60 seconds, but the hub check that evaluates the lease is still on a 5-minute interval.

CC @qiujian16 @deads2k

qiujian16 commented:

/assign @skeeey


skeeey commented Dec 5, 2022

@josecastillolema

A managed cluster has a grace period before its availability is set to unknown. The grace period is leaseDurationSeconds x 5, so when the managed cluster stops updating its lease, by default the cluster becomes unknown after 5 minutes (the default leaseDurationSeconds is 60s). The grace period helps tolerate unexpected situations such as network latency. Once the managed cluster resumes updating its lease, we expect it to become Available immediately.

We do support reducing leaseDurationSeconds to shrink the grace period (as you did), but unfortunately it seems there is a bug in the current implementation. I will take a look.

Thanks.
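
To make the arithmetic concrete, here is a minimal sketch of the grace-period check skeeey describes (an illustration only; leaseExpired is a hypothetical helper, not the actual registration-controller code), assuming the grace period is leaseDurationSeconds x 5:

package main

import (
    "fmt"
    "time"
)

// leaseExpired reports whether the hub should mark the cluster's availability
// as unknown: the lease has not been renewed for more than
// leaseDurationSeconds * 5 (the grace period).
func leaseExpired(lastRenew time.Time, leaseDurationSeconds int32, now time.Time) bool {
    gracePeriod := time.Duration(leaseDurationSeconds) * 5 * time.Second
    return now.After(lastRenew.Add(gracePeriod))
}

func main() {
    stoppedAt := time.Now() // last lease renewal before the agent was scaled down

    // Default 60s lease: the cluster only turns unknown ~5 minutes after the
    // last renewal, which lines up with the ~4:42 observed above.
    fmt.Println(leaseExpired(stoppedAt, 60, stoppedAt.Add(4*time.Minute))) // false
    fmt.Println(leaseExpired(stoppedAt, 60, stoppedAt.Add(6*time.Minute))) // true

    // A lower leaseDurationSeconds should shrink the grace period
    // proportionally (5s lease -> 25s), which is the part reported as buggy.
    fmt.Println(leaseExpired(stoppedAt, 5, stoppedAt.Add(30*time.Second))) // true
}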

josecastillolema (Author) commented:

Thanks for the quick response!


github-actions bot added the Stale label Jul 14, 2023
github-actions bot closed this as not planned Jul 29, 2023
zhujian7 added a commit to zhujian7/ocm that referenced this issue Sep 9, 2024
…s deleted (open-cluster-management-io#117)

* Refresh external managed token secret if service account ns changes (open-cluster-management-io#458)

* 🐛 Refresh external managed token secret if service account is deleted (open-cluster-management-io#504) (open-cluster-management-io#78)

* Debug e2e

Signed-off-by: zhujian <jiazhu@redhat.com>