Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add a condition to report when hub and agent clock out of sync. #312

Conversation

xuezhaojun
Copy link
Member

@xuezhaojun xuezhaojun commented Nov 14, 2023

Summary

When hub and agent clock out of sync, the managed cluster status will show as 'Unknown' even the agent functions well.

Here we add a new condition to report clock out-of-sync, when we detect the clock of hub and a agent is not synced.

The strategy:
Every time hub find the renewTime updated, we compare the now with renewTime. The 2 values should close to each other, if the now is faster than renewTime too much (over a lease duration), we will report clock out-of-sync.

depends on: open-cluster-management-io/api#296

Related issue(s)

Fixes #

@openshift-ci openshift-ci bot requested review from elgnay and skeeey November 14, 2023 06:44
@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch 7 times, most recently from cee6ba2 to 7423aad Compare November 14, 2023 07:45
Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: 52 lines in your changes are missing coverage. Please review.

Comparison is base (f89d535) 61.75% compared to head (09fe3b2) 61.63%.
Report is 1 commits behind head on main.

Files Patch % Lines
pkg/registration/hub/lease/clocksynccontroller.go 42.85% 47 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #312      +/-   ##
==========================================
- Coverage   61.75%   61.63%   -0.13%     
==========================================
  Files         132      133       +1     
  Lines       13992    14083      +91     
==========================================
+ Hits         8641     8680      +39     
- Misses       4585     4632      +47     
- Partials      766      771       +5     
Flag Coverage Δ
unit 61.63% <42.85%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skeeey
Copy link
Member

skeeey commented Nov 14, 2023

I remembered we intentionally didn't use local time here, because we want to keep the time sync between the managed cluster and hub cluster

/cc @qiujian16

@openshift-ci openshift-ci bot requested a review from qiujian16 November 14, 2023 08:03
@xuezhaojun
Copy link
Member Author

I remembered we intentionally didn't use local time here, because we want to keep the time sync between the managed cluster and hub cluster

/cc @qiujian16

That's new to me, do we know the motivation of keeping clocks sync between hub and agents?
And if we intentionally avoid using local time, we should mention this as a prerequisite of using ocm.

@xuezhaojun
Copy link
Member Author

xuezhaojun commented Nov 14, 2023

Find the comment: open-cluster-management-io/registration#40 (comment)

It seems the initial implement is based on probeTimestamp but we change to using observedLease directly in this PR: open-cluster-management-io/registration#43

@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch from 7423aad to 1dd5d7d Compare November 15, 2023 08:14
@xuezhaojun xuezhaojun changed the title 🐛 Fix "unavailible" when hub and agent clock out of sync. 🐛 Add a condition to report when hub and agent clock out of sync. Nov 15, 2023
@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch 2 times, most recently from 37f0bc4 to d6825f2 Compare November 16, 2023 02:07
@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch 2 times, most recently from a1678bc to 2c0fb11 Compare November 16, 2023 06:37
@xuezhaojun
Copy link
Member Author

Hi, @qiujian16 , I tried the bareInformer approach as well, does that means:

  1. we need a independent controller watch leaseinformer using the following bare setting:
 informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            oldLease := oldObj.(*corev1.Lease)
            newLease := newObj.(*corev1.Lease)
            if oldLease.Spec.RenewTime.Before(newLease.Spec.RenewTime) {
              c.equeue(newLease)
            }
        },
    })
  1. In the new controller, every reconcile is triggered by a renewTime update event, and when it occurs, we assume the now on the agent-side which is lease's renewTime, should be close to(in 1 leaseDuration) the now on the hub-side which we can get by time.Now()

The concern of this approach is, the controller has a default 10 hours resync setting, this resync is not triggered by the renew update event.

This may cause misreport of out-of-sync condition.

@qiujian16
Copy link
Member

I think you should have a separated controller to update this condition. The reconcile logic is not quite the same as what lease did before.

@xuezhaojun
Copy link
Member Author

I think you should have a separated controller to update this condition. The reconcile logic is not quite the same as what lease did before.

Got it, do you think we can seprete a controller but still use savedLease approach to avoid misreport by 10 hours resync of controller-runtime?

@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch from 2c0fb11 to 6d81edb Compare November 16, 2023 10:19
@xuezhaojun xuezhaojun changed the title 🐛 Add a condition to report when hub and agent clock out of sync. ✨ Add a condition to report when hub and agent clock out of sync. Nov 16, 2023
@qiujian16
Copy link
Member

My understanding is if you make it in the different controller, you would not need a local cache. isn't it?

@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch from 6d81edb to cb1c709 Compare November 16, 2023 15:56
@xuezhaojun xuezhaojun closed this Nov 20, 2023
@xuezhaojun xuezhaojun reopened this Nov 20, 2023
@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch 6 times, most recently from d284fc8 to 8218bb3 Compare November 21, 2023 02:00
Copy link
Member

@qiujian16 qiujian16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

/assign @elgnay

Signed-off-by: xuezhaojun <zxue@redhat.com>
@xuezhaojun xuezhaojun force-pushed the fix-hub-agent-clock-out-of-sync branch from 8218bb3 to 09fe3b2 Compare November 30, 2023 05:51
Copy link
Contributor

openshift-ci bot commented Nov 30, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qiujian16, xuezhaojun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qiujian16
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 5, 2023
@openshift-merge-bot openshift-merge-bot bot merged commit 5884bc5 into open-cluster-management-io:main Dec 5, 2023
13 checks passed
@xuezhaojun xuezhaojun deleted the fix-hub-agent-clock-out-of-sync branch December 5, 2023 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants