Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rollout helper functions #254

Conversation

haoqing0110
Copy link
Member

@haoqing0110 haoqing0110 commented Jul 20, 2023

An idea to treat the common rollout strategy API as a duck type, and provide some helper functions for the common rollout API.

  • The input would be rollout strategy, the placement decision tracker, and the resource status on each managed cluster
  • The output would be a list of clusters that need to update.

open-cluster-management-io/ocm#236
open-cluster-management-io/ocm#237

@haoqing0110
Copy link
Member Author

cc @qiujian16

cluster/v1alpha1/helpers.go Outdated Show resolved Hide resolved
objs map[string]RolloutStatus
}

func NewRolloutHandler(rolloutStrategy RolloutStrategy, pdTpdTracker clusterv1beta1.PlacementDecisionClustersTracker, objs map[string]RolloutStatus) *RolloutHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of map[string]RolloutStatus, will be a func(clusterName string) RolloutStatus betterr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good suggestion. Fixed.

}
}

func (r *RolloutHandler) GetRolloutCluster() []string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not quite clear on what is the output, is this the list cluster that needs to be updated with current state? If so I think the the input should have

  1. the strategy
  2. the current state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the output.

if subclusters, ok := clusterGroups[key]; ok {
for _, cluster := range subclusters.UnsortedList() {
if status, exist := r.objs[cluster]; exist {
if canIAppendToResult(status, timeout) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canIAppendToResult is this actually isUpgrading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only is upgrading, but also some failed but skipped clusters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need a better function name.

@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch 2 times, most recently from 1b2fcb6 to deb0d2c Compare July 24, 2023 06:55
Copy link
Member

@serngawy serngawy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general sounds good and the class cluster/v1alpha1/helpers.go need unit test

var RolloutClock = clock.Clock(clock.RealClock{})
var maxTimeDuration = time.Duration(math.MaxInt64)

const (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all cases are upgrade, better names like
ToApply
Progressing
Succeed
Failed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will think about the name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new name sounds good, have updated the status name.

@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch 2 times, most recently from f4462cd to 25c3d36 Compare July 25, 2023 10:20
}
}

// Return the strategy actual take effect and a list of cluster that needs to update with current state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RolloutHandler already have the RolloutStrategy struct that already has its rolloutType. Why do we have GetRolloutAllClusters, GetRolloutProgressive and GetRolloutProgressivePerGroup Functions as public ?
GetRolloutCluster Func should be enough ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree, modify them as private func.

// placement decision tracker
pdTracker *clusterv1beta1.PlacementDecisionClustersTracker
// Return the rollout status on each managed cluster
clusterRolloutStatusFunc ClusterRolloutStatusFunc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the clusterRolloutStatusFunc should be implemented in the Consumer controller ? unit test should be good place for an example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, clusterRolloutStatusFunc should be implemented by the Consumer. Add more UT.

@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch 3 times, most recently from bc40013 to ee0ce0c Compare July 25, 2023 13:51
@haoqing0110 haoqing0110 changed the title [WIP] rollout helper functions rollout helper functions Jul 26, 2023
@haoqing0110
Copy link
Member Author

/assign @qiujian16 @serngawy

Based on our offline discussion, the full code and UT are now added, plz help review and provide more comments.

@haoqing0110
Copy link
Member Author

/cc @mprahl

return &RolloutHandler{
rolloutStrategy: rolloutStrategy,
pdTracker: pdTracker,
clusterRolloutStatusFunc: clusterRolloutStatusFunc,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clusterRolloutStatusFunc and rolloutStrategy should be the input of GetRolloutCluster, you can init the handler at the beginning, and each time just give handler the strategy and current status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the code.

Copy link
Member

@qiujian16 qiujian16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Need @serngawy and @mprahl to signoff

cluster/v1alpha1/helpers.go Outdated Show resolved Hide resolved

// The input is a duck type RolloutStrategy and a ClusterRolloutStatusFunc to return the rollout status on each managed cluster.
// Return the strategy actual take effect and a list of cluster that needs to update with current state.
func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, map[string]ClusterRolloutStatus, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Consumer API controller point of view; I think the ClusterRolloutStatus should contain the cluster groupIndex/Name ? How we can express that the Mandatory/DecisionGroups clusters are done or progressing ? Otherwise the logic of connecting the cluster to the decisionGroup need to be maintained in the controller as well. What do you think ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense I think. Can have GroupKey added, only for the clusters need to update eg,

type ClusterRolloutStatus struct {
        // cluster group name and group key
        GroupKey GroupKey
	// rollout status
	Status RolloutStatus
	// the last transition time of rollout status
	LastTransitionTime *time.Time
}

func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, map[string]ClusterRolloutStatus, error) {...

Or maybe a whole picture, eg,

type ClustersRolloutStatus map[string]ClusterRolloutStatus

type RolloutStatus struct {
	MandatoryGroupsSucceed bool
       // progressive with mandatory groups or progressive per group will return this field
        DecisionGroupsStatus map[GroupKey]ClustersRolloutStatus 
        // all or progressive non mandatory group clusters will return in this field
	ClustersStatus ClustersRolloutStatus 
        // clusters to rollout, all the rollout strategy will return this field
	ClustersToRollout ClustersRolloutStatus
}

func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutStatus, error) {...

can discuss more in this part.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the offline discussion, the returned ClusterRolloutStatus will contain the group name and group index.

Also, since the timeout clusters are intended to skip and continue to rollout others, the returned value would be like below, separate clusters to rollout and clusters that are timeout.

type RolloutResult struct {
	// clusters to rollout
	ClustersToRollout map[string]ClusterRolloutStatus
	// clusters that are timeout
	ClustersTimeOut map[string]ClusterRolloutStatus
}

cluster/v1alpha1/helpers.go Outdated Show resolved Hide resolved
return true
case Succeed:
return false
case Failed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need to distinguish between timeout and failed. It can be failed before reaching timeout AND as well it can be timed out but not failed (as the status will continue to reconcile it will reach failed/success later).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Failed or progressing is the status provided by consumer contoller's ClusterRolloutStatusFunc, and should return timeout in the result of ClusterRolloutStatus by the RolloutHandler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the func, when the cluster staus is progressing or failed and reach timeout, will return status TimeOut.

switch status.Status {
case ToApply:
return true
case Progressing:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check for timeout within progressing state, otherwise it will progress forever ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add timeout check for progressing.

@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch from e7f8ac2 to 95ea637 Compare August 3, 2023 07:56
// then the status will be set to timeout
TimeOut
// skip rollout on this cluster
Skip RolloutStatus = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, what is the case for skip ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this because in the addon upgrade, when the user defines 2 placements, there might be clusters that overlap. Since each cluster should only belong to one placement, it should be skipped when going through another placement.

@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch from 95ea637 to 6565f0c Compare August 3, 2023 08:58
// The input is a duck type RolloutStrategy and a ClusterRolloutStatusFunc to return the rollout status on each managed cluster.
// Return the strategy actual take effect and a list of cluster that needs to update with current state.
func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {
switch rolloutStrategy.Type {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior if the rollout strategy changes in the middle of a rollout?

Copy link
Member Author

@haoqing0110 haoqing0110 Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the rollout strategy changes, it should trigger another reconcile to call GetRolloutCluster() I think. The returned "clusters to rollout" based on the current status snapshot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing may not clear here, if a cluster removed/added from the placement (even without changing the rolloutStrategy) the consumer controller should track this behavior from placementDecisionTracker here
Is that correct ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where "here" points to (the URL doesn't work), the consumer controller needs to run pdTracker.Get() to refresh the placement decisions and will also get the added and deleted clusters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yap, the URL pointing to the pdTracker.Get() function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say for example the mandatory decision group completes and the controller has moved on to the next group (we'll call it group2). Then a new cluster is added to the mandatory decision group, GetRolloutCluster would return just that single cluster that was added and the controller would ignore group2 until that new cluster finishes?

If that's the case, what happens to the partial rollout in group2 if that new cluster in the mandatory decision group does not succeed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetRolloutCluster would return just that single cluster that was added and the controller would ignore group2 until that new cluster finishes?

Yes, I think so.

If that's the case, what happens to the partial rollout in group2 if that new cluster in the mandatory decision group does not succeed?

If group2 is partially rollout and the new cluster fails, the group2 rollout progress won't continue (clusters already upgraded won't rollback) until the new cluster succeeds.

cluster/v1alpha1/helpers.go Outdated Show resolved Hide resolved
cluster/v1alpha1/helpers.go Outdated Show resolved Hide resolved

// The input is a duck type RolloutStrategy and a ClusterRolloutStatusFunc to return the rollout status on each managed cluster.
// Return the strategy actual take effect and a list of cluster that needs to update with current state.
func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected behavior for the caller to know when a cluster or cluster group has timed out?

It seems to me that somehow returning a context with a deadline for every potential timeout would allow the caller to poll the channel of the context to know if a timeout occurred and then verify if the rollout finished before that timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serngawy mentions that some policies might keep in progressing status on some clusters for a long time, don't want them to block upgrading others, and also need to know which are timeout, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, during progressive rollout if a policies(or manifestwork) continues to be in nonCompliant state we need Timeout to not stop applying the policy for other clusters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm getting at is how is the controller to know when a timeout has occurred? I'd like this to be event driven as opposed to continuously polling GetRolloutCluster. My idea was a context with a timeout could be returned so the controller can be notified when the timeout has occurred.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to think about it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to return the timeout time in the ClusterRolloutStatus , so that the consumer controller can use this time to requeue the related resource like syncCtx.Queue().AddAfter(key, duration)

type ClusterRolloutStatus struct {
	// cluster group key
	// optional field
	GroupKey clusterv1beta1.GroupKey
	// rollout status
	// required field
	Status RolloutStatus
	// the last transition time of rollout status
	// this is used to calculate timeout for progressing and failed status
	// optional field
	LastTransitionTime *metav1.Time
	// the timeout time when status is progressing or failed
	// optional field
	TimeOutTime *metav1.Time
}

Signed-off-by: haoqing0110 <qhao@redhat.com>
Signed-off-by: haoqing0110 <qhao@redhat.com>
@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch 2 times, most recently from 236908f to 5c83fa4 Compare August 8, 2023 06:20
Signed-off-by: haoqing0110 <qhao@redhat.com>
@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch 2 times, most recently from 32c05b9 to 051d531 Compare August 8, 2023 07:15
newstatus, needtorollout := needToRollout(status, timeout)
status.Status = newstatus.Status
status.TimeOutTime = newstatus.TimeOutTime
newStatus, needToRollout := needToRollout(status, timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's confusing that needToRollout is the same name as the needToRollout function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to newStatus, needToRollout := determineRolloutStatusAndContinue(status, timeout)

Signed-off-by: haoqing0110 <qhao@redhat.com>
@haoqing0110 haoqing0110 force-pushed the common-rollout-strategy-helper branch from 051d531 to f4db8ba Compare August 9, 2023 01:56
Copy link
Member

@mprahl mprahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! We can always iterate on this later once the policy integration design is done and implementation has started.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haoqing0110, mprahl, qiujian16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2b2399b into open-cluster-management-io:main Aug 9, 2023
7 checks passed
@mprahl
Copy link
Member

mprahl commented Aug 9, 2023

Sorry @haoqing0110 , I didn't think my approval would count as a lgtm to cause it to merge the PR.

@haoqing0110
Copy link
Member Author

@mprahl looks good, github is so smart now.

@haoqing0110 haoqing0110 deleted the common-rollout-strategy-helper branch August 10, 2023 01:26
Copy link
Member

@dhaiducek dhaiducek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really late the the party here, sorry. I have a couple comments/questions as I'm starting to poke at the library. There may be more as I dive deeper.

Comment on lines +30 to +31
// Succeed indicates that the resource's desired status is applied and last applied status is successful.
Succeed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this would fit better as "Succeeded"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed a PR in case there's an interest in updating this: #267

type placementDecisionGetter struct {
client client.Client
type FakePlacementDecisionGetter struct {
FakeDecisions []*PlacementDecision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose for this being an array of pointers? It seems it'd be much more straightforward for the users to not have the pointer there, but maybe I'm missing an implementation detail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the testing code to provide a fake placementDecisionGetter to easily initiate some decisions.
In the real code, the PlacementDecisionGetter would be like this https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/helper/helpers.go#L475

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhaiducek ^^^^

Copy link
Member

@dhaiducek dhaiducek Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link. I wasn't aware of the cluster lister. The purpose of the pointer is that it's grabbing it from the cache rather than the API server. I figured it might be something like that but hadn't found the code:

err = cache.ListAll(s.indexer, selector, func(m interface{}) {
ret = append(ret, m.(*v1beta1.PlacementDecision))
})
return ret, err

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The placementDecisionGetter needs to be implemented by developer self, needs to implement List() interface.

In the example of https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/helper/helpers.go#L475, the clusterlister is passed from https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_controller.go#L120 placementInformer.Lister().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants