🌱 Update rollout lib #276

serngawy · 2023-09-11T21:39:50Z

Summary

Related issue(s)

Fixes #

serngawy · 2023-09-11T21:40:40Z

/assign @haoqing0110

serngawy · 2023-09-11T21:42:22Z

Adding
@dhaiducek

haoqing0110 · 2023-09-12T07:50:04Z

cluster/v1alpha1/helpers.go

 		if needToRollout {
-			rolloutClusters[cluster] = status
+			rolloutClusters = append(rolloutClusters, status)


I think we need to check the length of rolloutClusters here; we should not put all of the existing clusters that need rollout into the result.

This iterates over the existing clusterRolloutStatus, and I was assuming it should never reach the length. Okay, I will add the check for safety.

For addons, it's possible that 3 managedclusteraddons have been created (so they exist) and their status is "ToApply". If we don't check the length, all 3 will be returned.

I'm assuming existingClusterStatus holds only the cluster statuses Progressing, Succeeded, Failed and TimeOut, because the ToApply status means the cluster workload does not exist yet. So the addon controller should not add a ToApply status to existingClusterStatus, because the addon (workload) does not exist. What do you think?
I will add a check to ignore the ToApply clusterStatus in the existingClusterStatus loop.

I added the check for the ToApply status, so if existingClusterStatus has all clusters in ToApply status, the RolloutResult will contain only as many clusters as maxConcurrency or the decisionGroup requires. I added a unit test for this case as well.

I think ToApply means the desired status has not been applied yet, whether it's a fresh install (no current workload) or an upgrade (the current workload does not match the desired workload).

Yes, I added the check for the ToApply state as well.
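
The behaviour agreed on in this thread can be sketched roughly as follows. This is a simplified illustration, not the code in helpers.go: the `RolloutStatus` constants, the shape of `ClusterRolloutStatus`, and the helper name `pickClustersToRollout` are assumptions made for the example. Existing statuses are walked first, ToApply entries there are deferred, and the second pass stops once `length` clusters have been collected.

```go
package main

import "fmt"

// RolloutStatus and ClusterRolloutStatus are simplified stand-ins for the
// types in cluster/v1alpha1; their exact shape here is an assumption.
type RolloutStatus int

const (
	ToApply RolloutStatus = iota
	Progressing
	Succeeded
	Failed
	TimeOut
	Skip
)

type ClusterRolloutStatus struct {
	ClusterName string
	Status      RolloutStatus
}

// pickClustersToRollout is a hypothetical helper, not the library function.
// ordered is the ordered list of cluster names from the placement decisions.
func pickClustersToRollout(ordered []string, existing []ClusterRolloutStatus, length int) []ClusterRolloutStatus {
	var rollout []ClusterRolloutStatus
	handled := make(map[string]bool)

	for _, s := range existing {
		switch s.Status {
		case ToApply:
			// Deferred: ToApply clusters are picked up in the second pass,
			// where the length limit applies.
			continue
		case Progressing, Failed:
			// Still in flight, so keep it in the result.
			rollout = append(rollout, s)
		}
		// Succeeded, TimeOut and Skip are marked handled and never re-added.
		handled[s.ClusterName] = true
	}

	for _, name := range ordered {
		if len(rollout) >= length {
			break
		}
		if handled[name] {
			continue
		}
		rollout = append(rollout, ClusterRolloutStatus{ClusterName: name, Status: ToApply})
	}
	return rollout
}

func main() {
	existing := []ClusterRolloutStatus{
		{ClusterName: "cluster1", Status: ToApply},
		{ClusterName: "cluster2", Status: ToApply},
		{ClusterName: "cluster3", Status: ToApply},
	}
	result := pickClustersToRollout([]string{"cluster1", "cluster2", "cluster3"}, existing, 2)
	for _, s := range result {
		fmt.Println(s.ClusterName)
	}
}
```

With three existing ToApply clusters and a length of 2, only cluster1 and cluster2 are returned, which is the addon case raised above.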

haoqing0110 · 2023-09-12T07:55:52Z

cluster/v1alpha1/helpers.go


-	clusterGroupKeys := clusterGroupsMap.GetOrderedGroupKeys()
+	for _, status := range existingClusterStatus {


This function rolls out by group, so why go through existingClusterStatus and put it into the result? What will the returned rolloutClusters look like?

It returns the clusters that are still progressing or failing (before the timeout), plus new clusters added to the group after the rollout starts.

I still have some concerns about going through existingClusterStatus first and then going through the rest of the clusters, for both progressivePerCluster and progressivePerGroup.

For example, say the env has 100 clusters with status Succeeded and then a rollout starts: what cluster status is existingClusterStatus expected to return? For addons, existingClusterStatus will return ToApply when the rollout starts; in that case, all the clusters will be returned by GetRolloutCluster(), which doesn't make sense.

I understand your point; same as above, let me know your thoughts.

Another gap after this change: in addon, one cma can have multiple placements, and the placements can overlap. For example:
The config-v1 existingClusterStatus has cluster1, and the config-v2 existingClusterStatus has cluster2 and cluster3. For config-v1, cluster2 is not in the existing clusters, but it also should not be treated as an added cluster and returned in the rollout result.

installStrategy:
  type: Placements
  placements:
  - name: placement1   # placement1 selects cluster1 and cluster2
    namespace: default
    configs:
    - group: addon.open-cluster-management.io
      resource: addonhubconfigs
      name: config-v1
    rolloutStrategy:
      type: Progressive
  - name: placement2   # placement2 selects cluster2 and cluster3
    namespace: default
    configs:
    - group: addon.open-cluster-management.io
      resource: addonhubconfigs
      name: config-v2
    rolloutStrategy:
      type: Progressive

I'm wondering whether we should just roll out the clusters returned by ClusterRolloutStatusFunc and put added/deleted clusters in another field.

The removed (deleted) clusters are already in another field, ClustersRemoved.
Regarding the added clusters, I'm not sure I understand the example above correctly. Config-v1 is associated with placement1 and config-v2 with placement2; those are 2 different configs with 2 different placements. However, as cluster2 is associated with both config-v1 and config-v2, cluster2 must apply config-v2 because it has the higher version. Is that correct?
If yes, I guess this is addon logic that needs to be handled inside the addon controller. What do you think?

Addon returns a SKIP status for cluster2 when rolling out config-v1, to indicate that it should not be rolled out. I'm not sure whether this PR can handle the case where cluster2 is not in existingClusterStatus and the user wants to skip rolling it out.

So the idea for existingClusterStatus is: if an existing cluster has the Skip status, it will not be considered for rolloutResult->ClustersToRollout. If the addon sets the cluster2 ClusterRolloutStatus to Skip (for config-v1 and placement1), then rolloutResult->ClustersToRollout will not include it. I added comments here as well.

Yes, addon might need to put the SKIP cluster in existingClusterStatus so that it won't be put in ClustersToRollout. Sounds good.
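
For the overlapping-placement example, the consumer-side logic could look something like the sketch below. Everything here is hypothetical and only illustrates the idea: `statusesForConfig`, `rolloutCandidates`, the `effectiveConfig` map, and the use of plain strings for statuses are assumptions, not the addon controller's actual code.

```go
package main

import "fmt"

// ClusterRolloutStatus is a simplified stand-in; statuses are plain strings
// here purely to keep the sketch short.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// statusesForConfig is a hypothetical consumer-side helper. For each cluster
// selected by the placement tied to config, it reports Skip when a different
// config is the effective one for that cluster, and ToApply otherwise.
func statusesForConfig(config string, selected []string, effectiveConfig map[string]string) []ClusterRolloutStatus {
	statuses := make([]ClusterRolloutStatus, 0, len(selected))
	for _, name := range selected {
		status := "ToApply"
		if effectiveConfig[name] != config {
			status = "Skip"
		}
		statuses = append(statuses, ClusterRolloutStatus{ClusterName: name, Status: status})
	}
	return statuses
}

// rolloutCandidates mimics the rule from this thread: a cluster with the
// Skip status is never placed in ClustersToRollout.
func rolloutCandidates(statuses []ClusterRolloutStatus) []string {
	var names []string
	for _, s := range statuses {
		if s.Status != "Skip" {
			names = append(names, s.ClusterName)
		}
	}
	return names
}

func main() {
	// cluster2 is selected by both placements, and config-v2 wins for it.
	effective := map[string]string{
		"cluster1": "config-v1",
		"cluster2": "config-v2",
		"cluster3": "config-v2",
	}
	v1 := statusesForConfig("config-v1", []string{"cluster1", "cluster2"}, effective)
	fmt.Println(rolloutCandidates(v1))
}
```

For config-v1 only cluster1 remains a rollout candidate, because cluster2 is reported as Skip rather than being left out of existingClusterStatus.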

dhaiducek

Thanks for the update! I've left some minor comments.

(Regarding the current discussion about removed clusters: In general, I like having the removed clusters available to the controller so that they can be handled right away, but I'm still processing the existing discussion about it.)

dhaiducek · 2023-09-13T20:02:06Z

cluster/v1alpha1/helpers.go

 }

+// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster.


Suggested change

// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster.

// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster for a given workload.

dhaiducek · 2023-09-14T15:13:52Z

cluster/v1alpha1/helpers.go

-	// ClustersToRollout is a map where the key is the cluster name and the value is the ClusterRolloutStatus.
-	ClustersToRollout map[string]ClusterRolloutStatus
-	// ClustersTimeOut is a map where the key is the cluster name and the value is the ClusterRolloutStatus.
-	ClustersTimeOut map[string]ClusterRolloutStatus
+	// ClustersToRollout is a slice of ClusterRolloutStatus that will be rolled out.
+	ClustersToRollout []ClusterRolloutStatus
+	// ClustersTimeOut is a slice of ClusterRolloutStatus that are timeout.
+	ClustersTimeOut []ClusterRolloutStatus
+	// ClustersRemoved is a slice of ClusterRolloutStatus that are removed.
+	ClustersRemoved []ClusterRolloutStatus


I'm assuming this is an array now so that the mapping to each cluster's name is clearer to the end user? I kind of liked the map[string]ClusterRolloutStatus for uniqueness and simplicity.

I understand it's easier to iterate with a map, but with these changes we don't really need the clusterName to be a map key. Plus, it's better to move the clusterName into the ClusterRolloutStatus struct so that each status is an identifiable object by itself.
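
To illustrate the trade-off discussed here: with the cluster name carried inside the struct, a slice keeps ordering, and a caller who wants map-style lookup can still build an index. The sketch below is a minimal example; `indexByName` is a hypothetical helper and the struct is reduced to two fields.

```go
package main

import "fmt"

// ClusterRolloutStatus is a simplified stand-in that carries its own name,
// as proposed in this thread.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// indexByName is a hypothetical helper for callers that still want
// map-style lookup. If a name appears twice, the later entry wins.
func indexByName(statuses []ClusterRolloutStatus) map[string]ClusterRolloutStatus {
	byName := make(map[string]ClusterRolloutStatus, len(statuses))
	for _, s := range statuses {
		byName[s.ClusterName] = s
	}
	return byName
}

func main() {
	clustersToRollout := []ClusterRolloutStatus{
		{ClusterName: "cluster1", Status: "ToApply"},
		{ClusterName: "cluster2", Status: "Progressing"},
	}

	// The slice preserves the order in which clusters were selected.
	for _, s := range clustersToRollout {
		fmt.Println(s.ClusterName, s.Status)
	}

	// Lookup by name is still one helper call away.
	byName := indexByName(clustersToRollout)
	fmt.Println(byName["cluster2"].Status)
}
```

The slice does not enforce uniqueness the way the map did, so that guarantee moves to whoever builds the slice.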

dhaiducek · 2023-09-14T15:26:01Z

cluster/v1alpha1/helpers.go

+	// Calculate the length for progressive rollOut
+	// If the MaxConcurrency not defined, total clusters length is considered as maxConcurrency.
+	clusterGroups = r.pdTracker.ExistingClusterGroupsBesides(groupKeys...)
+	length, err := calculateLength(strategy.Progressive.MaxConcurrency, len(clusterGroups.GetClusters()))


Aside: This is outside of this PR's intent, but could we consider renaming calculateLength() to something more descriptive like calculateRolloutSize()?

dhaiducek · 2023-09-14T15:32:01Z

cluster/v1alpha1/helpers.go

+		for _, cluster := range clusters {
+			if clusterStatus.ClusterName == cluster {
+				exist = true
+				currentClusterStatus = append(currentClusterStatus, clusterStatus)


Should there be a break here so that the loop doesn't continue?
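
A small sketch of why the break matters. The function and variable names below are hypothetical, and the loop is reduced to the matching logic only: without the break, a cluster name that appears more than once in `clusters` would append the same status repeatedly, and the inner loop would keep scanning after a match.

```go
package main

import "fmt"

// ClusterRolloutStatus is a simplified stand-in for the library type.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// statusesForClusters is a hypothetical helper that keeps only the statuses
// whose cluster name appears in clusters.
func statusesForClusters(statuses []ClusterRolloutStatus, clusters []string) []ClusterRolloutStatus {
	var current []ClusterRolloutStatus
	for _, st := range statuses {
		for _, cluster := range clusters {
			if st.ClusterName == cluster {
				current = append(current, st)
				// Stop at the first match so each status is appended at most once.
				break
			}
		}
	}
	return current
}

func main() {
	statuses := []ClusterRolloutStatus{
		{ClusterName: "cluster1", Status: "Progressing"},
		{ClusterName: "cluster2", Status: "Failed"},
	}
	// cluster1 is deliberately listed twice.
	got := statusesForClusters(statuses, []string{"cluster1", "cluster1", "cluster2"})
	fmt.Println(len(got))
}
```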

dhaiducek · 2023-09-14T15:33:17Z

cluster/v1alpha1/helpers.go

+		}
+
+		existingClusters[status.ClusterName] = true
+		if status.Status == Succeeded || status.Status == TimeOut {


Optional: This could be a switch statement.

okay, done.

dhaiducek · 2023-09-14T15:38:54Z

cluster/v1alpha1/helpers.go

+	if len(rolloutClusters) >= length {
 		return RolloutResult{
 			ClustersToRollout: rolloutClusters,
 			ClustersTimeOut:   timeoutClusters,


Isn't this check redundant since it's already in the loop above?

yes, removed.

haoqing0110 · 2023-09-15T02:33:26Z

cluster/v1alpha1/helpers.go

+	timeoutClusters := []ClusterRolloutStatus{}
+	existingClusters := make(map[string]bool)
+
+	for _, status := range existingClusterStatus {


Should we sort existingClusterStatus alphabetically by name here, so that the same result is returned every time?
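
One way to get a deterministic order is to sort a copy of the slice by cluster name before iterating. The sketch below uses the standard library `sort.SliceStable`; the helper name `sortByClusterName` and the reduced struct are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterRolloutStatus is a simplified stand-in for the library type.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// sortByClusterName is a hypothetical helper. It returns a sorted copy so
// that the caller's slice is left untouched.
func sortByClusterName(statuses []ClusterRolloutStatus) []ClusterRolloutStatus {
	sorted := make([]ClusterRolloutStatus, len(statuses))
	copy(sorted, statuses)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ClusterName < sorted[j].ClusterName
	})
	return sorted
}

func main() {
	existing := []ClusterRolloutStatus{
		{ClusterName: "cluster3", Status: "ToApply"},
		{ClusterName: "cluster1", Status: "Progressing"},
		{ClusterName: "cluster2", Status: "Failed"},
	}
	for _, s := range sortByClusterName(existing) {
		fmt.Println(s.ClusterName)
	}
}
```

Note that string comparison is byte-wise, so "cluster10" sorts before "cluster2"; that is still deterministic, which is the property asked for here.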

haoqing0110 · 2023-09-15T03:42:01Z

cluster/v1alpha1/helpers.go

+			// Set as false to consider the cluster in the decisionGroups iteration.
+			existingClusters[status.ClusterName] = false
+		case Failed, Progressing:
+			newStatus, needToRollout := determineRolloutStatusAndContinue(status, timeout)


Can we combine this switch case and the switch case in determineRolloutStatusAndContinue into one function?

okay, done.
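
A rough sketch of what a single combined function could look like. This is not the code that was merged: the name `classify`, the `LastTransitionTime` field, and passing the current time explicitly are assumptions made so the example is self-contained and deterministic. The timeout rule follows the doc comment quoted elsewhere in this review: a Progressing or Failed status that lasts longer than the timeout becomes TimeOut.

```go
package main

import (
	"fmt"
	"time"
)

type RolloutStatus int

const (
	ToApply RolloutStatus = iota
	Progressing
	Succeeded
	Failed
	TimeOut
	Skip
)

// ClusterRolloutStatus is a simplified stand-in; LastTransitionTime is an
// assumed field name for when the status last changed.
type ClusterRolloutStatus struct {
	ClusterName        string
	Status             RolloutStatus
	LastTransitionTime time.Time
}

// classify is a hypothetical single switch that returns the (possibly
// updated) status and whether the cluster still needs to be rolled out.
func classify(s ClusterRolloutStatus, timeout time.Duration, now time.Time) (ClusterRolloutStatus, bool) {
	switch s.Status {
	case ToApply:
		return s, true
	case Progressing, Failed:
		if now.Sub(s.LastTransitionTime) > timeout {
			s.Status = TimeOut
			return s, false
		}
		return s, true
	default:
		// Succeeded, TimeOut and Skip need no further rollout.
		return s, false
	}
}

// exampleNow is a fixed clock so the example does not depend on wall time.
var exampleNow = time.Date(2023, time.September, 15, 12, 0, 0, 0, time.UTC)

func main() {
	stale := ClusterRolloutStatus{
		ClusterName:        "cluster1",
		Status:             Progressing,
		LastTransitionTime: exampleNow.Add(-20 * time.Minute),
	}
	updated, needToRollout := classify(stale, 10*time.Minute, exampleNow)
	fmt.Println(updated.Status == TimeOut, needToRollout)
}
```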

qiujian16 · 2023-09-15T03:10:34Z

cluster/v1alpha1/helpers.go

 // +k8s:deepcopy-gen=false
-type RolloutHandler struct {
+type RolloutHandler[T any] struct {


should it be runtime.Object rather than any?

runtime.Object would require casting to the type in use, and we want the RolloutHandler to be instantiated with a specific type (e.g. manifestwork, policy, addon, etc.), not a runtime.Object.

But any is a looser constraint, isn't it?

Well, yes, "any" is kind of loose, but we are safe. The main idea behind making the generic type "any" is to avoid casting (or forcing a certain type) in the clusterRolloutStatusFunc implementation, similar to the unit test example here. The helper lib is also safe: we don't call clusterRolloutStatusFunc anywhere; it's just a func definition that will be used to create the existing ClustersRollOutStatus on the consumer API side. What do you think? Does that make sense?
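
The point about avoiding casts can be shown with a small sketch. The generic function type below mirrors the signature that appears in the controller-gen error later in this thread; everything else (`fakeWork`, `fakeWorkStatus`, `existingStatuses`, string statuses) is hypothetical. It assumes Go 1.18 or later for type parameters.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterRolloutStatus is a simplified stand-in for the library type.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// ClusterRolloutStatusFunc takes the consumer's own workload type T, so the
// implementation needs no type assertion.
type ClusterRolloutStatusFunc[T any] func(clusterName string, workload T) (ClusterRolloutStatus, error)

// fakeWork is a hypothetical workload type owned by the consumer.
type fakeWork struct {
	Applied bool
}

// fakeWorkStatus receives fakeWork directly; with a runtime.Object parameter
// it would first have to cast the argument back to its concrete type.
func fakeWorkStatus(clusterName string, w fakeWork) (ClusterRolloutStatus, error) {
	status := "ToApply"
	if w.Applied {
		status = "Succeeded"
	}
	return ClusterRolloutStatus{ClusterName: clusterName, Status: status}, nil
}

// existingStatuses is a hypothetical consumer-side helper that builds the
// existing status list, visiting clusters in name order.
func existingStatuses[T any](workloads map[string]T, statusFunc ClusterRolloutStatusFunc[T]) ([]ClusterRolloutStatus, error) {
	names := make([]string, 0, len(workloads))
	for name := range workloads {
		names = append(names, name)
	}
	sort.Strings(names)

	statuses := make([]ClusterRolloutStatus, 0, len(names))
	for _, name := range names {
		s, err := statusFunc(name, workloads[name])
		if err != nil {
			return nil, err
		}
		statuses = append(statuses, s)
	}
	return statuses, nil
}

func main() {
	works := map[string]fakeWork{
		"cluster1": {Applied: true},
		"cluster2": {Applied: false},
	}
	statuses, err := existingStatuses[fakeWork](works, fakeWorkStatus)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range statuses {
		fmt.Println(s.ClusterName, s.Status)
	}
}
```

The trade-off raised by the reviewer still holds: `any` places no constraint on T, so nothing stops a consumer from instantiating the handler with a type that is not a Kubernetes object.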

Running make update fails at the deepcopy function generation step:

Generating deepcopy funcs
W1017 14:10:32.195161 62703 parse.go:863] Making unsupported type entry "T" for: &types.TypeParam{check:(*types.Checker)(nil), id:0x1, obj:(*types.TypeName)(0xc009651900), index:0, bound:(*types.Interface)(0xc00009a960)}
F1017 14:10:32.221298 62703 deepcopy.go:890] Hit an unsupported type open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatusFunc[T] for open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatusFunc[T], from open-cluster-management.io/api/cluster/v1alpha1.RolloutHandler[T]

It seems the deepcopy lib doesn't support generic types yet: kubernetes/gengo#225

Either we revert to the normal struct, or we disable package-level generation and explicitly add the comment to each struct that needs deepcopy function generation: #288

cc @haoqing0110

Hi @dhaiducek, I tested controller-gen with the latest api main branch; it fails and generates unexpected code for the generic type:

./_output/tools/bin/controller-gen object:headerFile="hack/empty.txt" paths="./cluster/v1alpha1"
open-cluster-management.io/api/cluster/v1alpha1:-: invalid type: func(clusterName string, workload T) (open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatus, error)
Error: not all generators ran successfully
run `controller-gen object:headerFile=hack/empty.txt paths=./cluster/v1alpha1 -w` to see all available markers, or `controller-gen object:headerFile=hack/empty.txt paths=./cluster/v1alpha1 -h` for usage

As I mentioned above, we can also switch to explicitly declaring which structs need generated deepcopy funcs: #288

Actually it seems the gengo tooling may not be respecting the deepcopy-gen=false tag? In any case, I don't have a great deal of familiarity with deepcopy, but here's a commit that might work as an alternative to selectively generating the APIs:

dhaiducek@3734d0e

Thanks @xuezhaojun, I'm able to reproduce it as well. I'm seeing the same issue as @dhaiducek; it sounds like the deepCopy generator does not respect the deepcopy-gen=false tag!
Based on the deepCopy code doc here, setting deepcopy-gen=false should work.

Thanks for providing the alternative! I will go with it.

qiujian16 · 2023-09-15T03:12:14Z

cluster/v1alpha1/helpers.go

@@ -83,20 +88,21 @@ func NewRolloutHandler(pdTracker *clusterv1beta1.PlacementDecisionClustersTracke
 //
 // ClustersTimeOut: If the cluster status is Progressing or Failed, and the status lasts longer than timeout defined in strategy,
 // will list them RolloutResult.ClustersTimeOut with status TimeOut.
-func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {
+// func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {


remove this line

qiujian16 · 2023-09-15T03:14:16Z

cluster/v1alpha1/helpers.go

+}
+
+func progressivePerCluster(clusterGroupsMap clusterv1beta1.ClusterGroupsMap, length int, timeout time.Duration, existingClusterStatus []ClusterRolloutStatus) RolloutResult {
+	rolloutClusters := []ClusterRolloutStatus{}


use var rolloutClusters, timeoutClusters []ClusterRolloutStatus
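
A short sketch of the suggested declaration style. A nil slice declared with `var` works with `append`, `len` and `range` just like an empty literal, and it allocates nothing until something is appended. The helper name `splitByTimeout` and the string statuses are assumptions for illustration.

```go
package main

import "fmt"

// ClusterRolloutStatus is a simplified stand-in for the library type.
type ClusterRolloutStatus struct {
	ClusterName string
	Status      string
}

// splitByTimeout is a hypothetical helper that separates timed-out clusters
// from the rest, using nil slices declared in a single var statement.
func splitByTimeout(statuses []ClusterRolloutStatus) ([]ClusterRolloutStatus, []ClusterRolloutStatus) {
	var rolloutClusters, timeoutClusters []ClusterRolloutStatus
	for _, s := range statuses {
		if s.Status == "TimeOut" {
			timeoutClusters = append(timeoutClusters, s)
			continue
		}
		rolloutClusters = append(rolloutClusters, s)
	}
	return rolloutClusters, timeoutClusters
}

func main() {
	rollout, timedOut := splitByTimeout([]ClusterRolloutStatus{
		{ClusterName: "cluster1", Status: "Progressing"},
	})
	// timedOut stays nil because nothing was appended to it.
	fmt.Println(len(rollout), len(timedOut), timedOut == nil)
}
```

One difference to keep in mind: a nil slice is not equal to an empty non-nil slice under `reflect.DeepEqual`, so tests that compare results against `[]ClusterRolloutStatus{}` may need adjusting.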

Signed-off-by: melserngawy <melserng@redhat.com>

qiujian16 · 2023-09-25T14:02:16Z

/approve
/lgtm

openshift-ci · 2023-09-25T14:02:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qiujian16, serngawy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [qiujian16]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: melserngawy <melserng@redhat.com>

Signed-off-by: melserngawy <melserng@redhat.com>
Fix: run `make update` fail because deepcopy doesn't support generic type. (open-cluster-management-io#288)
Signed-off-by: xuezhaojun <zxue@redhat.com>
Fix verify ci step missing. (open-cluster-management-io#289)
Signed-off-by: xuezhaojun <zxue@redhat.com>
:bug: Use controller-runtime for deepcopy generation for cluster:v1alpha1 (open-cluster-management-io#291)
* Revert "Fix: run `make update` fail because deepcopy doesn't support generic type. (open-cluster-management-io#288)"
  This reverts commit ae208c8.
  Signed-off-by: Dale Haiducek <19750917+dhaiducek@users.noreply.github.com>
* Use `controller-gen` for deepcopy cluster:v1alpha1
  GenGo isn't respecting the `+k8s:deepcopy-gen=false` tag to skip generation for the generic type
  Signed-off-by: Dale Haiducek <19750917+dhaiducek@users.noreply.github.com>
---------
Signed-off-by: Dale Haiducek <19750917+dhaiducek@users.noreply.github.com>
🐛 add ca bundle to addon proxy settings (open-cluster-management-io#293)
Signed-off-by: Yang Le <yangle@redhat.com>
Revert "remove ClusterSet ClusterSetBinding API version v1beta1 (open-cluster-management-io#266)"
This reverts commit 9675ab7.
Signed-off-by: haoqing0110 <qhao@redhat.com>

Fix: run `make update` fail because deepcopy doesn't support generic type. (#288)
Fix verify ci step missing. (#289)
:bug: Use controller-runtime for deepcopy generation for cluster:v1alpha1 (#291)
* Revert "Fix: run `make update` fail because deepcopy doesn't support generic type. (#288)"
  This reverts commit ae208c8.
* Use `controller-gen` for deepcopy cluster:v1alpha1
  GenGo isn't respecting the `+k8s:deepcopy-gen=false` tag to skip generation for the generic type
---------
🐛 add ca bundle to addon proxy settings (#293)
Revert "remove ClusterSet ClusterSetBinding API version v1beta1 (#266)"
This reverts commit 9675ab7.
Signed-off-by: haoqing0110 <qhao@redhat.com>
Co-authored-by: Mohamed ElSerngawy <melserng@redhat.com>

openshift-ci bot requested review from mdelder and qiujian16 September 11, 2023 21:39

openshift-ci bot assigned haoqing0110 Sep 11, 2023

serngawy force-pushed the rolloutLib branch from 1b3f924 to bd8457a Compare September 11, 2023 21:50

haoqing0110 reviewed Sep 12, 2023

serngawy force-pushed the rolloutLib branch 3 times, most recently from b6551e1 to c92b89c Compare September 13, 2023 15:13

dhaiducek reviewed Sep 14, 2023

serngawy force-pushed the rolloutLib branch from c92b89c to 718293e Compare September 14, 2023 18:08

haoqing0110 reviewed Sep 15, 2023

serngawy force-pushed the rolloutLib branch from 718293e to 003b118 Compare September 15, 2023 16:21

qiujian16 reviewed Sep 20, 2023

Update rollout lib

ac0fe2c

Signed-off-by: melserngawy <melserng@redhat.com>

serngawy force-pushed the rolloutLib branch from 003b118 to ac0fe2c Compare September 20, 2023 21:43

openshift-ci bot assigned qiujian16 Sep 25, 2023

openshift-ci bot added the lgtm label Sep 25, 2023

openshift-ci bot added the approved label Sep 25, 2023

openshift-merge-robot merged commit bf4f47e into open-cluster-management-io:main Sep 25, 2023
10 checks passed

dhaiducek mentioned this pull request Sep 25, 2023

✨ Replace Timeout with RolloutConfig #281

Merged

xuezhaojun mentioned this pull request Oct 17, 2023

🐛 run make update fail because deepcopy doesn't support generic … #288

Merged

haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023

Update rollout lib (open-cluster-management-io#276)

3da7124

Signed-off-by: melserngawy <melserng@redhat.com>

haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023

Update rollout lib (open-cluster-management-io#276)

1a09824

Signed-off-by: melserngawy <melserng@redhat.com>

haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023

Update rollout lib (open-cluster-management-io#276)

c281018

Signed-off-by: melserngawy <melserng@redhat.com>

