Fixed re-ip error when restart the cluster #856

cchen-vertica · 2024-07-11T01:56:20Z

This PR fixes the re-ip error that occurs when restarting the cluster by checking that all pods are up and that NMA is running in every pod. Without the check, vclusterOps re-ip could fail when NMA is not running or when half of the primary nodes are not up.

roypaulin · 2024-07-11T20:52:48Z

pkg/controllers/vdb/podfacts.go

@@ -389,6 +392,7 @@ func (p *PodFacts) collectPodByStsIndex(ctx context.Context, vdb *vapi.VerticaDB
 			return err
 		}
 		pf.hasNMASidecar = vk8s.HasNMAContainer(&pod.Spec)
+		pf.isNMAContainerReady = vk8s.IsNMAContainerReady(pod)


What about when NMA is not running in a separate container?

This will return false. NMA now always runs as a sidecar, so I didn't consider that case.

If we need to cover the case where NMA is not running in a separate container, we would have to update gatherScript to check whether the NMA process exists. However, that only applies to old Vertica versions, and it isn't worth doing because the re-ip error is rare and doesn't affect anything.
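
For readers following the thread, here is a minimal sketch of the behaviour described above: a readiness check that returns false when no NMA sidecar is found. It does not reproduce the operator's `vk8s.IsNMAContainerReady`; the `pod` and `containerStatus` types are simplified stand-ins for the Kubernetes pod status types, and the container name `nma` is an assumption made for illustration.

```go
package main

import "fmt"

// containerStatus and pod are simplified, hypothetical stand-ins for the
// Kubernetes pod status types; only the fields needed here are modelled.
type containerStatus struct {
	Name  string
	Ready bool
}

type pod struct {
	ContainerStatuses []containerStatus
}

// nmaContainerName is an assumed name for the NMA sidecar container.
const nmaContainerName = "nma"

// isNMAContainerReady reports whether the pod has an NMA sidecar whose
// status is ready. With no sidecar present it returns false, which is the
// case raised in the review question.
func isNMAContainerReady(p *pod) bool {
	for _, cs := range p.ContainerStatuses {
		if cs.Name == nmaContainerName {
			return cs.Ready
		}
	}
	return false
}

func main() {
	withSidecar := &pod{ContainerStatuses: []containerStatus{
		{Name: "server", Ready: true},
		{Name: "nma", Ready: true},
	}}
	withoutSidecar := &pod{ContainerStatuses: []containerStatus{
		{Name: "server", Ready: true},
	}}
	fmt.Println("sidecar ready:", isNMAContainerReady(withSidecar))
	fmt.Println("no sidecar:", isNMAContainerReady(withoutSidecar))
}
```

Under these assumptions, a pod from an older deployment without the sidecar is simply never treated as NMA-ready, which is the trade-off accepted in the reply.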

roypaulin · 2024-07-12T16:21:35Z

pkg/controllers/vdb/restart_reconciler.go

+	canReIPAllDownPods := containPods(reIPPods, downPods)
+	if !canReIPAllDownPods {
+		r.Log.Info("Not all restartable pods are qualified to re-ip. Need to requeue restart reconciler")
+		return ctrl.Result{Requeue: true}, nil
+	}


Can you explain this part?
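
One plausible reading of the hunk, sketched under assumptions: the reconciler requeues unless every down pod is also in the set of pods that qualify for re-ip. The thread does not show the real `containPods` helper, so the `podFact` type, its `name` field, and the function `containsAllPods` below are hypothetical illustrations of a subset check, not the PR's code.

```go
package main

import "fmt"

// podFact is a hypothetical, trimmed-down stand-in for the operator's
// per-pod facts; only a name is needed for this illustration.
type podFact struct {
	name string
}

// containsAllPods reports whether every pod in subset also appears in pods.
// An empty subset is trivially contained.
func containsAllPods(pods, subset []*podFact) bool {
	seen := make(map[string]struct{}, len(pods))
	for _, p := range pods {
		seen[p.name] = struct{}{}
	}
	for _, p := range subset {
		if _, ok := seen[p.name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	reIPPods := []*podFact{{name: "pod-0"}, {name: "pod-1"}}
	downPods := []*podFact{{name: "pod-0"}, {name: "pod-1"}, {name: "pod-2"}}

	// pod-2 is down but not yet qualified for re-ip, so a reconciler
	// following this logic would requeue instead of running re-ip.
	if !containsAllPods(reIPPods, downPods) {
		fmt.Println("requeue: not all down pods can be re-ip'd yet")
	}
}
```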

roypaulin · 2024-07-15T13:39:41Z

@cchen-vertica I can't log in to my PC; I am being asked to provide a BitLocker recovery key. Can you reach out to me through LinkedIn or my personal email (paulin.nguetsop@yahoo.com)?

cchen-vertica · 2024-07-15T16:02:29Z

Pinged you through LinkedIn.

roypaulin · 2024-07-15T17:39:12Z

pkg/controllers/vdb/onlineupgrade_reconciler.go

@@ -278,6 +297,11 @@ func (r *OnlineUpgradeReconciler) postCreateNewSubclustersMsg(ctx context.Contex
 // exists in the main cluster. This is a pre-step to setting up replica group B, which will
 // eventually exist in its own sandbox.
 func (r *OnlineUpgradeReconciler) assignSubclustersToReplicaGroupB(ctx context.Context) (ctrl.Result, error) {
+	// If we have already promoted sandbox to main, we don't need to do this step
+	if vmeta.GetOnlineUpgradeSandboxPromoted(r.VDB.Annotations) == vmeta.SandboxPromotedTrue {


Move this to vdb: vdb.IsOnlineUpgradeSandboxPromoted().
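
A minimal sketch of what such a helper could look like, assuming the check stays a comparison of an annotation value. The `VerticaDB` struct here is a stripped-down stand-in, and the annotation key and value constants are hypothetical placeholders for whatever `vmeta.GetOnlineUpgradeSandboxPromoted` and `vmeta.SandboxPromotedTrue` use.

```go
package main

import "fmt"

// Hypothetical annotation key and value, used only for this illustration.
const (
	sandboxPromotedAnnotation = "example.com/online-upgrade-sandbox-promoted"
	sandboxPromotedTrue       = "true"
)

// VerticaDB is a stripped-down stand-in for the operator's API type.
type VerticaDB struct {
	Annotations map[string]string
}

// IsOnlineUpgradeSandboxPromoted wraps the annotation lookup so callers do
// not repeat the comparison. Reading a missing key (or a nil map) yields
// the empty string, so the method returns false in that case.
func (v *VerticaDB) IsOnlineUpgradeSandboxPromoted() bool {
	return v.Annotations[sandboxPromotedAnnotation] == sandboxPromotedTrue
}

func main() {
	vdb := &VerticaDB{Annotations: map[string]string{
		sandboxPromotedAnnotation: sandboxPromotedTrue,
	}}
	if vdb.IsOnlineUpgradeSandboxPromoted() {
		fmt.Println("sandbox already promoted; skip assigning subclusters to replica group B")
	}
}
```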

roypaulin · 2024-07-15T17:45:00Z

pkg/controllers/vdb/podfacts.go

+		}
+		return 0
+	})
+	return restartablePrimaryNodeCount >= (primaryNodeCount+1)/2


restartablePrimaryNodeCount > primaryNodeCount/2 looks simpler.

restartablePrimaryNodeCount > primaryNodeCount/2 does not work. If restartablePrimaryNodeCount=2 and primaryNodeCount=4, it returns false, which is wrong.
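
The difference between the two expressions comes down to integer division, and can be checked with a small standalone program. The function names `atLeastHalf` and `strictMajority` are labels chosen for this sketch; only the two expressions themselves come from the review thread.

```go
package main

import "fmt"

// atLeastHalf is the expression used in the PR: it holds when the
// restartable count reaches half of the primaries, rounded up.
func atLeastHalf(restartable, primaries int) bool {
	return restartable >= (primaries+1)/2
}

// strictMajority is the expression suggested in review: it holds only
// when more than half of the primaries are restartable.
func strictMajority(restartable, primaries int) bool {
	return restartable > primaries/2
}

func main() {
	// Compare both checks for small cluster sizes.
	for primaries := 3; primaries <= 5; primaries++ {
		for restartable := 0; restartable <= primaries; restartable++ {
			a := atLeastHalf(restartable, primaries)
			b := strictMajority(restartable, primaries)
			if a != b {
				fmt.Printf("differ: restartable=%d primaries=%d atLeastHalf=%v strictMajority=%v\n",
					restartable, primaries, a, b)
			}
		}
	}
}
```

For odd primary counts the two checks agree; they differ only when the count is even and exactly half of the primaries are restartable, which is the 2-of-4 case cited in the reply.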

cchen-vertica requested a review from roypaulin as a code owner July 11, 2024 01:56

cchen-vertica force-pushed the cchen/fix-openshift-issues branch from ddd9ef8 to 18b146c Compare July 11, 2024 17:45

fixed reip error before start_db

adac04c

cchen-vertica force-pushed the cchen/fix-openshift-issues branch from 18b146c to adac04c Compare July 11, 2024 17:45

fixed e2e tests

4299826

roypaulin reviewed Jul 11, 2024

cchen-vertica added 2 commits July 11, 2024 18:57

fixed e2e tests 2

aeb7d6d

fixed e2e tests 3

2b529b1

roypaulin reviewed Jul 12, 2024

improved online upgrade and re-ip process

0556e69

roypaulin reviewed Jul 15, 2024

addressed the comments

dc52061

cchen-vertica requested a review from roypaulin July 15, 2024 19:16

roypaulin approved these changes Jul 16, 2024

cchen-vertica merged commit 9de7c05 into vnext Jul 16, 2024
31 checks passed

cchen-vertica deleted the cchen/fix-openshift-issues branch July 16, 2024 16:58
