Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed re-ip error when restart the cluster #856

Merged
merged 6 commits into from
Jul 16, 2024

Conversation

cchen-vertica
Copy link
Collaborator

This PR fixed re-ip error when restart the cluster by checking all pods are up and NMA is running in all pods. Without the check, vclusterOps re-ip could fail when NMA is not running or half of the primary nodes are not up.

@@ -389,6 +392,7 @@ func (p *PodFacts) collectPodByStsIndex(ctx context.Context, vdb *vapi.VerticaDB
return err
}
pf.hasNMASidecar = vk8s.HasNMAContainer(&pod.Spec)
pf.isNMAContainerReady = vk8s.IsNMAContainerReady(pod)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about when nma is not running in a separate container?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return false. Now NMA is always running as a sidecar so I didn't consider that case.

If we need to cover the case nma is not running in a seprate container, we need to update gatherScript to check if nma process exists. However, that is for the old vertica versions, and it isn't worth doing that because the ReIP error is rare and don't affect anything.

Comment on lines 215 to 219
canReIPAllDownPods := containPods(reIPPods, downPods)
if !canReIPAllDownPods {
r.Log.Info("Not all restartable pods are qualified to re-ip. Need to requeue restart reconciler")
return ctrl.Result{Requeue: true}, nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this part?

@roypaulin
Copy link
Collaborator

@cchen-vertica I can't log in to my PC. I am asked to provide BitLocker recovery key. Can you reach out to me through linkedin or my personal email(paulin.nguetsop@yahoo.com).

@cchen-vertica
Copy link
Collaborator Author

@cchen-vertica I can't log in to my PC. I am asked to provide BitLocker recovery key. Can you reach out to me through linkedin or my personal email(paulin.nguetsop@yahoo.com).

Pinged you through LinkedIn.

@@ -278,6 +297,11 @@ func (r *OnlineUpgradeReconciler) postCreateNewSubclustersMsg(ctx context.Contex
// exists in the main cluster. This is a pre-step to setting up replica group B, which will
// eventually exist in its own sandbox.
func (r *OnlineUpgradeReconciler) assignSubclustersToReplicaGroupB(ctx context.Context) (ctrl.Result, error) {
// If we have already promoted sandbox to main, we don't need to do this step
if vmeta.GetOnlineUpgradeSandboxPromoted(r.VDB.Annotations) == vmeta.SandboxPromotedTrue {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to vdb: vdb.IsOnlineUpgradeSandboxPromoted().

}
return 0
})
return restartablePrimaryNodeCount >= (primaryNodeCount+1)/2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restartablePrimaryNodeCount > primaryNodeCount/2 looks simpler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restartablePrimaryNodeCount > primaryNodeCount/2 cannot work well. If restartablePrimaryNodeCount=2 and primaryNodeCount=4, we will return false which is wrong.

@cchen-vertica cchen-vertica merged commit 9de7c05 into vnext Jul 16, 2024
31 checks passed
@cchen-vertica cchen-vertica deleted the cchen/fix-openshift-issues branch July 16, 2024 16:58
cchen-vertica added a commit that referenced this pull request Jul 17, 2024
This PR fixed re-ip error when restart the cluster by checking all pods
are up and NMA is running in all pods. Without the check, vclusterOps
re-ip could fail when NMA is not running or half of the primary nodes
are not up.
cchen-vertica added a commit that referenced this pull request Jul 17, 2024
This PR fixed re-ip error when restart the cluster by checking all pods
are up and NMA is running in all pods. Without the check, vclusterOps
re-ip could fail when NMA is not running or half of the primary nodes
are not up.
cchen-vertica added a commit that referenced this pull request Jul 24, 2024
This PR fixed re-ip error when restart the cluster by checking all pods
are up and NMA is running in all pods. Without the check, vclusterOps
re-ip could fail when NMA is not running or half of the primary nodes
are not up.
cchen-vertica added a commit that referenced this pull request Jul 24, 2024
This PR fixed re-ip error when restart the cluster by checking all pods
are up and NMA is running in all pods. Without the check, vclusterOps
re-ip could fail when NMA is not running or half of the primary nodes
are not up.
cchen-vertica added a commit that referenced this pull request Jul 24, 2024
This PR fixed re-ip error when restart the cluster by checking all pods
are up and NMA is running in all pods. Without the check, vclusterOps
re-ip could fail when NMA is not running or half of the primary nodes
are not up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants