
Control pod deletion behaviour via annotation #112

Merged · 1 commit · Apr 27, 2024

Conversation

koikonom
Contributor

This change introduces the `operator.redpanda.com/decommission-on-delete` annotation, which tells the operator to decommission a Redpanda node when its pod gets deleted.

Setting this annotation allows us to upgrade the k8s worker nodes of a cluster without having to worry about launching or deleting instances ourselves, because we can rely on the cloud provider's native upgrade functionality.

In essence, we set the annotation and then let the cloud provider launch a new k8s worker node and drain the old one. Since the Redpanda pod is now decommissioned when it gets deleted, the replacement pod that gets scheduled is no longer stuck in the Pending state due to PVC attachments on the old k8s worker node.
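To enable the behaviour, the annotation is set on the Cluster custom resource, e.g. with kubectl annotate or programmatically. The Go sketch below is illustrative only and not part of this change; it uses controller-runtime's client with an unstructured object, and the redpanda.vectorized.io/v1alpha1 group/version and the namespace/name used are assumptions, not values defined by this PR.

package main

import (
    "context"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
    ctx := context.Background()

    c, err := client.New(config.GetConfigOrDie(), client.Options{})
    if err != nil {
        panic(err)
    }

    // Fetch the Cluster resource as an unstructured object so we don't need
    // to import the operator's API types. The GVK here is an assumption.
    cluster := &unstructured.Unstructured{}
    cluster.SetGroupVersionKind(schema.GroupVersionKind{
        Group:   "redpanda.vectorized.io",
        Version: "v1alpha1",
        Kind:    "Cluster",
    })
    if err := c.Get(ctx, client.ObjectKey{Namespace: "redpanda", Name: "redpanda"}, cluster); err != nil {
        panic(err)
    }

    // Add the annotation and patch the object back.
    base := cluster.DeepCopy()
    anns := cluster.GetAnnotations()
    if anns == nil {
        anns = map[string]string{}
    }
    anns["operator.redpanda.com/decommission-on-delete"] = "true"
    cluster.SetAnnotations(anns)

    if err := c.Patch(ctx, cluster, client.MergeFrom(base)); err != nil {
        panic(err)
    }
}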

@koikonom koikonom force-pushed the decomm_on_delete branch 4 times, most recently from ff8e23e to 15749c1 Compare April 11, 2024 11:55
@@ -334,6 +336,17 @@ func (r *ClusterReconciler) handlePodFinalizer(
if err != nil {
return fmt.Errorf("unable to fetch PodList: %w", err)
}

var decommissionOnDelete bool
decommissionOnDeleteVal, ok := rp.Annotations[DecommissionOnDeleteAnnotation]
Contributor

Can you show me the effect of this being true? I checked and this seems like a new boolean, but I found no usages, so how does decommissioning get triggered now?

Contributor Author

We convert the value to a boolean (line 343), and then on line 377, if the value is true, we don't just restart the pod; instead we decommission the Redpanda node and delete the PVC as well.
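A condensed sketch of that flow (not the exact diff; restartPod, decommissionBroker, and deletePVC are placeholder names for whatever the reconciler does in each branch):

decommissionOnDelete := false
if v, ok := rp.Annotations[DecommissionOnDeleteAnnotation]; ok {
    b, err := strconv.ParseBool(v)
    if err != nil {
        return fmt.Errorf("parsing %s: %w", DecommissionOnDeleteAnnotation, err)
    }
    decommissionOnDelete = b
}

if untainted && !decommissionOnDelete {
    // The node looks healthy and the annotation is unset: let the pod
    // restart in place and keep its PVC.
    return restartPod(ctx, pod)
}

// Otherwise treat the deletion as permanent: decommission the Redpanda
// node and remove its PVC so the replacement pod can schedule elsewhere.
if err := decommissionBroker(ctx, pod); err != nil {
    return err
}
return deletePVC(ctx, pod)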

@0xdiba 0xdiba left a comment

Looks good; I have a discussion point on operational safety where we might need to consider changing the implementation.

This implements and closes redpanda-data/redpanda#10409

@@ -361,7 +374,7 @@ func (r *ClusterReconciler) handlePodFinalizer(
untainted = false
}
}
if untainted {
if untainted && !decommissionOnDelete {

I'm thinking we should probably make this safer based on the description/use-case of your PR and the scenarios at hand.

Say we've set the annotation on the cluster (i.e. for an upgrade that takes hours), and let's consider the following scenarios:

  1. During the process, a human operator sets a configuration (one requiring a restart) using rpk on the cluster. That would trigger the operator to do a whole-cluster restart. Would that delete and recreate all of the PVCs sequentially?
  2. Say that, due to an update/patch error or the urgency of an incident, the annotation is left on the cluster. Would that trigger PVC recreation on every pod deletion (e.g. restarts, upgrades, etc.)?

Based on the use cases in the PR (upgrades) and the fact that the main mechanism used there is node draining, would it make sense to keep and extend the current logic and just respect the NoSchedule taint when the annotation is set?
I.e. adding a new case to the for-loop directly above:

if decommissionOnDelete && taint.Effect == corev1.TaintEffectNoSchedule {
    untainted = false
}

That would give us more confidence that we don't decommission nodes that aren't actually going away and end up transferring data "by mistake".
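For context, the proposed case would slot into the existing taint loop roughly like this (a sketch only; the loop's existing checks are abridged):

untainted := true
for _, taint := range node.Spec.Taints {
    // ... existing taint checks that already set untainted = false ...

    // Proposed addition: only treat the pod as permanently gone when the
    // annotation is set and the node carries a NoSchedule taint, i.e. it
    // is cordoned or being drained.
    if decommissionOnDelete && taint.Effect == corev1.TaintEffectNoSchedule {
        untainted = false
    }
}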

@koikonom koikonom force-pushed the decomm_on_delete branch 5 times, most recently from 855d0d0 to 12086b8 Compare April 18, 2024 12:44
This change introduces the `operator.redpanda.com/decommission-on-delete`
annotation, which tells the operator to decommission a node when its pod
gets deleted.

Setting this annotation allows us to upgrade the k8s worker nodes of a
cluster without having to worry about launching or deleting instances
ourselves, because we can rely on the cloud provider's native upgrade
functionality.

In essence, we set the annotation and then let the cloud provider launch
a new k8s worker node and drain the old one. Since the Redpanda pod is
now decommissioned when it gets deleted, the replacement pod that gets
scheduled is no longer stuck in the Pending state due to PVC attachments
on the old k8s worker node.
@koikonom koikonom marked this pull request as ready for review April 27, 2024 08:29
@RafalKorepta RafalKorepta merged commit 12bbab4 into main Apr 27, 2024
5 checks passed
RafalKorepta added a commit to redpanda-data/helm-charts that referenced this pull request May 1, 2024
What's Changed
* Remove cluster to redpanda migration tool by @RafalKorepta in redpanda-data/redpanda-operator#120
* Control pod deletion behaviour via annotation by @koikonom in redpanda-data/redpanda-operator#112
* Add Tiered Storage config options for Azure by @JakeSCahill in redpanda-data/redpanda-operator#127

New Contributors
* @koikonom made their first contribution in redpanda-data/redpanda-operator#112

**Full Changelog**: redpanda-data/redpanda-operator@v2.1.17-23.3.11...v2.1.18-23.3.13
@RafalKorepta RafalKorepta deleted the decomm_on_delete branch May 23, 2024 09:24
RafalKorepta added a commit that referenced this pull request Jun 18, 2024
As there is a 1 hour limit on cloud provider node pool migration, which
is not enough for some Redpanda clusters to finish decommissioning, the
decommission-on-delete feature will be removed from the Redpanda
operator.

Reference

#112
RafalKorepta added a commit that referenced this pull request Jun 19, 2024
The previous usage of finalizer handlers was unreliable when a Kubernetes
Node's ready status was flapping. Local SSD disks attached to a Redpanda
Pod prevent rescheduling, as the Persistent Volume affinity binds the Pod
to a single Node. If a Kubernetes Node came back to life, the Cluster
controller could already have deleted the Redpanda data (PVC deletion and
Redpanda decommissioning). If that particular Redpanda node hosted a
single-replica partition, this would mean data loss.

If the majority of Redpanda processes ran on unstable Kubernetes Nodes,
the Redpanda operator could break the whole cluster by losing Raft
quorum.

Reference

#112
redpanda-data/redpanda#6942