
Do not decommission when node is misbehaving #158

Closed

Conversation

RafalKorepta
Contributor

e8ef83d Revert decommission on delete feature

Because cloud provider node pool migration has a 1-hour limit, which is not enough for some Redpanda clusters to finish decommissioning, the decommission-on-delete feature will be removed from the Redpanda operator.

ad31401 Do not decommission when Node has taints

Continue with Pod finalizer handling if the allowPVCDeletion flag is not enabled, or when the K8S Node does not report a NoExecute taint effect with the NodeUnreachable key.

The previous implementation would decommission Redpanda Pods whenever the Node reported the Unreachable taint with the NoExecute effect.

When a K8S Node has brief problems, an operator with the allow-pvc-deletion flag disabled would decommission the Redpanda Pod. This creates a problem: the Pod is recreated on the same K8S Node, but its Redpanda Node ID has already been decommissioned and cannot participate in the Raft quorum of the controller log.
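The gating condition described above can be sketched as follows. This is a minimal illustration with simplified local types, not the operator's actual code; `Taint` stands in for `corev1.Taint`, and `shouldSkipDecommission` is a hypothetical helper that mirrors the literal condition in the commit message (skip the decommission path, i.e. continue plain Pod finalizer handling, when the flag is disabled or the unreachable taint is absent).

```go
package main

import "fmt"

// Taint mirrors the relevant fields of a Kubernetes node taint
// (a simplified stand-in for corev1.Taint).
type Taint struct {
	Key    string
	Effect string
}

// shouldSkipDecommission is a hypothetical helper: it reports whether the
// operator should skip decommissioning and fall through to plain Pod
// finalizer handling. Per the commit message, that happens when the
// allowPVCDeletion flag is not enabled, or when the Node does not report a
// NoExecute taint with the node.kubernetes.io/unreachable key.
func shouldSkipDecommission(allowPVCDeletion bool, taints []Taint) bool {
	if !allowPVCDeletion {
		return true
	}
	for _, t := range taints {
		if t.Key == "node.kubernetes.io/unreachable" && t.Effect == "NoExecute" {
			// Unreachable taint present: the decommission path applies.
			return false
		}
	}
	return true
}

func main() {
	unreachable := []Taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}}
	fmt.Println(shouldSkipDecommission(false, unreachable)) // true: flag disabled, plain finalizer handling
	fmt.Println(shouldSkipDecommission(true, nil))          // true: no unreachable taint reported
	fmt.Println(shouldSkipDecommission(true, unreachable))  // false: flag enabled and taint present
}
```

The key change is that the flag check now short-circuits the taint check, so an operator running with allow-pvc-deletion disabled never enters the decommission path, no matter what taints the Node reports.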

Reference

PR that added decommission-on-delete behaviour
#112

PR that adds finalizer to Redpanda Pods
redpanda-data/redpanda#6942

@CLAassistant

CLAassistant commented Jun 18, 2024

CLA assistant check
All committers have signed the CLA.

@chrisseto
Contributor

I'd like to think about this a bit more before we commit to it. I worry that we'll be making more work for cloud in the long run. It's true that the operator can overreact and cause problems, but it can also underreact and cause problems.

I've been wondering if adding a timeout to the NoExecute check would be a sufficient mitigation.

Though I'm +1 on removing the decommission on delete feature as I don't think it's currently used.

@RafalKorepta
Contributor Author

After a conversation with @chrisseto I removed even more code in #160, as finalizers do not help if we don't act upon the deletion timestamp.
