
Do not decommission when node is misbehaving #158

Closed

Conversation

RafalKorepta
Contributor

e8ef83d Revert decommission on delete feature

Because cloud provider node pool migration has a 1-hour limit, which is not enough for some Redpanda clusters to finish decommissioning, the decommission-on-delete feature will be removed from the Redpanda operator.

ad31401 Do not decommission when Node has taints

Continue with Pod finalizer handling if the allowPVCDeletion flag is not enabled, or when the K8S Node does not report a NoExecute taint effect with the NodeUnreachable key.

The previous implementation would decommission Redpanda Pods whenever the Node reported the Unreachable taint with the NoExecute effect.

When a K8S Node has brief problems, an operator with the allow-pvc-deletion flag disabled would decommission the Redpanda Pod. This creates a problem: the Pod is recreated on the same K8S Node, but its Redpanda Node ID has already been decommissioned and cannot participate in the Raft quorum of the controller log.
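The gating condition described above can be sketched as follows. This is a minimal illustration with simplified local types, not the operator's actual code; `Taint` stands in for `corev1.Taint`, and `shouldSkipDecommission` is a hypothetical helper that mirrors the literal condition in the commit message (skip the decommission path, i.e. continue plain Pod finalizer handling, when the flag is disabled or the unreachable taint is absent).

```go
package main

import "fmt"

// Taint mirrors the relevant fields of a Kubernetes node taint
// (a simplified stand-in for corev1.Taint).
type Taint struct {
	Key    string
	Effect string
}

// shouldSkipDecommission is a hypothetical helper: it reports whether the
// operator should skip decommissioning and fall through to plain Pod
// finalizer handling. Per the commit message, that happens when the
// allowPVCDeletion flag is not enabled, or when the Node does not report a
// NoExecute taint with the node.kubernetes.io/unreachable key.
func shouldSkipDecommission(allowPVCDeletion bool, taints []Taint) bool {
	if !allowPVCDeletion {
		return true
	}
	for _, t := range taints {
		if t.Key == "node.kubernetes.io/unreachable" && t.Effect == "NoExecute" {
			// Unreachable taint present: the decommission path applies.
			return false
		}
	}
	return true
}

func main() {
	unreachable := []Taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}}
	fmt.Println(shouldSkipDecommission(false, unreachable)) // true: flag disabled, plain finalizer handling
	fmt.Println(shouldSkipDecommission(true, nil))          // true: no unreachable taint reported
	fmt.Println(shouldSkipDecommission(true, unreachable))  // false: flag enabled and taint present
}
```

The key change is that the flag check now short-circuits the taint check, so an operator running with allow-pvc-deletion disabled never enters the decommission path, no matter what taints the Node reports.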

Reference

PR that added decommission-on-delete behaviour
#112

PR that adds finalizer to Redpanda Pods
redpanda-data/redpanda#6942

@CLAassistant

CLAassistant commented Jun 18, 2024

CLA assistant check
All committers have signed the CLA.

@chrisseto
Contributor

I'd like to think about this a bit more before we commit to it. I worry that we'll be making more work for cloud in the long run. It's true that the operator can overreact and cause problems, but it can also underreact and cause problems.

I've been wondering if adding a timeout to the NoExecute check would be a sufficient mitigation.

Though I'm +1 on removing the decommission on delete feature as I don't think it's currently used.

@RafalKorepta
Contributor Author

After a conversation with @chrisseto I removed even more code in #160, as finalizers do not help if we don't act upon the deletion timestamp.
