Remove node self-deletion behavior on GCP and DO
* Node self-deletion is now only performed on AWS clusters
* Differentiating an unhealthy/NotReady node from a node that has shut down (temporary reboot or permanent preemption) is tricky. To cut down on noisy alerts, we've favored having nodes delete themselves when shut down gracefully.
* Though node self-deletion is useful, we may now be able to remove this behavior on some platforms so that alerts fire in more of the cases where an alert is warranted
* On Digital Ocean, there is no managed instance group / ASG to replace a deleted node. Losing a node for a long enough period (reboots are fine) probably merits an alert, which is not sent today with the self-deletion design. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete nodes that are removed.
* On Google Cloud, node self-deletion prevents alerting when a significant number of nodes are preempted. Fortunately, Google Cloud keeps a node's name consistent between preemptions, so reboots and the daily preemption shouldn't trigger an alert. A node that's preempted and doesn't get replaced for a long enough period does merit an alert. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete nodes that are removed.
* On AWS, spot instances lack Google Cloud's consistent naming feature. We must keep using the node self-deletion behavior to keep preempted spot instances from accumulating and causing an alert. Self-deletion does mean that when many spot workers are preempted, no alert will be sent (undesired).
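
The manual cleanup after a terraform scale-down on Digital Ocean or Google Cloud could be sketched roughly as below. This is only an illustration, not part of this change: the helper names and the example node names are hypothetical, and it assumes the admin already has the list of node names registered in the cluster and the list of instances that remain after the scale-down. The only real command it relies on is `kubectl delete node <name>`.

```python
def nodes_to_delete(registered, remaining):
    """Return registered node names that no longer have a backing instance.

    registered: node names currently known to the cluster (assumed input).
    remaining:  instance names that still exist after the scale-down (assumed input).
    """
    return sorted(set(registered) - set(remaining))


def kubectl_delete_commands(names):
    """Build one `kubectl delete node <name>` argument list per stale node."""
    return [["kubectl", "delete", "node", name] for name in names]


if __name__ == "__main__":
    # Hypothetical names: a pool scaled down from three workers to one.
    registered = ["worker-0", "worker-1", "worker-2"]
    remaining = ["worker-0"]
    for cmd in kubectl_delete_commands(nodes_to_delete(registered, remaining)):
        # Printed for review rather than executed; an admin would run these
        # (or pass each list to subprocess.run) once the list looks right.
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the deletion an explicit admin step, which matches the tradeoff described above.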