Remove node self-deletion behavior on GCP and DO #207

dghubble · 2018-05-08T04:04:35Z

* Node self-deletion is now only performed on AWS clusters.
* Differentiating an unhealthy/NotReady node from a node that has shut down (temporary reboot or permanent preemption) is tricky. To cut down on noisy alerts, we've previously favored having nodes delete themselves when shut down gracefully.
* Though node self-deletion is useful, we may now be able to remove this behavior on some platforms, so that alerts fire in more of the cases where an alert is warranted.
* On Digital Ocean, there is no managed instance group / ASG to replace a deleted node. Losing a node for a long enough period (reboots are fine) probably merits an alert, and that alert is not sent today with the self-deletion design. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete the nodes that were removed.
* On Google Cloud, node self-deletion prevents alerting when a significant number of nodes are preempted. Fortunately, Google Cloud keeps a node's name consistent between preemptions, so reboots and the daily preemption shouldn't trigger an alert. A node that's preempted and doesn't get replaced for a long enough period does merit an alert. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete the nodes that were removed.
* On AWS, spot instances lack Google Cloud's consistent naming feature. We must keep the node self-deletion behavior to prevent preempted spot instances from accumulating and causing an alert (preemption and replacement is normal and fine). Self-deletion does mean that when many spot workers are preempted, no alert will be sent (undesired).
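To make the alerting tradeoff above concrete, here is a minimal sketch of the kind of check being described: flag a node only once it has been NotReady for longer than some grace period, so that short reboots stay quiet. This is not the project's actual alerting rule. The `NodeStatus` type, the `nodes_meriting_alert` function, the node names, and the 15-minute grace period are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one registered node."""
    name: str
    ready: bool
    last_transition: datetime  # when the Ready condition last changed


def nodes_meriting_alert(
    nodes: List[NodeStatus],
    now: datetime,
    grace: timedelta = timedelta(minutes=15),  # assumed threshold, not from the PR
) -> List[str]:
    """Return names of nodes that have been NotReady for longer than `grace`."""
    return [
        n.name
        for n in nodes
        if not n.ready and now - n.last_transition > grace
    ]


now = datetime(2018, 5, 8, 12, 0)
nodes = [
    NodeStatus("worker-0", True, now - timedelta(hours=5)),     # healthy
    NodeStatus("worker-1", False, now - timedelta(minutes=3)),  # brief reboot
    NodeStatus("worker-2", False, now - timedelta(hours=2)),    # lost, not replaced
]
print(nodes_meriting_alert(nodes, now))
```

Under this sketch, a node that deletes itself on shutdown never appears in `nodes` at all, which is why self-deletion suppresses the alert. Where a replacement registers under a new name, as with AWS spot instances, the old NotReady entries would accumulate in the list and be flagged even though replacement is normal.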


dghubble · 2018-05-18T03:04:38Z

I just can't really overlook the difficulty this presents for true autoscaling cases, either offered by the cloud provider or via a privileged agent.

dghubble added platform/aws platform/google-cloud platform/digital-ocean labels May 8, 2018

dghubble force-pushed the dont-delete-on-shutdown branch 2 times, most recently from 54ef93d to 58e3b1e Compare May 10, 2018 08:54

dghubble mentioned this pull request May 10, 2018: Fedora Atomic tracking issue #200 (Closed, 7 tasks)

dghubble force-pushed the dont-delete-on-shutdown branch from 58e3b1e to b36da9c Compare May 10, 2018 09:17

dghubble removed the platform/aws label May 10, 2018

dghubble force-pushed the dont-delete-on-shutdown branch from b36da9c to c73cde5 Compare May 16, 2018 06:57

dghubble force-pushed the dont-delete-on-shutdown branch from c73cde5 to ab03046 Compare May 17, 2018 03:30

dghubble closed this May 18, 2018

dghubble deleted the dont-delete-on-shutdown branch May 18, 2018 03:04
