Remove node self-deletion behavior on GCP and DO #207

Closed
wants to merge 1 commit into from

Conversation

dghubble
Member

@dghubble dghubble commented May 8, 2018

  • Node self-deletion is now only performed on AWS clusters
  • Differentiating an unhealthy/NotReady node from a node that has shut down (temporary reboot or permanent preemption) is tricky. To cut down on noisy alerts, we've previously favored having nodes delete themselves when they shut down gracefully.
  • Though node self-deletion is useful, we may now be able to remove this behavior on some platforms to alert in more cases where an alert is warranted
  • On Digital Ocean, there is no managed instance group / ASG to replace a deleted node. Losing a node for a long enough period (reboots are fine) probably merits an alert that's not sent today with the self-deletion design. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete nodes that are removed.
  • On Google Cloud, node self-deletion prevents alerting when a significant number of nodes are preempted. Fortunately, Google Cloud uses consistent naming of a node between preemptions so reboots and the daily preemption shouldn't trigger an alert. A node that's preempted and doesn't get replaced for a long enough period does merit an alert. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete nodes that are removed.
  • On AWS, spot instances lack Google Cloud's consistent naming feature. We must keep using the node self-deletion behavior to keep preempted spot instances from accumulating and causing an alert (preemption and replacement are normal and fine). Self-deletion does mean that when many spot workers are preempted, no alert is sent (undesired).
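
The alerting tradeoff described above can be illustrated with a minimal sketch. It assumes a hypothetical `NodeStatus` record and a hypothetical 15-minute grace period; neither comes from this PR or from any particular alerting setup. A node that reboots, or is preempted and comes back under the same name, returns to Ready within the grace period and raises no alert. A node that stays NotReady for longer does. A node that deleted itself is simply absent from the list, so it can never alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class NodeStatus:
    """Hypothetical snapshot of a registered node (illustration only)."""
    name: str
    ready: bool
    last_ready: datetime  # last time the node reported Ready


def nodes_to_alert(
    nodes: List[NodeStatus],
    now: datetime,
    grace: timedelta = timedelta(minutes=15),  # assumed threshold
) -> List[str]:
    """Return names of nodes that have been NotReady longer than `grace`."""
    return [
        node.name
        for node in nodes
        if not node.ready and now - node.last_ready > grace
    ]


now = datetime(2018, 5, 8, 12, 0)
nodes = [
    NodeStatus("worker-0", True, now),                          # healthy
    NodeStatus("worker-1", False, now - timedelta(minutes=2)),  # rebooting
    NodeStatus("worker-2", False, now - timedelta(hours=1)),    # never replaced
]
print(nodes_to_alert(nodes, now))
```

With self-deletion in place, `worker-2` would have removed its own node object on graceful shutdown and would not appear in `nodes` at all, which is the missed alert this PR is concerned with.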

@dghubble dghubble force-pushed the dont-delete-on-shutdown branch 2 times, most recently from 54ef93d to 58e3b1e Compare May 10, 2018 08:54
@dghubble dghubble mentioned this pull request May 10, 2018
7 tasks
@dghubble dghubble force-pushed the dont-delete-on-shutdown branch from 58e3b1e to b36da9c Compare May 10, 2018 09:17
@dghubble dghubble force-pushed the dont-delete-on-shutdown branch from b36da9c to c73cde5 Compare May 16, 2018 06:57
* Node self-deletion is now only performed on AWS clusters
* Differentiating an unhealthy/NotReady node from a node that
has shut down (temporary reboot or permanent preemption) is
tricky. To cut down on noisy alerts, we've favored having nodes
delete themselves when they shut down gracefully.
* Though node self-deletion is useful, we may now be able to
remove this behavior on some platforms to alert in more cases
where an alert is warranted
* On Digital Ocean, there is no managed instance group / ASG to
replace a deleted node. Losing a node for a long enough period
(reboots are fine) probably merits an alert that's not sent today
with the self-deletion design. As a tradeoff, admins performing
a terraform scale-down must use kubectl to manually delete nodes
that are removed.
* On Google Cloud, node self-deletion prevents alerting when a
significant number of nodes are preempted. Fortunately, Google
Cloud uses consistent naming of a node between preemptions so
reboots and the daily preemption shouldn't trigger an alert.
A node that's preempted and doesn't get replaced for a long
enough period does merit an alert. As a tradeoff, admins
performing a terraform scale-down must use kubectl to manually
delete nodes that are removed.
* On AWS, spot instances lack Google Cloud's consistent naming
feature. We must keep using the node self-deletion behavior to
keep preempted spot instances from accumulating and causing
an alert. Self-deletion does mean that when many spot workers
are preempted, no alert is sent (undesired).
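
The manual cleanup after a terraform scale-down amounts to deleting every registered node that is no longer in the desired set. A minimal sketch, assuming hypothetical node names and helper functions (`stale_nodes` and `delete_commands` are illustrative, not part of this PR); it only builds the `kubectl delete node` commands and does not run them:

```python
from typing import Iterable, List


def stale_nodes(registered: Iterable[str], desired: Iterable[str]) -> List[str]:
    """Return registered node names that are no longer desired, sorted."""
    desired_set = set(desired)
    return sorted(name for name in registered if name not in desired_set)


def delete_commands(registered: Iterable[str], desired: Iterable[str]) -> List[str]:
    """Build one `kubectl delete node` command per stale node."""
    return [f"kubectl delete node {name}" for name in stale_nodes(registered, desired)]


# Hypothetical names: the cluster was scaled from three workers down to two.
registered = ["worker-0", "worker-1", "worker-2"]
desired = ["worker-0", "worker-1"]
for command in delete_commands(registered, desired):
    print(command)
```

Printing the commands instead of executing them leaves the admin a chance to review which nodes will be removed.
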
@dghubble dghubble force-pushed the dont-delete-on-shutdown branch from c73cde5 to ab03046 Compare May 17, 2018 03:30
@dghubble
Member Author

I just can't really overlook the difficulty this presents for true autoscaling cases, either offered by the cloud provider or via a privileged agent.

@dghubble dghubble closed this May 18, 2018
@dghubble dghubble deleted the dont-delete-on-shutdown branch May 18, 2018 03:04