Remove node self-deletion behavior on GCP and DO

* Node self-deletion is now only performed on AWS clusters
* Differentiating an unhealthy/NotReady node from a node that has shut down (temporary reboot or permanent preemption) is tricky. To cut down on noisy alerts, we've favored having nodes delete themselves when shut down gracefully.
* Though node self-deletion is useful, we may now be able to remove this behavior on some platforms to alert in more cases where an alert is warranted
* On DigitalOcean, there is no managed instance group / ASG to replace a deleted node. Losing a node for a long enough period (reboots are fine) probably merits an alert, which is not sent today with the self-deletion design. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete the nodes that are removed.
* On Google Cloud, node self-deletion prevents alerting when a significant number of nodes are preempted. Fortunately, Google Cloud keeps a node's name consistent between preemptions, so reboots and the daily preemption shouldn't trigger an alert. A node that's preempted and doesn't get replaced for a long enough period does merit an alert. As a tradeoff, admins performing a terraform scale-down must use kubectl to manually delete the nodes that are removed.
* On AWS, spot instances lack Google Cloud's consistent naming feature. We must keep the node self-deletion behavior to keep preempted spot instances from accumulating and causing an alert. Self-deletion does mean that when many spot workers are preempted, no alert is sent (undesired).
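
The manual cleanup described above for DigitalOcean and Google Cloud could look roughly like the sketch below. It is not part of this commit: `not_ready_nodes` is a hypothetical helper name, and it assumes the default `kubectl get nodes` table layout (a header row, then NAME in the first column and STATUS in the second).

```shell
#!/bin/bash
# Hypothetical helper (not part of Typhoon): reads `kubectl get nodes` output
# on stdin and prints the names of nodes whose STATUS starts with NotReady.
# Assumes the default table layout: header row, NAME in column 1, STATUS in column 2.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1 }'
}

# Example usage after a terraform scale-down (review the list before deleting):
#   kubectl get nodes | not_ready_nodes
#   kubectl delete node NODE_NAME
```

Listing candidates separately from deleting them keeps the final `kubectl delete node` an explicit admin decision, since a NotReady node may simply be rebooting.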
poseidon · May 17, 2018 · ab03046
1 parent 9ac7b06
commit ab03046
Showing 3 changed files with 12 additions and 56 deletions.
diff --git a/CHANGES.md b/CHANGES.md
@@ -25,6 +25,18 @@ Notable changes between versions.
   * Use Calico's default "first-found" to support single NIC and bonded NIC nodes 
   * Allow [alternative](https://docs.projectcalico.org/v3.1/reference/node/configuration#ip-autodetection-methods) methods for multi NIC nodes, like `can-reach=IP` or `interface=REGEX`
 
+#### DigitalOcean
+
+* Discontinue worker self-deletion on graceful shutdown ([#207](https://github.com/poseidon/typhoon/pull/207))
+  * Leave worker nodes registered during reboots to alert in additional scenarios
+  * As a tradeoff, scale-down requires an admin unregister (e.g. `kubectl delete`) nodes
+
+#### Google Cloud
+
+* Discontinue worker self-deletion on graceful shutdown ([#207](https://github.com/poseidon/typhoon/pull/207))
+  * Leave worker nodes registered during reboots to alert in additional scenarios
+  * As a tradeoff, scale-down requires an admin unregister (e.g. `kubectl delete`) nodes
+
 #### Addons
 
 * Fix Prometheus data directory location ([#203](https://github.com/poseidon/typhoon/pull/203))

diff --git a/digital-ocean/container-linux/kubernetes/cl/worker.yaml.tmpl b/digital-ocean/container-linux/kubernetes/cl/worker.yaml.tmpl
@@ -79,18 +79,6 @@ systemd:
         RestartSec=5
         [Install]
         WantedBy=multi-user.target
-    - name: delete-node.service
-      enable: true
-      contents: |
-        [Unit]
-        Description=Waiting to delete Kubernetes node on shutdown
-        [Service]
-        Type=oneshot
-        RemainAfterExit=true
-        ExecStart=/bin/true
-        ExecStop=/etc/kubernetes/delete-node
-        [Install]
-        WantedBy=multi-user.target
 storage:
   files:
     - path: /etc/kubernetes/kubelet.env
@@ -105,19 +93,3 @@ storage:
       contents:
         inline: |
           fs.inotify.max_user_watches=16184
-    - path: /etc/kubernetes/delete-node
-      filesystem: root
-      mode: 0744
-      contents:
-        inline: |
-          #!/bin/bash
-          set -e
-          exec /usr/bin/rkt run \
-            --trust-keys-from-https \
-            --volume config,kind=host,source=/etc/kubernetes \
-            --mount volume=config,target=/etc/kubernetes \
-            --insecure-options=image \
-            docker://k8s.gcr.io/hyperkube:v1.10.2 \
-            --net=host \
-            --dns=host \
-            --exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)
diff --git a/google-cloud/container-linux/kubernetes/workers/cl/worker.yaml.tmpl b/google-cloud/container-linux/kubernetes/workers/cl/worker.yaml.tmpl
@@ -68,18 +68,6 @@ systemd:
         RestartSec=5
         [Install]
         WantedBy=multi-user.target
-    - name: delete-node.service
-      enable: true
-      contents: |
-        [Unit]
-        Description=Waiting to delete Kubernetes node on shutdown
-        [Service]
-        Type=oneshot
-        RemainAfterExit=true
-        ExecStart=/bin/true
-        ExecStop=/etc/kubernetes/delete-node
-        [Install]
-        WantedBy=multi-user.target
 storage:
   files:
     - path: /etc/kubernetes/kubeconfig
@@ -100,22 +88,6 @@ storage:
       contents:
         inline: |
           fs.inotify.max_user_watches=16184
-    - path: /etc/kubernetes/delete-node
-      filesystem: root
-      mode: 0744
-      contents:
-        inline: |
-          #!/bin/bash
-          set -e
-          exec /usr/bin/rkt run \
-            --trust-keys-from-https \
-            --volume config,kind=host,source=/etc/kubernetes \
-            --mount volume=config,target=/etc/kubernetes \
-            --insecure-options=image \
-            docker://k8s.gcr.io/hyperkube:v1.10.2 \
-            --net=host \
-            --dns=host \
-            --exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)
 passwd:
   users:
     - name: core