Remove node self-deletion behavior on GCP and DO
* Node self-deletion is now only performed on AWS clusters
* Differentiating an unhealthy/NotReady node from a node that
has shut down (temporary reboot or permanent preemption) is
tricky. To cut down on noisy alerts, we've favored having nodes
delete themselves when shut down gracefully.
* Though node self-deletion is useful, we may now be able to
remove this behavior on some platforms to alert in more cases
where an alert is warranted
* On DigitalOcean, there is no managed instance group / ASG to
replace a deleted node. Losing a node for a long enough period
(reboots are fine) probably merits an alert that's not sent today
with the self-deletion design. As a tradeoff, admins performing
a terraform scale-down must use kubectl to manually delete nodes
that are removed.
* On Google Cloud, node self-deletion prevents alerting when a
significant number of nodes are preempted. Fortunately, Google
Cloud uses consistent naming of a node between preemptions so
reboots and the daily preemption shouldn't trigger an alert.
A node that's preempted and doesn't get replaced for a long
enough period does merit an alert. As a tradeoff, admins
performing a terraform scale-down must use kubectl to manually
delete nodes that are removed.
* On AWS, spot instances lack Google Cloud's consistent naming
feature. We must keep using the node self-deletion behavior to
keep preempted spot instances from accumulating and causing
an alert. Self-deletion does mean that when many spot workers
are preempted, no alert is sent (undesired).
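
The scale-down tradeoff described for DigitalOcean and Google Cloud can be handled with a small helper. The following is a minimal sketch, not part of this commit: `stale_nodes` is a hypothetical function name, and `desired.txt` in the usage comment is an assumed file the admin maintains. It prints every registered node name that is missing from the desired set, so the result can be reviewed before anything is passed to `kubectl delete node`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Typhoon): list nodes that are still
# registered with Kubernetes but should no longer exist after a scale-down.
#
# $1: registered node names, one per line
# $2: node names that should remain, one per line
stale_nodes() {
  printf '%s\n' "$1" | while IFS= read -r node; do
    # Skip blank lines in the input.
    [ -n "$node" ] || continue
    # -x matches the whole line, -F treats the name as a fixed string,
    # so "worker-1" is not confused with "worker-10".
    if ! printf '%s\n' "$2" | grep -qxF -- "$node"; then
      printf '%s\n' "$node"
    fi
  done
}

# Example usage (assumes kubectl is configured for the cluster and that
# desired.txt lists every node that should remain, controllers included):
#
#   registered=$(kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name)
#   stale_nodes "$registered" "$(cat desired.txt)"
#   kubectl delete node NAME   # for each name printed above, after review
```

Keeping the comparison separate from the deletion is deliberate: a node that is only rebooting or briefly preempted should stay registered, so the list is worth checking by hand before deleting.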
dghubble committed May 17, 2018
1 parent 9ac7b06 commit ab03046
Showing 3 changed files with 12 additions and 56 deletions.
12 changes: 12 additions & 0 deletions CHANGES.md
@@ -25,6 +25,18 @@ Notable changes between versions.
 * Use Calico's default "first-found" to support single NIC and bonded NIC nodes
 * Allow [alternative](https://docs.projectcalico.org/v3.1/reference/node/configuration#ip-autodetection-methods) methods for multi NIC nodes, like `can-reach=IP` or `interface=REGEX`
 
+#### DigitalOcean
+
+* Discontinue worker self-deletion on graceful shutdown ([#207](https://github.com/poseidon/typhoon/pull/207))
+* Leave worker nodes registered during reboots to alert in additional scenarios
+* As a tradeoff, scale-down requires an admin unregister (e.g. `kubectl delete`) nodes
+
+#### Google Cloud
+
+* Discontinue worker self-deletion on graceful shutdown ([#207](https://github.com/poseidon/typhoon/pull/207))
+* Leave worker nodes registered during reboots to alert in additional scenarios
+* As a tradeoff, scale-down requires an admin unregister (e.g. `kubectl delete`) nodes
+
 #### Addons
 
 * Fix Prometheus data directory location ([#203](https://github.com/poseidon/typhoon/pull/203))
28 changes: 0 additions & 28 deletions digital-ocean/container-linux/kubernetes/cl/worker.yaml.tmpl
@@ -79,18 +79,6 @@ systemd:
         RestartSec=5
         [Install]
         WantedBy=multi-user.target
-    - name: delete-node.service
-      enable: true
-      contents: |
-        [Unit]
-        Description=Waiting to delete Kubernetes node on shutdown
-        [Service]
-        Type=oneshot
-        RemainAfterExit=true
-        ExecStart=/bin/true
-        ExecStop=/etc/kubernetes/delete-node
-        [Install]
-        WantedBy=multi-user.target
 storage:
   files:
     - path: /etc/kubernetes/kubelet.env
@@ -105,19 +93,3 @@ storage:
       contents:
         inline: |
           fs.inotify.max_user_watches=16184
-    - path: /etc/kubernetes/delete-node
-      filesystem: root
-      mode: 0744
-      contents:
-        inline: |
-          #!/bin/bash
-          set -e
-          exec /usr/bin/rkt run \
-            --trust-keys-from-https \
-            --volume config,kind=host,source=/etc/kubernetes \
-            --mount volume=config,target=/etc/kubernetes \
-            --insecure-options=image \
-            docker://k8s.gcr.io/hyperkube:v1.10.2 \
-            --net=host \
-            --dns=host \
-            --exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)
@@ -68,18 +68,6 @@ systemd:
         RestartSec=5
         [Install]
         WantedBy=multi-user.target
-    - name: delete-node.service
-      enable: true
-      contents: |
-        [Unit]
-        Description=Waiting to delete Kubernetes node on shutdown
-        [Service]
-        Type=oneshot
-        RemainAfterExit=true
-        ExecStart=/bin/true
-        ExecStop=/etc/kubernetes/delete-node
-        [Install]
-        WantedBy=multi-user.target
 storage:
   files:
     - path: /etc/kubernetes/kubeconfig
@@ -100,22 +88,6 @@ storage:
       contents:
         inline: |
           fs.inotify.max_user_watches=16184
-    - path: /etc/kubernetes/delete-node
-      filesystem: root
-      mode: 0744
-      contents:
-        inline: |
-          #!/bin/bash
-          set -e
-          exec /usr/bin/rkt run \
-            --trust-keys-from-https \
-            --volume config,kind=host,source=/etc/kubernetes \
-            --mount volume=config,target=/etc/kubernetes \
-            --insecure-options=image \
-            docker://k8s.gcr.io/hyperkube:v1.10.2 \
-            --net=host \
-            --dns=host \
-            --exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)
 passwd:
   users:
     - name: core