Merge pull request #592 from kinvolk/surajssd/how-to-upgrade-bootstrap-kubelet

docs: How to upgrade bootstrap kubelet?
surajssd authored Aug 27, 2020
2 parents 08b80a2 + b90ded7 commit b889093
Showing 1 changed file: docs/how-to-guides/upgrade-bootstrap-kubelet.md (113 additions, 0 deletions)
# Upgrading bootstrap kubelet

## Contents

- [Introduction](#introduction)
- [Steps](#steps)
- [Step 1: Drain the node](#step-1-drain-the-node)
- [Step 2: Find out the IP and SSH](#step-2-find-out-the-ip-and-ssh)
- [Step 3: Upgrade kubelet on node](#step-3-upgrade-kubelet-on-node)
- [Step 4: Verify](#step-4-verify)
- [Caveats](#caveats)

## Introduction

[Kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) is a daemon
that runs on every node and is responsible for managing Pods on the node.

A Lokomotive cluster runs two different sets of kubelet processes. Initially, the **bootstrap**
kubelet, configured on the node as a `rkt` pod, joins the cluster; then a kubelet pod managed by a
DaemonSet (the self-hosted kubelet) takes over from the bootstrap kubelet. The self-hosted kubelet
allows seamless updates between Kubernetes patch versions and node configuration changes using
tools like `kubectl`.

Currently `lokoctl` cannot update the bootstrap kubelet, so this document explains how to perform
this operation manually.
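
To see which version the bootstrap kubelet is currently configured with, you can inspect the
kubelet environment file on the node. This is a minimal sketch, assuming the
`/etc/kubernetes/kubelet.env` file described in Step 3:

```bash
# On the node (over SSH): print the image tag used by the bootstrap kubelet.
grep -i kubelet_image_tag /etc/kubernetes/kubelet.env
```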

## Steps

Perform the following steps on each node, one node at a time.

### Step 1: Drain the node

> **Caution:** If a workload uses a local directory for storage, it will be disrupted by this
> operation. To avoid this, move the workload to another node and let the application replicate
> the data. If the application does not support data replication across instances, expect
> downtime.

```bash
kubectl drain --ignore-daemonsets <node name>
```
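
After the drain completes, you can confirm that the node is cordoned; showing
`SchedulingDisabled` in the `STATUS` column is standard `kubectl` behaviour for a cordoned node:

```bash
kubectl get node <node name>
# STATUS should show something like: Ready,SchedulingDisabled
```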

### Step 2: Find out the IP and SSH

Find the IP of the node by visiting the cloud provider dashboard. Then, connect to the selected
machine using SSH.

```bash
ssh core@<IP Address>
```
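
Alternatively, if you still have `kubectl` access, the node addresses can also be listed directly;
the `-o wide` output includes the `INTERNAL-IP` and, where applicable, `EXTERNAL-IP` columns:

```bash
kubectl get node <node name> -o wide
```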

### Step 3: Upgrade kubelet on node

Run the following commands:

> **NOTE**: Before proceeding to the other commands, set the `latest_kube` variable to the latest
> Kubernetes version. The latest Kubernetes version can be found by running the following command
> after a cluster upgrade: `kubectl version -ojson | jq -r '.serverVersion.gitVersion'`.

```bash
export latest_kube=<latest kubernetes version e.g. v1.18.0>
sudo sed -i "s|$(grep -i kubelet_image_tag /etc/kubernetes/kubelet.env)|KUBELET_IMAGE_TAG=${latest_kube}|g" /etc/kubernetes/kubelet.env
sudo systemctl restart kubelet
sudo journalctl -fu kubelet
```
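
You can also confirm that the substitution took effect before or after restarting the service; the
version shown below is only an example:

```bash
grep KUBELET_IMAGE_TAG /etc/kubernetes/kubelet.env
# Expected output, assuming latest_kube was set to v1.18.0:
# KUBELET_IMAGE_TAG=v1.18.0
```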

Check the logs carefully. If the kubelet fails to restart and instructs you to take some action
(e.g. deleting a file), follow the instructions and then reboot the node:

```bash
sudo reboot
```

### Step 4: Verify

**When `disable_self_hosted_kubelet` is `true`**:

Once the node reboots and the kubelet rejoins the cluster, the output of the following command
will show the new version against the node name:

```bash
kubectl get nodes
```
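
For example, on a hypothetical cluster upgraded to `v1.18.0`, the output might look like the
following (node names, roles and ages are illustrative):

```
NAME              STATUS   ROLES    AGE   VERSION
controller-node   Ready    <none>   40d   v1.18.0
worker-node-1     Ready    <none>   40d   v1.18.0
```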

**When `disable_self_hosted_kubelet` is `false`**:

Verify that the kubelet service is in the active (running) state:

```bash
sudo systemctl status --no-pager kubelet
```
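
In the output, look for the `Active:` line; an abbreviated, illustrative example:

```
● kubelet.service
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; ...)
   Active: active (running) since ...
```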

Run the following command to see the logs of the process since its last restart:

```bash
sudo journalctl _SYSTEMD_INVOCATION_ID=$(sudo systemctl \
show -p InvocationID --value kubelet)
```

Once you see the following log lines, you can tell that the kubelet daemon has come up without
errors. The kubelet daemon rejoins the cluster, is then taken over by the self-hosted kubelet pod,
and you see the following logs:

```
Version: <latest_kube>
acquiring file lock on "/var/run/lock/kubelet.lock"
```

## Caveats

- When upgrading kubelet on nodes which are running Rook Ceph, verify that the Ceph cluster is in
  the **`HEALTH_OK`** state (see the sketch below). If it is in any other state, **do not proceed
  with the upgrade**, as doing so could lead to data loss. When the cluster is in the `HEALTH_OK`
  state, it can tolerate the downtime caused by rebooting nodes.
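
A minimal sketch for checking the Ceph health state from the Rook toolbox, assuming a toolbox pod
is deployed and labelled `app=rook-ceph-tools` in the `rook` namespace (the namespace and label are
assumptions; adjust them to your installation):

```bash
# Assumed namespace and label selector; adjust to match your Rook Ceph setup.
TOOLBOX_POD=$(kubectl -n rook get pod -l app=rook-ceph-tools -o name | head -n 1)
kubectl -n rook exec "${TOOLBOX_POD}" -- ceph health
# Proceed with the upgrade only if this prints: HEALTH_OK
```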
