From ae8bedb9a0d999bfbe97b6e18dc2eff62f0fcb80 Mon Sep 17 00:00:00 2001
From: Andrey Smirnov
Date: Wed, 10 Mar 2021 00:55:21 +0300
Subject: [PATCH] docs: add control plane conversion guide and 0.9 upgrade notes

These docs are critical to get 0.9.0-beta released.

Signed-off-by: Andrey Smirnov
---
 .../v0.9/Guides/converting-control-plane.md  | 255 ++++++++++++++++++
 .../docs/v0.9/Guides/upgrading-talos.md      |  63 ++++-
 .../content/docs/v0.9/Learn More/upgrades.md |  16 --
 3 files changed, 309 insertions(+), 25 deletions(-)
 create mode 100644 website/content/docs/v0.9/Guides/converting-control-plane.md

diff --git a/website/content/docs/v0.9/Guides/converting-control-plane.md b/website/content/docs/v0.9/Guides/converting-control-plane.md
new file mode 100644
index 0000000000..6f11b0bea7
--- /dev/null
+++ b/website/content/docs/v0.9/Guides/converting-control-plane.md
@@ -0,0 +1,255 @@
---
title: "Converting Control Plane"
description: "How to convert the Talos self-hosted Kubernetes control plane (pre-0.9) to one based on static pods."
---

Talos version 0.9 runs the Kubernetes control plane in a new way: as static pods managed by Talos.
Talos version 0.8 and below run a self-hosted control plane.
After the Talos OS upgrade to version 0.9, the Kubernetes control plane should be converted to run as static pods.

This guide describes the automated conversion script and also walks through the manual conversion process in detail.

## Automated Conversion

First, make sure all nodes are updated to Talos 0.9:

```bash
$ kubectl get nodes -o wide
NAME                     STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-default-master-1   Ready    control-plane,master   58m   v1.20.4   172.20.0.2    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-master-2   Ready    control-plane,master   58m   v1.20.4   172.20.0.3    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-master-3   Ready    control-plane,master   58m   v1.20.4   172.20.0.4    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-worker-1   Ready    <none>                 58m   v1.20.4   172.20.0.5    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
```
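
The same check can be made over the Talos API; a sketch, assuming the example cluster's node IPs (the `--short` flag prints just the version numbers):

```bash
# Query the installed Talos version on all control plane nodes directly
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 version --short
```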

Start the conversion script:

```bash
$ talosctl -n <IP> convert-k8s
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
current self-hosted status: true
gathering control plane configuration
aggregator CA key can't be recovered from bootkube-bootstrapped control plane, generating new CA
patching master node "172.20.0.2" configuration
patching master node "172.20.0.3" configuration
patching master node "172.20.0.4" configuration
waiting for static pod definitions to be generated
waiting for manifests to be generated
Talos generated control plane static pod definitions and bootstrap manifests, please verify them with commands:
    talosctl -n <IP> get StaticPods.kubernetes.talos.dev
    talosctl -n <IP> get Manifests.kubernetes.talos.dev

bootstrap manifests will only be applied for missing resources, existing resources will not be updated
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]:
```

The script stops at this point, waiting for confirmation.
Talos is still running the self-hosted control plane, and static pods have not been rendered yet.

As instructed by the script, verify that the static pod definitions are correct:

```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
  namespace: controlplane
  type: StaticPods.kubernetes.talos.dev
  id: kube-apiserver
  version: 1
  phase: running
spec:
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      talos.dev/config-version: "2"
      talos.dev/secrets-version: "1"
    creationTimestamp: null
    labels:
      k8s-app: kube-apiserver
      tier: control-plane
    name: kube-apiserver
    namespace: kube-system
  spec:
    containers:
    - command:
...
```

Static pod definitions are generated from the machine configuration and should match the pod templates generated by Talos when the self-hosted control plane was bootstrapped, unless manual changes were applied to the daemonset specs after bootstrap.
Talos patches the machine configuration with the container image versions scraped from the daemonset definitions and fetches the service account key from Kubernetes secrets.

The aggregator CA can't be recovered from the self-hosted control plane, so a new CA gets generated.
This is generally harmless and not visible from outside the cluster.
The aggregator CA is _not_ the same CA used by Talos or the standard Kubernetes API.
It is a special PKI used for aggregating API extension services inside your cluster.
If you have non-standard apiserver aggregations (fairly rare, and you should know if you do), then you may need to restart these services after the new CA is in place.

Verify that the bootstrap manifests are correct:

```bash
$ talosctl -n <IP> get manifests --namespace controlplane
NODE         NAMESPACE      TYPE       ID                               VERSION
172.20.0.2   controlplane   Manifest   00-kubelet-bootstrapping-token   1
172.20.0.2   controlplane   Manifest   01-csr-approver-role-binding     1
172.20.0.2   controlplane   Manifest   01-csr-node-bootstrap            1
172.20.0.2   controlplane   Manifest   01-csr-renewal-role-binding      1
172.20.0.2   controlplane   Manifest   02-kube-system-sa-role-binding   1
172.20.0.2   controlplane   Manifest   03-default-pod-security-policy   1
172.20.0.2   controlplane   Manifest   10-kube-proxy                    1
172.20.0.2   controlplane   Manifest   11-core-dns                      1
172.20.0.2   controlplane   Manifest   11-core-dns-svc                  1
172.20.0.2   controlplane   Manifest   11-kube-config-in-cluster        1
```

```bash
$ talosctl -n <IP> get manifests --namespace=extras
NODE         NAMESPACE   TYPE       ID                                                         VERSION
172.20.0.2   extras      Manifest   05-https://docs.projectcalico.org/manifests/calico.yaml   1
```

Make sure that the manifests and static pods are correct across all control plane nodes, as each node reconciles
control plane state on its own.
For example, the CNI configuration in the machine config should be in sync across all the nodes.
Talos nodes try to create any missing Kubernetes resources from the manifests, but they never
update or delete existing resources.

If something looks wrong, the script can be aborted, and the machine configuration should be updated to fix the problem.
Once the configuration is updated, the script can be restarted.
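
Since each node renders control plane state on its own, it helps to query all control plane nodes in one command; `talosctl` accepts a comma-separated list of nodes (a sketch, using the example cluster's IPs):

```bash
# Compare generated static pod definitions and bootstrap manifests across nodes
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 get staticpods --namespace controlplane
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 get manifests --namespace controlplane
```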

If the static pod definitions and manifests look good, confirm the next step to disable `pod-checkpointer`:

```bash
$ talosctl -n <IP> convert-k8s
...
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]: yes
disabling pod-checkpointer
deleting daemonset "pod-checkpointer"
checking for active pod checkpoints
2021/03/09 23:37:25 retrying error: found 3 active pod checkpoints: [pod-checkpointer-655gc-talos-default-master-3 pod-checkpointer-pw6mv-talos-default-master-1 pod-checkpointer-zdw9z-talos-default-master-2]
2021/03/09 23:42:25 retrying error: found 1 active pod checkpoints: [pod-checkpointer-pw6mv-talos-default-master-1]
confirm applying static pod definitions and manifests [yes/no]:
```

The self-hosted control plane runs `pod-checkpointer` to work around control plane availability issues.
It should be disabled before the conversion starts to allow the self-hosted control plane to be removed.
It takes around 5 minutes for the `pod-checkpointer` to be fully disabled.
The script verifies that all checkpoints are removed before proceeding.

This last confirmation is the point of no return: once confirmed, there is no way to keep running the self-hosted control plane.
Static pods are released, bootstrap manifests are applied, and the self-hosted control plane is removed.

```bash
$ talosctl -n <IP> convert-k8s
...
confirm applying static pod definitions and manifests [yes/no]: yes
removing self-hosted initialized key
waiting for static pods for "kube-apiserver" to be present in the API server state
waiting for static pods for "kube-controller-manager" to be present in the API server state
waiting for static pods for "kube-scheduler" to be present in the API server state
deleting daemonset "kube-apiserver"
waiting for static pods for "kube-apiserver" to be present in the API server state
deleting daemonset "kube-controller-manager"
waiting for static pods for "kube-controller-manager" to be present in the API server state
deleting daemonset "kube-scheduler"
waiting for static pods for "kube-scheduler" to be present in the API server state
conversion process completed successfully
```

As soon as the control plane static pods are rendered, the kubelet starts them.
It is expected that the pods for `kube-apiserver` will crash initially:
only one `kube-apiserver` can be bound to the host `Node`'s port 6443 at a time.
Eventually, the old `kube-apiserver` will be killed, and the new one will be able to start.
This is all handled automatically.
The script will continue by removing each self-hosted daemonset and verifying that the static pods are ready and healthy.
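
After the conversion, the control plane static pods can also be inspected from the Kubernetes side; a sketch, relying on the `tier=control-plane` label visible in the static pod definitions above:

```bash
# Static pods show up in the API server as mirror pods named <component>-<node-name>
# and carry the tier=control-plane label
kubectl -n kube-system get pods -l tier=control-plane -o wide
```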

## Manual Conversion

Check that Talos runs the self-hosted control plane:

```bash
$ talosctl -n <IP> get bs
NODE         NAMESPACE   TYPE              ID              VERSION   SELF HOSTED
172.20.0.2   runtime     BootstrapStatus   control-plane   2         true
```

The Talos machine configuration needs to be updated to the 0.9 format; there are two new required machine configuration settings:

* `.cluster.serviceAccount` is the PEM-encoded service account private key.
* `.cluster.aggregatorCA` is the aggregator CA for `kube-apiserver` (certificate and private key).

The current service account key can be fetched from Kubernetes secrets:

```bash
$ kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}'
LS0tLS1CRUdJTiBSU0EgUFJJVkFURS...
```

The machine configuration of every control plane node should be patched with the service account key:

```bash
$ talosctl -n <IP1>,<IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "LS0tLS1CRUdJTiBSU0EgUFJJVkFURS..."}}]'
patched mc at the node 172.20.0.2
```

The aggregator CA can be generated using OpenSSL or any other certificate generation tool: an RSA or ECDSA certificate with CN `front-proxy`, valid for 10 years.
The PEM-encoded CA certificate and key should be base64-encoded and patched into the machine config at the path `/cluster/aggregatorCA`:

```bash
$ talosctl -n <IP1>,<IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "S0tLS1CRUdJTiBDRVJUSUZJQ...", "key": "LS0tLS1CRUdJTiBFQy..."}}]'
patched mc at the node 172.20.0.2
```
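
For reference, a minimal sketch of generating such a CA with OpenSSL (an ECDSA key is used here; file names are illustrative, and `base64 -w0` assumes GNU coreutils):

```bash
# Generate an ECDSA P-256 private key and a self-signed CA certificate
# with CN "front-proxy", valid for 10 years
openssl ecparam -name prime256v1 -genkey -noout -out aggregator-ca.key
openssl req -x509 -new -key aggregator-ca.key -subj "/CN=front-proxy" -days 3650 -out aggregator-ca.crt

# Base64-encode the PEM files to produce the values for the machine config patch
base64 -w0 aggregator-ca.crt   # value for /cluster/aggregatorCA/crt
base64 -w0 aggregator-ca.key   # value for /cluster/aggregatorCA/key
```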

At this point, static pod definitions and bootstrap manifests should be rendered; please see "Automated Conversion" above for how to verify the generated objects.
Feel free to continue refining your machine configuration until the generated static pod definitions and bootstrap manifests look good.

If static pod definitions are not generated, check the logs with `talosctl -n <IP> logs controller-runtime`.

Disable `pod-checkpointer` with:

```bash
$ kubectl -n kube-system delete ds pod-checkpointer
daemonset.apps "pod-checkpointer" deleted
```

Wait for all pod checkpoints to be removed:

```bash
$ kubectl -n kube-system get pods
NAME                                            READY   STATUS    RESTARTS   AGE
...
pod-checkpointer-8q2lh-talos-default-master-2   1/1     Running   0          3m34s
pod-checkpointer-nnm5w-talos-default-master-3   1/1     Running   0          3m24s
pod-checkpointer-qnmdt-talos-default-master-1   1/1     Running   0          2m21s
```

Pod checkpoints can be identified by the `checkpointer.alpha.coreos.com/checkpoint-of` annotation.

Once all the pod checkpoints are removed (it takes about 5 minutes), proceed by removing the self-hosted initialized key:

```bash
talosctl -n <IP> convert-k8s --remove-initialized-key
```

Talos controllers will now render the static pod definitions, and the kubelet will launch the resulting static pods.

Once the static pods are visible in the `kubectl get pods -n kube-system` output, proceed by removing each of the self-hosted daemonsets:

```bash
$ kubectl -n kube-system delete daemonset kube-apiserver
daemonset.apps "kube-apiserver" deleted
```

Make sure the static pods for `kube-apiserver` started successfully and are running and ready.

Proceed by deleting the `kube-controller-manager` and `kube-scheduler` daemonsets, verifying that the static pods are running after each step:

```bash
$ kubectl -n kube-system delete daemonset kube-controller-manager
daemonset.apps "kube-controller-manager" deleted
```

```bash
$ kubectl -n kube-system delete daemonset kube-scheduler
daemonset.apps "kube-scheduler" deleted
```

diff --git a/website/content/docs/v0.9/Guides/upgrading-talos.md b/website/content/docs/v0.9/Guides/upgrading-talos.md
index 7112a5a944..f0f41e2215 100644
--- a/website/content/docs/v0.9/Guides/upgrading-talos.md
+++ b/website/content/docs/v0.9/Guides/upgrading-talos.md
@@ -3,7 +3,8 @@ title: Upgrading Talos
 ---
 
 Talos upgrades are effected by an API call.
-The `talosctl` CLI utility will facilitate this, or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager).
+The `talosctl` CLI utility will facilitate this.
+
 
 ## Video Walkthrough
 
@@ -11,6 +12,45 @@ To see a live demo of this writeup, see the video below:
 
 
 
+## Upgrading from Talos 0.8
+
+Talos 0.9 drops support for `bootkube` and the self-hosted control plane.
+
+Please make sure Talos is upgraded to the latest patch release of 0.8 first (0.8.4 at the moment
+of this writing), then proceed with upgrading to the latest patch release of 0.9.
+
+### Before Upgrade to 0.9
+
+If the cluster was bootstrapped on Talos version < 0.8.3, add checkpointer annotations to
+the `kube-scheduler` and `kube-controller-manager` daemonsets to improve the resiliency of the
+self-hosted control plane across reboots (this is critical for clusters with a single control plane node):
+
+```bash
+$ kubectl -n kube-system patch daemonset kube-controller-manager --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
+daemonset.apps/kube-controller-manager patched
+$ kubectl -n kube-system patch daemonset kube-scheduler --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
+daemonset.apps/kube-scheduler patched
+```
+
+Talos 0.9 only supports Kubernetes versions 1.19.x and 1.20.x.
+If running 1.18.x, please upgrade Kubernetes before upgrading Talos.
+
+Make sure the cluster is running the latest patch release of Talos 0.8.
+
+Prepare by downloading the `talosctl` binary for Talos release 0.9.x.
+
+### After Upgrade to 0.9
+
+After the upgrade to 0.9, Talos will still be running the self-hosted control plane until the [conversion process](../converting-control-plane/) is run.
+
+> Note: Talos 0.9 doesn't include the bootkube recovery option (`talosctl recover`), so
+> it's not possible to recover the self-hosted control plane after upgrading to 0.9.
+
+As soon as all the nodes are upgraded to 0.9, run `talosctl convert-k8s` to convert the control plane
+to the new static pod format of 0.9.
+
+Once the conversion process is complete, Kubernetes can be upgraded.
+
 ## `talosctl` Upgrade
 
 To manually upgrade a Talos node, you will specify the node's IP address and the
@@ -29,6 +69,10 @@ There is an option to this command: `--preserve`, which can be used to explicitl
 In most cases, it is correct to just let Talos perform its default action.
 However, if you are running a single-node control-plane, you will want to make sure that `--preserve=true`.
 
+If Talos fails to run the upgrade, the `--staged` flag may be used to perform the upgrade after a reboot,
+which is followed by another reboot into the upgraded version.
+
+
-## Changelog and Upgrade Notes
+## Machine Configuration Changes
 
-In an effort to create more production ready clusters, Talos will now taint control plane nodes as unschedulable.
-This means that any application you might have deployed must tolerate this taint if you intend on running the application on control plane nodes.
+Talos 0.9 introduces new required parameters in the machine configuration:
 
-Another feature you will notice is the automatic uncordoning of nodes that have been upgraded.
-Talos will now uncordon a node if the cordon was initiated by the upgrade process.
+* `.cluster.aggregatorCA`
+* `.cluster.serviceAccount`
 
-### Talosctl
+Talos supports both ECDSA and RSA certificates and keys for Kubernetes and etcd, with ECDSA being the default.
+Talos <= 0.8 supports only RSA keys and certificates.
 
-The `talosctl` CLI now requires an explicit set of nodes.
-This can be configured with `talos config nodes` or set on the fly with `talos --nodes`.
+The `talosctl gen config` utility generates configuration in the 0.9 format by default, which is not compatible with
+Talos 0.8; the old format can be generated with `talosctl gen config --talos-version=v0.8`.
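
As an illustration of the `--talos-version` flag mentioned above, a sketch of generating 0.8-compatible machine configuration (the cluster name and endpoint are placeholders):

```bash
# Generate machine configuration in the older format understood by Talos 0.8
talosctl gen config my-cluster https://<load balancer IP or DNS>:6443 --talos-version=v0.8
```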
diff --git a/website/content/docs/v0.9/Learn More/upgrades.md b/website/content/docs/v0.9/Learn More/upgrades.md
index d5419636ae..00c9571217 100644
--- a/website/content/docs/v0.9/Learn More/upgrades.md
+++ b/website/content/docs/v0.9/Learn More/upgrades.md
@@ -109,19 +109,3 @@ automatically?
 **A.** Yes.
 We provide the [Talos Controller Manager](https://github.com/talos-systems/talos-controller-manager)
 to perform this maintenance in a simple, controllable fashion.
-
-## Upgrade Notes for Talos 0.8
-
-Talos 0.8 comes with new [KSPP requirements](https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings) compliance check.
-
-Following kernel arguments are mandatory for Talos to boot successfully:
-
-- `init_on_alloc=1`: required by KSPP
-- `slab_nomerge`: required by KSPP
-- `pti=on`: required by KSPP
-
-Talos installer automatically injects those args while installing Talos, so this mostly is required when PXE booting Talos.
-
-## Kubernetes
-
-Kubernetes upgrades with Talos are covered in a [separate document](../../guides/upgrading-kubernetes/).