Migrate to Fedora CoreOS #50

beyondbill · 2020-07-31T19:45:05Z

Migrate everything customized in this fork from aws/container-linux to aws/fedora-coreos
Update this fork with the latest upstream master. Notable changes are:
- Kubernetes v1.18.2 -> v1.18.8
- Terraform v0.12.x -> v0.13.x
  - Terraform v0.12.26+ compatibility
  - Require terraform-provider-ct v0.6.1
- Rename controller NoSchedule taint from node-role.kubernetes.io/master to node-role.kubernetes.io/controller
- Remove node label node.kubernetes.io/master from controller nodes (use node.kubernetes.io/controller instead)
- Deprecate CoreOS Container Linux support (use Flatcar instead)

TODOs:

- Update git commit after merging "Merge in latest upstream master branch" (terraform-render-bootstrap#6)

* Fix controller and worker ipv4/ipv4 outputs to be lists of strings
* With Terraform v0.11 syntax, an enclosing list was required to coerce the output to be a list of strings
* With Terraform v0.12 syntax, the enclosing list shouldn't be needed

* Allow generated assets (TLS materials, manifests) to be securely distributed to controller node(s) via file provisioner (i.e. ssh-agent) as an assets bundle file, rather than relying on assets being locally rendered to disk in an asset_dir and then securely distributed
* Change `asset_dir` from required to optional. Left unset, asset_dir defaults to "" and no assets will be written to files on the machine that runs terraform apply
* Enhancement: Managed cluster assets are kept only in Terraform state, which supports different backends (GCS, S3, etcd, etc) and optional encryption. terraform apply accesses state, runs in-memory, and distributes sensitive materials to controllers without making use of local disk (simplifies use in CI systems)
* Enhancement: Improve asset unpack and layout process to position etcd certificates and control plane certificates more cleanly, without unneeded secret materials

Details:

* Terraform file provisioner support for distributing directories of contents (with unknown structure) has been limited to reading from a local directory, meaning local writes to asset_dir were required. poseidon#585 discusses the problem and newer or upcoming Terraform features that might help.
* Observation: Terraform provisioner support for single files works well, but iteration isn't viable. We're also constrained to Terraform language features on the apply side (no extra plugins, no shelling out) and CoreOS / Fedora tools on the receive side.
* Take a map representation of the contents that would have been splayed out in asset_dir and pack/encode them into a single file format devised for easy unpacking. Use an awk one-liner on the receive side to unpack. In practice, this has worked well and it's rather nice that a single assets file is transferred by file provisioner (all or none)

Rel: poseidon/terraform-render-bootstrap#162
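The pack/unpack idea described above can be sketched roughly as follows. This is an illustration only, not the format Typhoon actually uses: the `##### ` header delimiter and the `pack` / `unpack` function names are assumptions made for the example, and the real implementation packs on the Terraform side and unpacks with awk on the node.

```python
from __future__ import annotations

# Hypothetical delimiter marking the start of each packed file (an assumption,
# chosen only for this sketch).
HEADER = "##### "


def pack(assets: dict[str, str]) -> str:
    """Join a {relative_path: contents} map into a single bundle string."""
    parts = []
    for path, contents in assets.items():
        # Each file becomes a header line followed by its contents.
        parts.append(f"{HEADER}{path}\n{contents}")
    return "\n".join(parts)


def unpack(bundle: str) -> dict[str, str]:
    """Split a bundle string back into a {relative_path: contents} map."""
    files: dict[str, list[str]] = {}
    current = None
    for line in bundle.split("\n"):
        if line.startswith(HEADER):
            # A header line switches the "current file" being collected.
            current = line[len(HEADER):]
            files[current] = []
        elif current is not None:
            files[current].append(line)
    return {path: "\n".join(lines) for path, lines in files.items()}
```

A round trip such as `unpack(pack(assets))` returns the original map, provided no file contains a line that starts with the delimiter. Because everything travels as one file, the transfer either delivers all assets or none, which is the property the commit message highlights.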

* Original tutorials favored including the platform (e.g. google-cloud) in modules (e.g. google-cloud-yavin). Prefer naming conventions where each module / cluster has a simple name (e.g. yavin) since the platform is usually redundant
* Retain the example cluster naming themes per platform

* Stop mapping node labels to targets discovered via Kubernetes nodes (e.g. etcd, kubelet, cadvisor). It is rarely useful to store node labels (e.g. kubernetes.io/os=linux) on these metrics
* kube-apiserver's apiserver_request_duration_seconds_bucket metric has a high cardinality that includes labels for the API group, verb, scope, resource, and component for each object type, including for each CRD. This one metric has ~10k time series in a typical cluster (between 10% and 40% of the total)
* Removing the apiserver request duration outright would make latency alerts a NoOp and break a Grafana apiserver panel. Instead, drop series that have a "group" label. Effectively, only request durations for core Kubernetes APIs will be kept (e.g. cardinality won't grow with each CRD added). This reduces the metric to ~2k unique series
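The drop rule described above amounts to a filter over label sets. A minimal sketch, assuming series are represented as plain dicts of label names to values (the `keep_series` helper is hypothetical and is not Prometheus configuration):

```python
from __future__ import annotations

# Metric named in the commit message as the high-cardinality offender.
TARGET_METRIC = "apiserver_request_duration_seconds_bucket"


def keep_series(labels: dict[str, str]) -> bool:
    """Return False for target-metric series that carry a non-empty group label."""
    is_target = labels.get("__name__") == TARGET_METRIC
    # An empty label value is treated the same as a missing label.
    has_group = labels.get("group", "") != ""
    return not (is_target and has_group)
```

Series for core Kubernetes APIs have no API group, so they pass the filter, while series for grouped APIs (including each CRD) are dropped. Other metrics are untouched even if they happen to carry a `group` label.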

* Reduce time to delete pods on unready nodes from 5m to 1m
* Present since v1.13.3, but mistakenly removed in v1.16.0 static pod control plane migration

Related:

* poseidon/terraform-render-bootstrap#148
* poseidon/terraform-render-bootstrap#164

* Binary asset locations within the upstream hyperkube image changed kubernetes/kubernetes#84662
* Fix Container Linux and Flatcar Linux kubelet.service (rkt-fly with fairly dated CoreOS kubelet-wrapper)
* Fix Fedora CoreOS kubelet.service (podman)
* Fix Fedora CoreOS bootstrap.service
* Fix delete-node kubectl usage for workers where nodes may delete themselves on shutdown (e.g. preemptible instances)

* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.0

* Rename Container Linux Config (CLC) files to *.yaml to align with Fedora CoreOS Config (FCC) files and for syntax highlighting
* Replace common uses of Terraform `element` (which wraps around) with `list[index]` syntax to surface index errors
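The difference between wrapping and strict indexing can be shown with a small Python analogy. Both helpers below are illustrative stand-ins, not Terraform code: `element` mimics the modulo behavior of Terraform's `element` function for non-negative indexes, and `strict_index` mimics `list[index]` failing on an out-of-range index.

```python
def element(items, index):
    """Mimic Terraform's element(): the index wraps around via modulo."""
    return items[index % len(items)]


def strict_index(items, index):
    """Mimic list[index]: an out-of-range index is an error, not a wrap."""
    if not 0 <= index < len(items):
        raise IndexError(f"index {index} out of range for {len(items)} items")
    return items[index]
```

With three items, `element(items, 3)` silently returns the first item, while `strict_index(items, 3)` raises. An off-by-one mistake is hidden in the first case and surfaced in the second, which is the motivation given in the commit message.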

* Change kubelet.service on Container Linux nodes to ExecStart Kubelet inline to replace the use of the host OS kubelet-wrapper script
* Express rkt run flags and volume mounts in a clear, uniform way to make the Kubelet service easier to audit, manage, and understand
* Eliminate reliance on a Container Linux kubelet-wrapper script
* Typhoon for Fedora CoreOS developed a kubelet.service that similarly uses an inline ExecStart (except with podman instead of rkt) and a more minimal set of volume mounts. Adopt the volume improvements:
  * Change Kubelet /etc/kubernetes volume to read-only
  * Change Kubelet /etc/resolv.conf volume to read-only
  * Remove unneeded /var/lib/cni volume mount

Background:

* kubelet-wrapper was added in CoreOS around the time of Kubernetes v1.0 to simplify running a CoreOS-built hyperkube ACI image via rkt-fly. The script defaults are no longer ideal (e.g. rkt's notion of trust dates back to quay.io ACI image serving and signing, which informed the OCI standard images we use today, though they still lack rkt's signing ideas).
* Shipping kubelet-wrapper was regretted at CoreOS, but remains in the distro for compatibility. The script is not updated to track hyperkube changes, but it is stable and kubelet.env overrides bridge most gaps
* Typhoon Container Linux nodes have used kubelet-wrapper to rkt/rkt-fly run the Kubelet via the official k8s.gcr.io hyperkube image using overrides (new image registry, new image format, restart handling, new mounts, new entrypoint in v1.17).
* Observation: Most of what it takes to run a Kubelet container is defined in Typhoon, not in kubelet-wrapper. The wrapper's value is now undermined by having to work around its dated defaults. Typhoon may be better served defining Kubelet.service explicitly
* Typhoon for Fedora CoreOS developed a kubelet.service without the use of a host OS kubelet-wrapper which is both clearer and eliminated some volume mounts

* Configure kube-proxy --metrics-bind-address=0.0.0.0 (default 127.0.0.1) to serve metrics on 0.0.0.0:10249
* Add firewall rules to allow Prometheus (resides on a worker) to scrape kube-proxy service endpoints on controllers or workers
* Add a clusterIP: None service for kube-proxy endpoint discovery

* Change node-exporter DaemonSet tolerations from tolerating all possible NoSchedule taints to tolerating the master taint and the not ready taint (we'd like metrics regardless)
* Users who add custom node taints must add their custom taints to the addon node-exporter DaemonSet. As an addon, it's expected that users copy and manipulate manifests out-of-band in their own systems

* Inlining the Kubelet service removed the need for the kubelet.env file declared in Ignition. However, on some platforms, this removed the guarantee that /etc/kubernetes exists. Bare-Metal and DigitalOcean distribute the kubelet kubeconfig through Terraform file provisioner (scp) and place it in (now missing) /etc/kubernetes
* poseidon#606
* Fix bare-metal and DigitalOcean Ignition to ensure the desired directory exists following first boot from disk
* Cloud platforms with worker pools distribute the kubeconfig through Ignition user data (no impact or need)

* Typhoon AWS is compatible with terraform-provider-aws v3.x releases
* Continue to allow v2.23+, no v3.x specific features are used
* Set required provider versions in the worker module, since it can be used independently

Related:

* https://github.com/terraform-providers/terraform-provider-aws/releases/tag/v3.0.0

* Recommend Terraform v0.13.x
* Support automatic install of poseidon's provider plugins
* Update tutorial docs for Terraform v0.13.x
* Add migration guide for Terraform v0.13.x (best-effort)
* Require Terraform v0.12.26+ (migration compatibility)
* Require `terraform-provider-ct` v0.6.1
* Require `terraform-provider-matchbox` v0.4.1
* Require `terraform-provider-digitalocean` v1.20+

Related:

* https://www.hashicorp.com/blog/announcing-hashicorp-terraform-0-13/
* https://www.terraform.io/upgrade-guides/0-13.html
* https://registry.terraform.io/providers/poseidon/ct/latest
* https://registry.terraform.io/providers/poseidon/matchbox/latest

bendrucker

Amazing! I found the easiest way to pick through this was this:

poseidon/typhoon@master...TakeScoop:fedora-coreos

That compares this branch to poseidon/master, which is handy for checking things like --cloud-provider=aws being added.

The only thing I caught is an output/ directory that's committed, otherwise all my spot checks looked good!

beyondbill · 2020-08-21T04:03:15Z

@bendrucker The output/ folder has been there for over 2 years. Good catch! Will remove.

dghubble and others added 30 commits November 25, 2019 22:45

Update Grafana from v6.4.4 to v6.5.0

030a4ce

* https://grafana.com/docs/guides/whats-new-in-v6-5/

Update Grafana from v6.5.0 to v6.5.1

2667408

* https://github.com/grafana/grafana/releases/tag/v6.5.1

Update mkdocs-material from v4.5.0 to v4.5.1

5fa002f

Update Kubernetes from v1.16.3 to v1.17.0

de36d99

* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md/#v1170

Update Calico from v3.10.1 to v3.10.2

c0ce04e

* https://docs.projectcalico.org/v3.10/release-notes/

Update CHANGES and tutorial notes for release

f69dc2e

* Update recommended Terraform and provider plugin versions
* Update the rough count of resources created per cluster since it's not been refreshed in a while (will vary based on cluster options)

Fix minor example typo in README

c3e22f3

Update mkdocs-material from v4.5.1 to v4.6.0

2d8e367

Update Grafana from v6.5.1 to v6.5.2

1b9fa2e

* https://github.com/grafana/grafana/releases/tag/v6.5.2

Update kube-state-metrics from v1.8.0 to v1.9.0-rc.1

0ecb995

* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.0

Add Kubelet kubeconfig output for DigitalOcean

00c431a

* Allow the raw kubelet kubeconfig to be consumed via Terraform output

Update CoreDNS from v1.6.5 to v1.6.6

daa8d9d

* https://coredns.io/2019/12/11/coredns-1.6.6-release/

Update Prometheus from v2.14.0 to v2.15.0

f48e43c

* https://github.com/prometheus/prometheus/releases/tag/v2.15.0

Update Prometheus from v2.15.0 to v2.15.1

a4e8436

* https://github.com/prometheus/prometheus/releases/tag/v2.15.1

Update Calico from v3.10.2 to v3.11.1

11565ff

* https://docs.projectcalico.org/v3.11/release-notes/

Disable Kubelet 127.0.0.1.10248 healthz endpoint

b2eb3e0

* Kubelet runs a healthz server listening on 127.0.0.1:10248 by default. It's unused by Typhoon and can be disabled
* https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

Update Prometheus from v2.15.1 to v2.15.2

73588cf

* https://github.com/prometheus/prometheus/releases/tag/v2.15.2

Allow terraform-provider-google v3.x plugin versions

b1f521f

* Typhoon Google Cloud is compatible with `terraform-provider-google` v3.x releases
* No v3.x specific features are used, so v2.19+ provider versions are still allowed, to ease migrations

beyondbill and others added 18 commits August 5, 2020 15:34

Support Fedora CoreOS OS image streams on AWS

14b54e5

Merge branch 'master' of https://github.com/poseidon/typhoon into fed…

09faa19

…ora-coreos

fix mistakes in resolving merging conflicts

7bc8066

add new security components

79fe856

fix json format

d326a67

Update Grafana from v7.1.1 to v7.1.3

e1d6ab2

* https://github.com/grafana/grafana/releases/tag/v7.1.3
* https://github.com/grafana/grafana/releases/tag/v7.1.2

Update recommended Terraform provider versions

aab0713

* Sync Terraform provider plugin versions to those used internally

fix ssl cert mounts

6fa4135

apiserver nlb should be internal

9b7e268

update terraform-render-bootstrap with latest upstream

0dea2b3

Update Terraform migration guide SHA

342380c

* Mention the first master branch SHA that introduced Terraform v0.13 forward compatibility
* Link the migration guide on GitHub until a release is available and website docs are published

Update Kubernetes from v1.18.6 to v1.18.8

c87db3e

* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1188

Update recommended Terraform provider versions

9a07f1d

* Sync Terraform provider plugin versions to those used internally
* Update mkdocs-material from v5.5.1 to v5.5.6
* Fix minor details in docs

try relabeling /etc/kubernetes/bootstrap-secrets by explicitly mounti…

e39ffcd

…ng to kubelet

relabeling does not need explicitly mounting to kubelet

4b478f4

Merge branch 'master' of https://github.com/poseidon/typhoon into fed…

41ba846

…ora-coreos

beyondbill marked this pull request as ready for review August 19, 2020 21:47

beyondbill requested a review from bendrucker August 19, 2020 21:48

beyondbill added 3 commits August 19, 2020 19:01

need to update the type label of bootstrap-secret in the newest typhoon

b91b993

update terraform-render-bootstrap with latest upstream

00244cc

rm unnecessary volume mounts on etcd

62b91be

bendrucker reviewed Aug 21, 2020

rm output/

bfc03e1

beyondbill requested a review from bendrucker August 21, 2020 04:04

bendrucker approved these changes Aug 21, 2020

beyondbill merged commit efcce41 into master Aug 21, 2020

beyondbill deleted the fedora-coreos branch August 21, 2020 15:31
