safer-cluster-update-variant always forces upgrade to latest GKE version. #486

Closed
skinlayers opened this issue Apr 13, 2020 · 1 comment

Comments

@skinlayers
Contributor

Transferred from #451

@morgante I traced this out last week and found that this behavior does not appear to conform to the GKE hardening recommendations.

https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster#upgrade_your_infrastructure_in_a_timely_fashion_default_2019-11-11

"CIS GKE Benchmark Recommendation: 6.5.3. Ensure Node Auto-Upgrade is enabled for GKE nodes
[...]
In Google Kubernetes Engine, the masters are patched and upgraded for you automatically. Node auto-upgrade also automatically upgrades nodes in your cluster."

https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-upgrades

"Node pools with auto-upgrades enabled are scheduled for upgrades when they meet the selection criteria (announced in the release notes). Rollouts are phased across multiple weeks to ensure cluster and fleet stability.
[...]
Note: Enabling auto-upgrades does not cause your nodes to upgrade immediately. For more information, see Cluster and node upgrades."

https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades#upgrading_automatically

"When you create a cluster using the Google Cloud Console, auto-upgrade is enabled on the cluster and its node pools by default, and Google upgrades your clusters when a new GKE version is selected for auto-upgrade.

When you create a cluster using the gcloud command or the GKE API, node auto-upgrade is currently enabled by default."

https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades#auto-upgrade-version-selection

"New GKE versions are released regularly, but a version is not selected for auto-upgrade right away. When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions.

New auto-upgrade targets are announced in the release notes. Until an available version is selected for auto-upgrade, you can upgrade to it manually. Occasionally, a version is selected for cluster auto-upgrade and node auto-upgrade during different weeks."

https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels

"Regular:
Production clusters that need features not yet offered in the Stable channel.
These versions are considered production-quality. Known issues generally have known workarounds.

Stable:
Production clusters that require stability above all else, and for which frequent upgrades are too risky. These versions are considered production-quality, with historical data to indicate that they are stable and reliable in production."

https://cloud.google.com/kubernetes-engine/docs/release-notes
https://cloud.google.com/kubernetes-engine/docs/release-notes-regular
https://cloud.google.com/kubernetes-engine/docs/release-notes-stable

https://www.terraform.io/docs/providers/google/r/container_node_pool.html#version

"version - (Optional) The Kubernetes version for the nodes in this pool. Note that if this field and auto_upgrade are both specified, they will fight each other for what the node version should be, so setting both is highly discouraged. While a fuzzy version can be specified, it's recommended that you specify explicit versions as Terraform will see spurious diffs when fuzzy versions are used. See the google_container_engine_versions data source's version_prefix field to approximate fuzzy versions in a Terraform-compatible way."

Putting all of this together, there are a few problems that stand out:

  1. For many production environments, pinning the k8s version (and dependency versions in general) is a requirement, so that clusters are not forced onto the latest version without testing.
  2. The Terraform docs say a node pool version should not be specified if auto-upgrade is also enabled. If the node pool version is set to "latest", then auto_upgrade is either disabled or conflicts with the specified version, and we're not conforming to the GKE hardening recommendations.
  3. Comparing the current output of gcloud container get-server-config --region us-central1 with the GKE release notes shows that currently:
     • us-central1 defaultClusterVersion: 1.14.10-gke.17
     • us-central1 "latest" validNodeVersions: 1.15.9-gke.12
     • stable channel: 1.14.10-gke.17
     • regular channel: 1.15.9-gke.8

We probably shouldn't be forcing version "latest" (1.15.9-gke.12), as it is ahead of the "regular" channel (1.15.9-gke.8). If we want something newer than the default "stable" channel 1.14.x releases, a better option would be to configure the cluster to use the "regular" channel with auto_upgrade, rather than explicitly setting the version.
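
As a rough sketch of that last option, something like the following might work. The resource names, cluster name, and node counts are hypothetical placeholders, not the module's actual code, and depending on the provider version the release_channel block may only be available in the google-beta provider:

```hcl
# Illustrative only: names and values are assumptions, not taken from the module.
resource "google_container_cluster" "example" {
  name     = "example-cluster"
  location = "us-central1"

  # Subscribe to the "regular" channel and let GKE choose the version,
  # instead of setting min_master_version / node_version to "latest".
  release_channel {
    channel = "REGULAR"
  }

  # Manage node pools separately so their upgrade settings are explicit.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "example" {
  name       = "example-pool"
  cluster    = google_container_cluster.example.name
  location   = google_container_cluster.example.location
  node_count = 1

  # Deliberately no "version" argument here: per the Terraform docs quoted
  # above, version and auto_upgrade fight each other when both are set.
  management {
    auto_upgrade = true
    auto_repair  = true
  }
}
```

With the version left unset and auto_upgrade enabled, the nodes should follow whatever the channel selects, which is in line with the hardening recommendation quoted earlier.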

@skinlayers
Contributor Author

Resolved by #487
