
Upgrade GKE due to 2 CVEs #511

Closed
willingc opened this issue Mar 12, 2018 · 10 comments

Comments

@willingc
Contributor

Although @yuvipanda's work on security prior to these CVEs prevents us from being impacted, his recommendation is to go ahead and upgrade (though there's no hair-on-fire rush).


Reference

The Kubernetes project recently disclosed new security vulnerabilities CVE-2017-1002101 and CVE-2017-1002102, allowing containers to access files outside the container. All Google Kubernetes Engine (GKE) nodes are affected by these vulnerabilities, and we recommend that you upgrade to the latest patch version as soon as possible, as we detail below.

What should I do?

Due to the severity of these vulnerabilities, whether you have node-autoupgrade enabled or not, we recommend that you manually upgrade your nodes as soon as the patch becomes available. The patch will be available for all customers by March 16th, but it may be available for you sooner based on the zone your cluster is in, according to the rollout schedule.

@minrk
Member

minrk commented Mar 13, 2018

We can do the upgrade with:

gcloud container clusters upgrade prod-a --master --cluster-version=1.8.9-gke.0
gcloud container clusters upgrade prod-a --cluster-version=1.8.9-gke.0

The one thing I don't know is whether we will see any downtime when we do this. I know everything stays running during the upgrade, but I don't know whether we can still schedule new pods. If we can, then we should be able to do this at any time.
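
For reference, one way to find out would be to try scheduling a throwaway pod while the master upgrade is in progress. A rough sketch (the pod name and image are just placeholders):

# try to schedule a short-lived test pod during the master upgrade;
# if the API server is unavailable, this will simply fail
kubectl run schedule-test --image=busybox --restart=Never --command -- sleep 60
kubectl get pod schedule-test
# clean up afterwards
kubectl delete pod schedule-test

If those calls keep working throughout, new pods can still be scheduled and the timing of the master upgrade doesn't matter much.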

@betatim
Member

betatim commented Mar 13, 2018

1.8.9 isn't yet available in our zone (us-central1-a); we can check available versions with gcloud container get-server-config --zone us-central1-a. 1.8.9 should become available by March 16th at the latest.
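
For the record, the check looks something like this (the --format filter is optional and assumes the field is called validMasterVersions):

# list all master/node versions currently offered in our zone
gcloud container get-server-config --zone us-central1-a
# or just the valid master versions
gcloud container get-server-config --zone us-central1-a --format='value(validMasterVersions)'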

@minrk
Member

minrk commented Mar 15, 2018

From the release notes, this will be March 16 for us (tomorrow). I think we should take this course of action tomorrow:

We should first perform the upgrade on staging, then proceed to prod once staging is finished. It's unclear to me whether new pods can be launched while the master is being upgraded. If not, there will be a period of downtime for new launches while we upgrade the master. It should only be a few minutes.

Upgrading the prod nodes will take a long time because nodes are only replaced with upgraded ones once they have been drained. This means we will rely on better culling of user pods on these nodes in order for them to be upgraded. I think this will require manual intervention: since the upgrade process cordons nodes, we are going to have to reach in and kill any user pods still on those nodes after some maximum age (maybe one or two hours?), as sketched below.
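
A rough sketch of that manual step, assuming the production namespace is prod and user pods carry the component=singleuser-server label from the JupyterHub chart (<node> and <pod-name> are placeholders):

# list user pods still running on a cordoned node
kubectl get pods -n prod -o wide -l component=singleuser-server | grep <node>
# delete the old ones so the node can finish draining
kubectl delete pod <pod-name> -n prod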

@choldgraf
Member

choldgraf commented Mar 15, 2018 via email

@yuvipanda
Contributor

yuvipanda commented Mar 15, 2018 via email

@minrk
Member

minrk commented Mar 16, 2018

Thanks. AM Europe seems like the best time to do the master upgrade, then. 1.8.9 isn't available yet, but I'll keep an eye on it and begin the upgrade if it shows up soon.

@minrk
Member

minrk commented Mar 16, 2018

Since it didn't happen AM today, we can either try to do the upgrade over the weekend, or start it Monday AM CET. I likely won't be around much of the weekend, and won't expect anyone else to be, either. My understanding of this security issue is that waiting until Monday ought to be fine. Since that's still Sunday evening in the West, I suspect that's probably close to our lowest traffic time.

@minrk
Member

minrk commented Mar 19, 2018

Beginning upgrade of staging now

@minrk
Member

minrk commented Mar 19, 2018

Staging and prod are updated to 1.8.9-gke.1. The staging deploy revealed an incompatibility between the grafana docker image and the security fix, which required using a custom image, since upstream has not yet merged a fix (#518). The deploy went smoothly after that, though I believe a number of users (~100) were kicked off their Binders. It's unclear how many of those were 'active', given the current state of culling.

It was my understanding that kubernetes would drain nodes and only kill them when they become unoccupied. This was not the case, as can be seen in the discontinuities in the pods-per-node graph:

[screenshot: pods-per-node graph, 2018-03-19 13:52:51]

The upgrade began at 11:50 CET, with bmsw the first to be cordoned. After ~10 minutes, that node was culled and upgraded with ~85 user pods still running. The second node to be upgraded was z2hg, which drained for suspiciously close to exactly one hour. It was deleted and replaced with 17 user pods still running. My guess is that there's a draining timeout of one hour after which point kubernetes kills any pods that are preventing the upgrade. There may have been a malfunction in this mechanism that allowed bmsw to be upgraded prematurely.
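
If we want to test that hypothesis next time, the cluster events around the node replacement should show the evictions; something like the following, though the exact event wording is a guess:

# look for eviction/kill events around the time the nodes were replaced
kubectl get events -n prod --sort-by=.lastTimestamp | grep -iE 'evict|kill|drain'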

The production upgrade, which covered two to-be-upgraded nodes (three nodes total, but one added by the autoscaler this morning was already upgraded), took 78 minutes. I would have expected it to take two hours, one hour per node.
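
For completeness, a quick sanity check that master and nodes ended up on the intended version (the gcloud field names here are from memory):

# kubelet version per node
kubectl get nodes
# master and node versions as GKE reports them
gcloud container clusters describe prod-a --zone us-central1-a --format='value(currentMasterVersion,currentNodeVersion)'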

@minrk minrk closed this as completed Mar 19, 2018
@choldgraf
Member

choldgraf commented Mar 19, 2018 via email
