Upgrade GKE due to 2 CVEs #511
We can do the upgrade with:
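Presumably something along these lines; the cluster name `staging`, zone `us-central1-a`, and target version `1.8.9-gke.1` are assumptions based on later comments, not a record of the exact command:

```sh
# Assumed names: cluster "staging", zone us-central1-a, target version 1.8.9-gke.1.
# Upgrade the master first...
gcloud container clusters upgrade staging \
    --zone us-central1-a --cluster-version 1.8.9-gke.1 --master

# ...then roll the node pool to the same version.
gcloud container clusters upgrade staging \
    --zone us-central1-a --cluster-version 1.8.9-gke.1
```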
The one thing I don't know is if we will see any downtime when we do this. I know everything stays running when this happens, but I don't know if we can still schedule new pods. If we can, then we should be able to do this at any time.
1.8.9 isn't yet available in our zone (us-central1-a); we can check available versions with:
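Presumably something like this (output field names may differ between gcloud releases):

```sh
# Prints validMasterVersions / validNodeVersions for the zone.
gcloud container get-server-config --zone us-central1-a
```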
From the release notes (https://cloud.google.com/kubernetes-engine/release-notes), this will be March 16 for us (tomorrow). I think we should take this course of action tomorrow:

We should first perform the upgrade on staging, then proceed to prod once staging is finished. It's unclear to me whether launching new pods is possible while the master is being upgraded. If not, there will be a period of downtime for new launches while we upgrade the master; it should only be a few minutes.

Upgrading the prod nodes will take a long time, because nodes are only replaced with upgraded ones once they are drained. This means we will rely on better culling of user pods on these nodes in order for them to be upgraded. I think this will require manual intervention: since the upgrade process cordons nodes, we are going to have to reach in and kill any user pods still on those nodes after some maximum age (maybe one or two hours?).
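A rough sketch of what that manual intervention could look like; the namespace `prod` and the node/pod names are placeholders, not values from this thread:

```sh
# See which user pods are still scheduled on a cordoned node.
kubectl get pods --namespace prod -o wide | grep <cordoned-node-name>

# Kill a lingering user pod so the drain can finish.
kubectl delete pod --namespace prod <user-pod-name>
```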
Sounds like a good plan to me. I can try to keep an eye on the pod/node distribution and delete as necessary.
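One rough way to watch that distribution during the rollout (this assumes NODE is the 8th column of `-o wide` output, which can vary by kubectl version):

```sh
# Count running pods per node across all namespaces.
kubectl get pods --all-namespaces -o wide --no-headers | awk '{print $8}' | sort | uniq -c
```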
Yes, master upgrades will cause an outage.

We should switch to regional clusters (https://cloud.google.com/kubernetes-engine/docs/concepts/multi-zone-and-regional-clusters) at some point, which allow zero-downtime upgrades and autoscaler changes. Unfortunately that requires creating a cluster from scratch...
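For future reference, a regional cluster has to be created as a brand new cluster, roughly along these lines (the cluster name and region here are made up, and at the time this may have required the gcloud beta track):

```sh
# Creates a regional cluster with the master and one node per zone replicated
# across the zones of us-central1.
gcloud container clusters create binder-regional \
    --region us-central1 --num-nodes 1
```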
Thanks. AM Europe seems like the best time to do the master upgrade, then. 1.8.9 isn't available yet, but I'll keep an eye on it and begin the upgrade if it shows up soon.
Since it didn't happen AM today, we can either try to do the upgrade over the weekend, or start it Monday AM CET. I likely won't be around much of the weekend, and won't expect anyone else to be, either. My understanding of this security issue is that waiting until Monday ought to be fine. Since that's still Sunday evening in the West, I suspect that's probably close to our lowest traffic time.
Beginning upgrade of staging now.
Staging and prod are updated to 1.8.9-gke.1.

The staging deploy revealed an incompatibility between the grafana docker image and the security fix, which required using a custom image, since upstream has not yet merged a fix (#518). The deploy went smoothly after that, though I believe a number of users (~100) were kicked off of their Binders. It's unclear how many of those were 'active', given the current state of culling.

It was my understanding that kubernetes would drain nodes and only kill them when they become unoccupied. This was not the case, as can be seen in the discontinuities in the pods-per-node graph.

The upgrade began at 11:50 CET, with bmsw the first node to be cordoned. After ~10 minutes, that node was culled and upgraded with ~85 user pods still running. The second node to be upgraded was z2hg, which drained for suspiciously close to exactly one hour; it was deleted and replaced with 17 user pods still running. My guess is that there's a draining timeout of one hour, after which point kubernetes kills any pods that are preventing the upgrade. There may have been a malfunction in this mechanism that allowed bmsw to be upgraded prematurely.

The production upgrade, with two to-be-upgraded nodes (three total nodes, but one added by the autoscaler this morning was already upgraded), took 78 minutes. I suspect it should have taken two hours, one hour per node.
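If anyone wants to double-check the timing, the upgrade operations GKE recorded should be visible with something like:

```sh
# Lists UPGRADE_MASTER / UPGRADE_NODES operations with start and end times.
gcloud container operations list --zone us-central1-a
```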
Shall we make this an incident report? Not really an "error" in some sense, but useful knowledge, I think.
Although @yuvipanda's work on security prior to these CVEs prevents us from being impacted, his recommendation is to go ahead and upgrade (though with no hair-on-fire rush).
Reference
The Kubernetes project recently disclosed new security vulnerabilities CVE-2017-1002101 and CVE-2017-1002102, allowing containers to access files outside the container. All Google Kubernetes Engine (GKE) nodes are affected by these vulnerabilities, and we recommend that you upgrade to the latest patch version as soon as possible, as we detail below.
What should I do?
Due to the severity of these vulnerabilities, whether you have node-autoupgrade enabled or not, we recommend that you manually upgrade your nodes as soon as the patch becomes available. The patch will be available for all customers by March 16th, but it may be available for you sooner based on the zone your cluster is in, according to the rollout schedule.