Upgrade GKE due to 2 CVEs #511
We can do the upgrade with:
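Presumably something along these lines; the cluster name `staging`, zone `us-central1-a`, and target version `1.8.9-gke.1` are assumptions based on later comments, not a record of the exact command:

```sh
# Assumed names: cluster "staging", zone us-central1-a, target version 1.8.9-gke.1.
# Upgrade the master first...
gcloud container clusters upgrade staging \
    --zone us-central1-a --cluster-version 1.8.9-gke.1 --master

# ...then roll the node pool to the same version.
gcloud container clusters upgrade staging \
    --zone us-central1-a --cluster-version 1.8.9-gke.1
```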
The one thing I don't know is if we will see any downtime when we do this. I know everything stays running when this happens, but I don't know if we can still schedule new pods. If we can, then we should be able to do this at any time.
1.8.9 isn't yet available in our zone (us-central1-a); we can check available versions with:
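Presumably something like this (output field names may differ between gcloud releases):

```sh
# Prints validMasterVersions / validNodeVersions for the zone.
gcloud container get-server-config --zone us-central1-a
```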
From the release notes (https://cloud.google.com/kubernetes-engine/release-notes), this will be March 16 for us (tomorrow). I think we should take this course of action tomorrow:

We should first perform the upgrade on staging, then proceed to prod once staging is finished. It's unclear to me whether launching new pods is possible while the master is being upgraded. If not, there will be a period of downtime for new launches while we upgrade the master; it should only be a few minutes.

Upgrading the prod nodes will take a long time, because nodes are only replaced with upgraded ones once they are drained. This means we will rely on better culling of user pods on these nodes in order for them to be upgraded. I think this will require manual intervention: since the upgrade process cordons nodes, we are going to have to reach in and kill any user pods still on those nodes after some maximum age (maybe one or two hours?).
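A rough sketch of what that manual intervention could look like; the namespace `prod` and the node/pod names are placeholders, not values from this thread:

```sh
# See which user pods are still scheduled on a cordoned node.
kubectl get pods --namespace prod -o wide | grep <cordoned-node-name>

# Kill a lingering user pod so the drain can finish.
kubectl delete pod --namespace prod <user-pod-name>
```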
Sounds like a good plan to me. I can try to keep an eye on the pod/node distribution and delete as necessary.
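One rough way to watch that distribution during the rollout (this assumes NODE is the 8th column of `-o wide` output, which can vary by kubectl version):

```sh
# Count running pods per node across all namespaces.
kubectl get pods --all-namespaces -o wide --no-headers | awk '{print $8}' | sort | uniq -c
```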
Yes, master upgrades will cause an outage.

We should switch to regional clusters (https://cloud.google.com/kubernetes-engine/docs/concepts/multi-zone-and-regional-clusters) at some point, which allow zero-downtime upgrades and autoscaler changes. Unfortunately that requires creating a cluster from scratch...
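For future reference, a regional cluster has to be created as a brand new cluster, roughly along these lines (the cluster name and region here are made up, and at the time this may have required the gcloud beta track):

```sh
# Creates a regional cluster with the master and one node per zone replicated
# across the zones of us-central1.
gcloud container clusters create binder-regional \
    --region us-central1 --num-nodes 1
```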
Thanks. AM Europe seems like the best time to do the master upgrade, then. 1.8.9 isn't available yet, but I'll keep an eye on it and begin the upgrade if it shows up soon.
Since it didn't happen AM today, we can either try to do the upgrade over the weekend, or start it Monday AM CET. I likely won't be around much of the weekend, and won't expect anyone else to be, either. My understanding of this security issue is that waiting until Monday ought to be fine. Since that's still Sunday evening in the West, I suspect that's probably close to our lowest traffic time.
Beginning upgrade of staging now.
Staging and prod are updated to 1.8.9-gke.1.

The staging deploy revealed an incompatibility between the grafana docker image and the security fix, which required using a custom image, since upstream has not yet merged a fix (#518). The deploy went smoothly after that, though I believe a number of users (~100) were kicked off of their Binders. It's unclear how many of those were 'active', given the current state of culling.

It was my understanding that kubernetes would drain nodes and only kill them when they become unoccupied. This was not the case, as can be seen in the discontinuities in the pods-per-node graph.

The upgrade began at 11:50 CET, with bmsw the first node to be cordoned. After ~10 minutes, that node was culled and upgraded with ~85 user pods still running. The second node to be upgraded was z2hg, which drained for suspiciously close to exactly one hour; it was deleted and replaced with 17 user pods still running. My guess is that there's a draining timeout of one hour, after which point kubernetes kills any pods that are preventing the upgrade. There may have been a malfunction in this mechanism that allowed bmsw to be upgraded prematurely.

The production upgrade, with two to-be-upgraded nodes (three total nodes, but one added by the autoscaler this morning was already upgraded), took 78 minutes. I suspect it should have taken two hours, one hour per node.
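If anyone wants to double-check the timing, the upgrade operations GKE recorded should be visible with something like:

```sh
# Lists UPGRADE_MASTER / UPGRADE_NODES operations with start and end times.
gcloud container operations list --zone us-central1-a
```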
Shall we make this an incident report? Not really an "error" in some sense, but useful knowledge, I think.
Although @yuvipanda's work on security prior to these CVEs prevents us from being impacted, his recommendation is to go ahead and upgrade (though with no hair-on-fire rush).
Reference
The Kubernetes project recently disclosed new security vulnerabilities CVE-2017-1002101 and CVE-2017-1002102, allowing containers to access files outside the container. All Google Kubernetes Engine (GKE) nodes are affected by these vulnerabilities, and we recommend that you upgrade to the latest patch version as soon as possible, as we detail below.
What should I do?
Due to the severity of these vulnerabilities, whether you have node-autoupgrade enabled or not, we recommend that you manually upgrade your nodes as soon as the patch becomes available. The patch will be available for all customers by March 16th, but it may be available for you sooner based on the zone your cluster is in, according to the rollout schedule.