Update etcd to 3.4.10+ (3.4.13) #1327
Conversation
fixes CVE-2020-15106 bsc#1174951

Signed-off-by: Jordi Massaguer Pla <jmassaguerpla@suse.de>
This goes along with: SUSE/skuba#1327
This is the error I see:
This is the log from the etcd container that "exited":
@dannysauer Were you able to deploy a cluster with this new etcd? Can you give it a try?
It ran, but I didn't do a full cluster. I'll give that a shot now.
Well, that's not working out. My SCC access still doesn't work, so I can't enable anything on my SLES 15.2 image, which inhibits building an actual cluster. 🤦

The etcd container log above indicates that there are two nodes in this etcd cluster: 404272f994ea9271 is the one running (on 10.164.95.144), and 69736148f60bba7e, which should be reachable at 10.164.95.139:2380. However, messages sent to 69736148f60bba7e are getting "connection refused", which means either that second etcd instance isn't listening on port 2380, or the connection is being rejected by a firewall rule. I don't think that indicates an error in etcd, since this container's instance didn't log an issue with starting up and binding to port 2380. But it does cause kubeadm to fail, since the etcd cluster never appears to enter a healthy state.

I thought I'd pull down the container and see if I could replicate that behavior in a manual etcd cluster, but the registry URL in the kubelet log - registry.suse.de/devel/caasp/4.5/containers/containers/caasp/v4.5/etcd:3.4.10 - isn't something podman will pull for me. I had to use registry.suse.de/devel/caasp/4.5/branches/etcd_3.4.10/containers/caasp/v4.5/etcd:3.4.10. So I'm a little curious how the CI worked when I can't get that container to run at all here. But that's maybe a separate thing.
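For anyone else poking at this, a rough sketch of the checks described above, assuming the stock kubeadm certificate paths and the IPs from the log (all of those are assumptions):

```bash
# Ask the running member what it thinks the cluster looks like
# (cert paths are the kubeadm defaults; adjust if your layout differs).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.164.95.144:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list -w table

# On the second master (.139): is anything actually listening on the peer port?
ss -tlnp | grep 2380

# From the first master: can we reach that peer port at all?
# Any response (even a TLS/HTTP error) proves the port is open; "connection
# refused" means nothing is listening or a firewall is rejecting it.
curl -vk https://10.164.95.139:2380/
```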
Hi! I just sent you a subscription key for SCC in a private message. CI uses a cri-o feature for mirroring registry.suse.de/devel/caasp/4.5/containers/containers/ to registry.suse.de/devel/caasp/4.5/branches/etcd_3.4.10/. In short, you can see how this is done in the registries.conf configuration.
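Purely as an illustration of the mechanism (the prefix, mirror location, and drop-in path below are guesses, not the actual CI values), a mirror rule in registries.conf v2 syntax looks roughly like this:

```bash
# Hypothetical drop-in; the real CI configuration may use a different file and paths.
cat <<'EOF' > /etc/containers/registries.conf.d/99-caasp-branch-mirror.conf
[[registry]]
prefix   = "registry.suse.de/devel/caasp/4.5/containers/containers"
location = "registry.suse.de/devel/caasp/4.5/containers/containers"

[[registry.mirror]]
location = "registry.suse.de/devel/caasp/4.5/branches/etcd_3.4.10/containers"
EOF

# cri-o reads registries.conf on startup, so restart it after changing the config.
systemctl restart crio
```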
I was able to pass the tests with master, which is why I think the etcd update is the cause of the "broken deployment".
Well, if changing etcd is what breaks it, then it's reasonable to guess that etcd is the problem. 😂 Thanks for sharing the key. I'll get a cluster up this afternoon and see if I can track down where things are going wrong.
Ah. On the other master node, which is .139, there's this:
So node .144 has an etcd which (perhaps accurately) thinks it's in an existing cluster with .139, but .139 isn't starting because the permissions on the datadir are wrong. Sure enough:
After fixing the data directory permissions to what they're supposed to be (see the sketch below), the second etcd came up just fine and then the first one also came up. Provisioning failed because the second etcd was added but didn't run, which makes a 2-node etcd cluster that thinks it's in split-brain. In that case, etcd doesn't handle reads or writes, and then the API server can't function, so everything fails.
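For reference, this is essentially all the manual fix amounted to (the path is the kubeadm default data dir; the mode shown beforehand is what I saw on this node and may differ elsewhere):

```bash
# On the joining master (.139): etcd 3.4.10+ wants the data directory to be 0700.
ls -ld /var/lib/etcd     # showed drwxr-xr-x (0755) on the broken node
chmod 0700 /var/lib/etcd
ls -ld /var/lib/etcd     # drwx------ ; etcd then starts and the cluster converges
```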
How do we fix this? And why is it happening with the new etcd and not with the old one?
I wonder if it was a sporadic failure in the CI? I'm digging to see what creates that directory in the first place. I think the container expects it to exist before it starts, so I'm assuming either kubernetes or skuba creates it. Does cri-o create the directory if a non-existing directory is specified as a volume? That would explain why it was created with the default 022 umask.
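That theory is easy to sanity-check locally; just a generic shell demo, not tied to cri-o itself:

```bash
# A plain mkdir under the default 022 umask yields 0755 -- exactly the mode
# that the new etcd permission check complains about.
umask 022
mkdir /tmp/etcd-datadir-demo
stat -c '%a %n' /tmp/etcd-datadir-demo   # prints: 755 /tmp/etcd-datadir-demo
rmdir /tmp/etcd-datadir-demo
```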
Adding the second etcd node failed the same way on reexecution. So this seems likely to be related to the container somehow. I'll look into what is different here this afternoon.

If we can't find the issue, I don't think this vulnerability is significant enough to delay the initial release. The etcd endpoints should only be accessible inside the cluster if a customer has set up their firewall rules / network segmentation following our suggestions in the admin guide; etcd should only be accessible by k8s nodes (or by trusted nodes). Exploiting this vulnerability requires an attacker to take control of the etcd leader in order to send crafted WAL entries, which means access to the SSL certs or local machine access. Those are generally pretty high bars. So, delaying this to a maintenance release rather than calling it a release blocker should be ok.
Looks like this is actually a feature in newer etcd. I looked at an old 4.2.0 cluster:
Note that the first master node created has the right permissions on /var/lib/etcd, but the second and third nodes do not. What's new is that etcd requires the tighter perms. There was an issue created in kubeadm which addressed this behavior a while ago - kubernetes/kubeadm#1308. So, it could be that there's actually a defect (either on our end or in kubeadm) when growing the etcd cluster. This directory permission checking behavior is newly introduced with etcd 3.4.10 in etcd-io/etcd#11798, and is mentioned as a breaking change in the changelog (whoops): https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md
Looks like the mkdir for additional etcd nodes was removed in kubernetes/kubernetes@6bbed9f#diff-0960dc0bb7c3e7113a0daa027856b8f9. Opening upstream issue. |
I guess this is blocked until kubernetes/kubeadm#2256 is resolved or one of us fixes it locally. |
I am happy we could test this before accepting the packages into the build service repo. @pablochacin @davidcassany well done with the "github label to ibs project" link CI feature 👍 |
The failure was reduced to a warning in 3.4.13 at the end of last week, so we can deploy that without waiting for the kubeadm fix/backport or breaking customers. I'll get the IBS request updated.
I'm still having a problem with it not building cleanly after the OBS go module service updates the vendor tarball. I might need some help from someone who knows how go module vendoring works better than I do (which is setting the bar pretty low :D).
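In case anyone wants to poke at it with me, this is roughly what I'm trying locally to mimic what the OBS service does; the directory and archive names here are assumptions, not the actual package layout:

```bash
# Regenerate the vendor tree from go.mod and repack it the way the
# go_modules service would.
tar xf etcd-3.4.13.tar.gz && cd etcd-3.4.13
go mod tidy
go mod vendor
cd .. && tar czf vendor.tar.gz -C etcd-3.4.13 vendor
```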
If I'm not mistaken this is superseded by #1402, so closing this.
Why is this PR needed?
fixes CVE-2020-15106 bsc#1174951 https://github.com/SUSE/avant-garde/issues/1876
Reminder: Add the "fixes bsc#XXXX" to the title of the commit so that it will
appear in the changelog.
What does this PR do?
Fixes a security issue (CVE-2020-15106) in etcd by updating it to 3.4.13.
Anything else a reviewer needs to know?
The packages are in https://build.suse.de/project/show/Devel:CaaSP:4.5:Branches:etcd_3.4.10
Info for QA
This is info for QA so that they can validate this. This is mandatory if this PR fixes a bug.
If this is a new feature, a good description in "What does this PR do" may be enough.
Related info
Info that can be relevant for QA:
Status BEFORE applying the patch
How can we reproduce the issue? How can we see this issue? Please provide the steps and the proof that this issue is not fixed.
**Check the etcd version**
Status AFTER applying the patch
How can we validate this issue is fixed? Please provide the steps and the proof that this issue is fixed.
**Check the etcd version and check test results for regressions**
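A couple of ways QA could check the running version; the pod label and container name below are the usual kubeadm/cri-o defaults, so treat them as assumptions:

```bash
# Which etcd image is each control-plane node actually running?
kubectl -n kube-system get pods -l component=etcd \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}'

# Or directly on a master node:
crictl exec "$(crictl ps --name etcd -q | head -n1)" etcd --version
```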
Docs
If docs need to be updated, please add a link to a PR to https://github.com/SUSE/doc-caasp.
At the time of creating the issue, this PR can be work in progress (set its title to [WIP]),
but the documentation needs to be finalized before the PR can be merged.
The etcd version should be updated in the attributes file:
SUSE/doc-caasp#972
Merge restrictions
(Please do not edit this)
We are in v4-maintenance phase, so we will restrict what can be merged to prevent unexpected surprises: