Regression - etcd datadir permissions not set on etcd grow #2256
Comments
hm, isn't this what kubeadm is doing by default?
do you mean this line? it is still here https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/init/etcd.go#L90 here is the version mapping between kubeadm(k8s) version and etcd: so i guess this will happen if one tries to use kubeadm <1.14 with etcd server > 3.4.10?
There were two mkdirs - the one at bootstrap time (which is still there) and a second one which ran when additional nodes were joined. The referenced cleanup PR removed the second one. Which makes sense in the context of the observed behavior; additional control-plane nodes have the /var/lib/etcd directory created with mode 0755 (the kubernetes default), and looking at some older clusters I have around, they have been doing that for quite some time. Assuming I identified the right PR, it's been happening since 1.14, and it's still happening on the 1.18 cluster I stood up this morning to test the new etcd. It would've also been happening before 3.4.10, just without symptoms, since older etcd versions didn't complain.
ok, so originally a similar fix was added here in the function that creates the static pods for 1.14-pre; later that code was moved (by ereslibre), and then the refactor that you linked indeed omitted it. so yes, we should not let the kubelet create the path with 755, and we should re-add the directory creation with 0700 at this line. thanks for reporting it.
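to sketch the idea (an illustrative Go snippet only, with a placeholder path and error handling; not the exact kubeadm code):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Placeholder path; kubeadm reads the real data dir from its configuration.
	dataDir := "/var/lib/etcd"

	// Create the directory with 0700 before the kubelet gets a chance to
	// create it for the static pod with the default 0755.
	if err := os.MkdirAll(dataDir, 0700); err != nil {
		fmt.Fprintf(os.Stderr, "failed to create etcd data directory %q: %v\n", dataDir, err)
		os.Exit(1)
	}
}
```

note that os.MkdirAll leaves the mode of an already existing directory untouched, which is why the upgrade case comes up further down.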
/help
i don't see this as a critical bug for 1.19 during code freeze. given there is a chance an older version of kubeadm could be used to run etcd 3.4.10, we might as well backport the change to the support skew (1.19, 1.18, 1.17) after 1.19 releases.
@neolit123: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
one problem though is if the user is upgrading an existing cluster that has the 755 data dir. EDIT: ... but i guess this also has to be done on upgrade. removing "help" as this requires more changes.
I also just ran into this problem trying to upgrade an existing etcd cluster from 3.4.9 to 3.4.11. The etcd node refuses to start with:
@frittenlab hi, what k8s/kubeadm version are you using? looks like we might be picking the latest etcd in 1.19, which means we must patch kubeadm too: also:
/milestone v1.19
@neolit123 We are currently using kubeadm / k8s 1.18.8. We use an external etcd cluster which we provisioned ourselves on dedicated local-storage VMs. We added the etcd cluster to kubeadm via the kubeadm-config configmap. I was able to upgrade the etcd cluster after changing the directory permissions of the data directory to 700.
@frittenlab
PR for master is here:
@neolit123 We did follow this guide. We created the cluster over 2 years ago. I can't recall having to change the mode of the data dir then. I added the option to our ansible role for the etcd cluster, and everything works fine now. This was not necessary with previous etcd versions.
ok, in the above PR i've added a chmod 700 in kubeadm init even if the directory already exists, just in case.
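roughly along these lines (an illustrative sketch, not the actual PR code; the path is a placeholder):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	dataDir := "/var/lib/etcd" // placeholder path

	// MkdirAll alone is not enough for the upgrade case: it does not touch
	// the mode of a directory that already exists, so chmod it explicitly.
	if err := os.MkdirAll(dataDir, 0700); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.Chmod(dataDir, 0700); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```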
Thanks, that is a good idea :)
that's only updating the client library, not the server version
ok, i still think that including the fix in 1.19 is a good idea to avoid the case where 1.19 kubeadm is used to deploy the newer (problematic) etcd.
agree
actually, I think @jingyih is working on resolving the regression in etcd ... not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved.
@jingyih in which etcd version are you planning to resolve this? if a fix in etcd is coming, we can just add a note about this in the kubeadm troubleshooting guide and move the pending PR to 1.20.
The etcd behavior simply illuminated a security regression in kubeadm, which is why #1308 was referenced in the report. There's no fix to etcd which undoes that regression (unless etcd changes the datadir permissions itself, which is even more sketchy than kubeadm updating the old incorrect perms). Admittedly, world-readable perms at the top level of the datadir aren't a huge deal in practice, as etcd creates the member subdir with mode 0700 and thus there shouldn't be any information disclosure risk. But leaving the current behavior is going to leave people following the CIS kubernetes benchmark and similar hardening guides with a bit of heartburn when they have to fix it on about every control plane node. Never mind how many people are doing so, given this has gone unnoticed for years. 🤦
FWIW, staying on etcd 3.4.9 means retaining a security hole fixed in 3.4.10 - CVE-2020-15106. That's why I was updating an etcd package to begin with. :) It's also a difficult-to-exploit security hole, but a published CVE nonetheless.
yes, that is why i think kubernetes/kubernetes#94102 is still viable. but while enforcing 700 on existing folders makes sense, it also overrides customized permissions such as 0770.
with the member sub-directory being protected with 0700, it is indeed not that much of a security issue.
we rarely update the default etcd server version (constant) in prior kubeadm releases, but it has happened before. kubeadm makes it possible to deploy a custom etcd image/version, but in later kubeadm versions this comes with the trade-off that etcd upgrades will be skipped when doing "kubeadm upgrade".
Yeah, I'd either leave that out, or check for root/root:0755 and only change it in that case. Surely no one's consciously choosing that. :)
I discussed this with @spzala. We are thinking about providing a warning message instead of enforcing the file permission:
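A rough sketch of what such a warning-only check could look like (illustrative only, not the actual kubeadm or etcd code; the path is a placeholder):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	dataDir := "/var/lib/etcd" // placeholder path

	info, err := os.Stat(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Warn about permissive modes instead of silently overriding a mode
	// the operator may have chosen on purpose (e.g. 0770).
	if mode := info.Mode().Perm(); mode != 0700 {
		fmt.Printf("WARNING: %s has mode %#o, expected 0700\n", dataDir, mode)
	}
}
```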
updated the PR to only create the directory if it does not exist on init/join-control-plane, but not chmod it.
/remove-priority important-soon
kubernetes/kubernetes#94102 merged. closing, thanks.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): 1.14+
Environment: N/A
What happened?
With the release of etcd 3.4.10, the datadir permissions now need to be 0700 or etcd won't start. There was an issue (#1308) where perms were set on join before starting the etcd container as a security control, overriding the default behavior of creating a non-existent directory with mode 0755. However, in a cleanup, that necessary os.MkdirAll was removed. This was transparently ignored for several releases since etcd didn't complain, but with etcd-io/etcd#11798 (in 3.4.10), the new etcd cluster on the second node does not start.
I'm pretty sure this will break anyone on k8s 1.14 or newer who upgrades to etcd 3.4.10 or newer without first fixing the /var/lib/etcd perms.
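For context, a simplified approximation of the kind of startup check etcd 3.4.10 introduced (see etcd-io/etcd#11798 for the real implementation; this is not etcd's actual code, and the path is a placeholder):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	dataDir := "/var/lib/etcd" // placeholder path

	info, err := os.Stat(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Refuse to start if the data directory grants any access to group or others.
	if perm := info.Mode().Perm(); perm != 0700 {
		fmt.Fprintf(os.Stderr, "cannot use %s: permissions %#o, expected 0700\n", dataDir, perm)
		os.Exit(1)
	}
}
```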
What you expected to happen?
/var/lib/etcd (or whatever the var is set to) should be set to 0700. :)
How to reproduce it (as minimally and precisely as possible)?
Join a second master node, then ls -ld /var/lib/etcd on the node, with an etcd 3.4.10 or newer runtime.
Anything else we need to know?
It's worth explicitly noting that the first control plane node added works fine. It's just the second and subsequent nodes, which were handled in a separate location in the code, that exhibit the problem.