Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated Policy-based Disk Encryption on Boot Images #15

Merged
merged 4 commits into from
Sep 24, 2019

Conversation

darkmuggle
Copy link
Contributor

@darkmuggle darkmuggle commented Sep 17, 2019

Starting with OpenShift 4.3, CoreOS boot images will be encryption-ready. On first boot either administator provided or automated policy will be used for the root disk encryption. Bootstrapping and node provisioning MUST fail when the policy cannot be applied unless the administrator has opted out.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 17, 2019
@darkmuggle
Copy link
Contributor Author

/assign @childsb

enhancements/coreos-encrypted-by-default.md Outdated Show resolved Hide resolved
enhancements/coreos-encrypted-by-default.md Outdated Show resolved Hide resolved
enhancements/coreos-encrypted-by-default.md Outdated Show resolved Hide resolved
enhancements/coreos-encrypted-by-default.md Outdated Show resolved Hide resolved
enhancements/coreos-encrypted-by-default.md Outdated Show resolved Hide resolved
@darkmuggle
Copy link
Contributor Author

/cc @miabbott @ashcrow @imcleod

@cgwalters
Copy link
Member

I started writing some code around the ignition+ostree redeployment bit: coreos/ignition-dracut#107

@chancez
Copy link

chancez commented Sep 18, 2019

/unassign @chancez

@bparees bparees removed the request for review from chancez September 18, 2019 15:21
@darkmuggle
Copy link
Contributor Author

I started writing some code around the ignition+ostree redeployment bit: coreos/ignition-dracut#107

As noted in the Ignition Dracut PR, Ignition does not support LUKS creation.

@cgwalters
Copy link
Member

cgwalters commented Sep 18, 2019

Let me try to rephrase/capture my concerns.

I'd like to see this proposal define a very high level way (e.g. via install-config.yaml or via MachineConfig) that admins can configure this. Let's avoid other implementation details that expose to administrators how it's implemented - e.g. rather than documenting a coreos.no_encryption=1 kernel commandline option - we only document an install-config.yaml stanza.

The reason is: this proposal creates yet another substantial difference between FCOS and RHCOS. Further, it creates that divergence as a short-term fix for one point in larger problem space around root filesystem configuration. This proposal is also implicitly saying that root filesystem encryption is e.g. more important than switching RHCOS to Ignition spec 3 and adding LUKS support to Ignition - that may be the case, but we should understand that. (For example, we want to support LUKS on RAID, LVM etc.)

I'd like to ensure that whatever we do here can be transparently upgraded to full Ignition 3 support later.

Now, this gets into my concern around exactly what admins can configure. My 2c on this is that we start from the very simplest: by default, the root filesystem is encrypted with a key that is not accessible after boot. If a TPM is available, we use that. If a TPM is not available...well, we can only easily do hacks - or we do something like have an in-cluster Tang server, i.e. we do NBDE by default. We need to clearly call out the disaster recovery implications of this - we need to make clear, there's no "recovery key" or anything like that. (Or, alternatively we require the admin specify a recovery key?)

We know we want to support external-to-the cluster Tang servers - that's an easy case. But again, let's define what admins write in an install-config.yaml (Or a MachineConfig extension).

@darkmuggle
Copy link
Contributor Author

Unpacking @cgwalters concerns here (which I greatly appreciate!)

I'd like to see this proposal define a very high level way (e.g. via install-config.yaml or via MachineConfig) that admins can configure this. Let's avoid other implementation details that expose to administrators how it's implemented - e.g. rather than documenting a coreos.no_encryption=1 kernel commandline option - we only document an install-config.yaml stanza.

That's fair. I spend the time defining the OS level perspective and I wanted to get the base idea out for review first.

The reason is: this proposal creates a substantial difference between FCOS and RHCOS.

FCOS is for general clusters and container workloads, while RHCOS is single-purpose in running Openshift. Alignment is highly desirable; divergence should be tolerable when it enables key features for that single-use.

Further, it creates that divergence as a short-term fix for one point in larger problem space around root filesystem configuration. This proposal is also implicitly saying that root filesystem encryption is e.g. more important than switching RHCOS to Ignition spec 3 and adding LUKS support to Ignition - that may be the case, but we should understand that. (For example, we want to support LUKS on RAID, LVM etc.)

Calling this a "short-term" does not do the proposal justice. To our users, encrypted data-at-risk is very germane and it can make the difference between adoption. Users are asking for this now and its an adoption blocker. For example, without encryption, FIPS 140-2 compliance cannot happen.

Simply put, our users do not care about Ignition or Ignition specs unless they have to; that's a developer concern. The users care about encryption.

What this proposal does is to us developers give a PATH and TIME at the cost of short-term divergence in on-disk format. This proposal solves the problem in a universal way, and gives developers time to ensure that we do the Ignition work well.

I'd like to ensure that whatever we do here can be transparently upgraded to full Ignition 3 support later.

One of the merits of this idea is that it does not require Ignition or MCO support today; both would be channels for delivering the Cleivs config.

I have spoken to the Ignition maintainers about the migration path and did so specifically before submitting this enhancement. Once we have LUKS support in Ignition, we can use a default Ignition configuration that replicates the behavior. I would argue that since the change to a LUKS container is for the boot image, this gives a path forward since it does not have a surface in Ignition.

Now, this gets into my concern around exactly what admins can configure. My 2c on this is that we start from the very simplest: by default, the root filesystem is encrypted with a key that is not accessible after boot.

By key, I would assume you mean the LUKS passphrase?

The proposal state the key is not accessible after boot without doing per-node spelunking. The random passphrase is provisioned only for the purpose of re-encrypting the LUKS container. After the Clevis binding happens, the key-slot with the random passphrase is destroyed, and the only way to access it is through the token written by Clevis.

If a TPM is available, we use that. If a TPM is not available...well, we can only easily do hacks - or we do something like have an in-cluster Tang server, i.e. we do NBDE by default.

In-cluster Tang was considered but rejected due to the complexity of the idea. How do you deal with a rolling reboot or a power-outage?

By "hack" are you referencing the proposed plain-text pin for Clevis? I would agree its less than perfect -- and the proposal calls out it specifically -- but it solves the situation where a cluster is bootstrapped and then an admin on day N wishes to move to a NBDE model.

We need to clearly call out the disaster recovery implications of this - we need to make clear, there's no "recovery key" or anything like that. (Or, alternatively we require the admin specify a recovery key?)

Under the "Risk and Mitigations," the proposal illustrates how recovery would look if it was booted into the initramfs: clevis-luks-open -d ... In terms of Openshift, though, unless you're in a cluster down the situation, restoring access to the network-bound service (for Tang or KMS) and rebooting should be sufficient in 99% of the cases. Otherwise, doing the Openshift method of reprovisioning the node would be the preferable case.

We know we want to support external-to-the cluster Tang servers - that's an easy case. But again, let's define what admins write in an install-config.yaml (Or a MachineConfig extension).

I work on that tomorrow

@cgwalters
Copy link
Member

Simply put, our users do not care about Ignition or Ignition specs unless they have to; that's a developer concern. The users care about encryption.

MachineConfigs embed Ignition - so Ignition is very much user facing. We want to dig ourselves out of the pit we're in now where searching for Ignition stuff lands on coreos.com and one gets confused about spec 2 vs 3 and whether it's talking about Container Linux or... etc.

I think we definitely want administrators to write Ignition (or MachineConfigs) and finish the vision of supporting re-laying out the rootfs. Let's be sure we're on the same page on that! This document seems to more describe changing Ignition as "out of scope". Agreed to change the wording there to make it clear that e.g. in the future, RHCOS will switch back to unifying with FCOS and not come as a null cipher LUKS by default?

@cgwalters
Copy link
Member

BTW, another reason we want the full Ignition re-partitioning support is that today one can provide an Ignition config which creates /var/lib/etcd as a separate partition for the masters which is a good practice I think (maybe it's on a hardware raid device, separate higher-IOPS EBS volume or whatever). Today one can't easily LUKS encrypt that, and this enhancement in general doesn't have any user story for how one does LUKS for anything except the default /.

@ashcrow
Copy link
Member

ashcrow commented Sep 19, 2019

FCOS being the upstream of RHCOS is super important, but it can't be the reason we don't move forward adding encryption at rest support for OCP.

We want to dig ourselves out of the pit we're in now where searching for Ignition stuff lands on coreos.com and one gets confused about spec 2 vs 3 and whether it's talking about Container Linux or... etc.

100%

I think we definitely want administrators to write Ignition (or MachineConfigs) and finish the vision of supporting re-laying out the rootfs. Let's be sure we're on the same page on that!

I'm not sure we all are on that same page. Allowing (and in some cases encouraging) MC editing was not originally what we wanted admins to do. However, it became a necessity. We also now have fcct as a way to hide some of the complexity.

I'm a bit confused on how writing MCs are strongly linked to re-laying out rootfs. Is this meaning re-formatting? Encryption/re-encrption? etc..

BTW, another reason we want the full Ignition re-partitioning support is that today one can provide an Ignition config which creates /var/lib/etcd as a separate partition for the masters [..]

True. People can do that, but they can do a lot of things with the awesome tool set Linux and OCP provide. Do you see /var/lib/etcd re-partitioning and encryption a requirement for the initial encryption story?

@cgwalters
Copy link
Member

cgwalters commented Sep 19, 2019

I'm not sure we all are on that same page. Allowing (and in some cases encouraging) MC editing was not originally what we wanted admins to do. However, it became a necessity. We also now have fcct as a way to hide some of the complexity.

For sure there's been debate on this. But ultimately we shipped OpenShift 4.1, and the way to configure the host is MachineConfig and it's documented.
(Although now that I look, not very well...the upstream docs are better argh)

fcct as you know only applies to FCOS. However I see MachineConfigs as also increasingly playing the role of FCCT in offering "high level sugar" around Ignition for RHCOS/OpenShift.

I'm a bit confused on how writing MCs are strongly linked to re-laying out rootfs. Is this meaning re-formatting? Encryption/re-encrption? etc..

Well, this hasn't been implemented or designed fully, but the way I was thinking of this is that an administrator should be able to create a MachineConfig object that includes the storage section (disallowed today). If implemented in a basic way, this would cause new workers to spawn using that storage configuration. So yes, if you wanted to e.g. turn on or off LUKS that could be a day-2 operation at least for workers. Implementing it in a more sophisticated way would require machineAPI integration - the MCO would have to say "OK I can't live-update storage, let me ask the machineAPI to reprovision this machine".

@ashcrow
Copy link
Member

ashcrow commented Sep 20, 2019

Approved from my POV 👍

@cgwalters
Copy link
Member

In prow terms,
/approve

There's other edits/improvements to make here, but the broad outline LGTM. We probably need to be sure architects are good with this as well before trying to revise a lot of the details.

@darkmuggle darkmuggle changed the title Automated Policy-based DiskEncryption on Boot Images Automated Policy-based Disk Encryption on Boot Images Sep 20, 2019
@cgwalters cgwalters mentioned this pull request Sep 20, 2019
2 tasks
@darkmuggle
Copy link
Contributor Author

/assign @smarterclayton @imcleod

Copy link
Member

@miabbott miabbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an excellent starting point for the data encryption story. 👍

@miabbott
Copy link
Member

miabbott commented Sep 23, 2019

Since there are explicit requirements on the Installer for supporting this, let's get @abhinavdahiya and @wking aware of this, too.

@ashcrow
Copy link
Member

ashcrow commented Sep 24, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2019
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters, darkmuggle, miabbott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 142e106 into openshift:master Sep 24, 2019
@darkmuggle darkmuggle deleted the rhcos-disk branch September 24, 2019 17:24
cgwalters added a commit to cgwalters/installer that referenced this pull request Oct 30, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
cgwalters added a commit to cgwalters/installer that referenced this pull request Oct 30, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
cgwalters added a commit to cgwalters/installer that referenced this pull request Oct 30, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
cgwalters added a commit to cgwalters/installer that referenced this pull request Oct 31, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
alaypatel07 pushed a commit to alaypatel07/installer that referenced this pull request Nov 11, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
cgwalters added a commit to cgwalters/installer that referenced this pull request Nov 11, 2019
Today for IaaS clouds, we default to an instance type, which
then in turn usually provides a default size.  RHCOS resizes on
boot to that size, distinct from its "default" 16G size.

However, libvirt installs were inheriting our default size.  We'd
like to shrink it because we plan to land encryption:
openshift/enhancements#15
And the less data we need to encrypt, the better.

(In the future I'd like to make this configurable with a variable,
 but let's just prepare for the encryption work now)
alaypatel07 pushed a commit to alaypatel07/installer that referenced this pull request Nov 13, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
cgwalters added a commit to cgwalters/installer that referenced this pull request Nov 15, 2019
Today for IaaS clouds, we default to an instance type, which
then in turn usually provides a default size.  RHCOS resizes on
boot to that size, distinct from its "default" 16G size.

However, libvirt installs were inheriting our default size.  We'd
like to shrink it because we plan to land encryption:
openshift/enhancements#15
And the less data we need to encrypt, the better.

(In the future I'd like to make this configurable with a variable,
 but let's just prepare for the encryption work now)
vrutkovs pushed a commit to vrutkovs/installer that referenced this pull request Dec 2, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
vrutkovs pushed a commit to vrutkovs/installer that referenced this pull request Dec 2, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
Part of: openshift/enhancements#15

We added FIPS to the MCO a while ago:
openshift/machine-config-operator#889

However, during some discussion it became clear that the main
use case for FIPS is "day 1" - it doesn't make sense to turn it
on "day 2" because the standard requires that e.g. long-term key
material was created with FIPS enabled.

Further, it's unlikely that admins will want to turn it *off*
if they ever had it on.

This is a good candidate for an install config.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet