
Support a provisioning token for the Machine Config Server #443

Closed
cgwalters wants to merge 1 commit from the mco-auth-token branch

Conversation

cgwalters (Member)

See openshift/machine-config-operator#784

The Ignition configuration can contain secrets, and we want to avoid it being accessible both inside and outside the cluster.


For installer-provisioned infrastructure, this will be much easier to implement for installations where the `user-data` provided to nodes is owned by the MCO: https://github.com/openshift/enhancements/blob/master/enhancements/machine-config/user-data-secret-managed.md

In order to avoid race conditions, the MCS might support the "previous" token for a limited period of time (e.g. one hour/day).

Contributor

Rotating the token will have implications for the requirement to stage bare metal nodes before delivering them to remote sites for telco customers. If a worker node is installed on day 1 and then takes 3 days to be delivered to a tower site, where it is powered back on and connected to the cluster, the token it carries will have expired.

Member Author

Thanks for looking at this!

The short answer would be "don't rotate the token".

A better fix needs a lot more information on exactly what "stage" means. I think a generally good workflow we want to support in CoreOS is using coreos-installer at this "staging" site, configuring networking etc., and having the "pointer ignition config" ready to fetch the remote target. In this scenario, the staging site would definitely need to use the correct token in the pointer config.

I think the answer is probably just "don't rotate the token while any nodes are being staged".

Particularly for bare metal, as I mention at the end, I think the best solution is clearly TPM2 (e.g. in this case the "staging site" could even communicate the hardware identity of the machine to the target in advance), but that's much more involved.

Contributor

Thanks for looking at this!

The short answer would be "don't rotate the token".

I think that means there would be no IPI for bare metal with remote workers, which would defeat the goal.

A better fix needs a lot more information on exactly what "stage" means. I think a generally good workflow we want to support in CoreOS is using coreos-installer at this "staging" site, configuring networking etc., and having the "pointer ignition config" ready to fetch the remote target. In this scenario, the staging site would definitely need to use the correct token in the pointer config.

I'm not sure what the policy about linking to product requirement tickets in GitHub reviews is, so I'll try to summarize. We need to install the host at one location so that it is configured to join its cluster properly (that may mean the pointer config, or something else, it's unclear). Then some of the hosts will be physically moved to different remote sites, plugged into the network, and they need to rejoin the cluster without any user intervention (we can't assume the person installing the hardware is qualified to do more than rack it and plug it in).

IPI for bare metal does not use coreos-installer, it writes a disk image to a block device, including the pointer ignition configuration. See https://github.com/openshift/enhancements/blob/master/enhancements/baremetal/baremetal.md

It sounds like you're suggesting that during staging the image and ignition pointer should be written to the host, but the host shouldn't be booted until it is at the remote location to preserve the pointer configuration. Is that right? I don't expect that to be acceptable given the usual burn-in and validation requirements for these sorts of use cases.

I think the answer is probably just "don't rotate the token while any nodes are being staged".

Is the lifetime for the automated rotation configurable?

What happens if the token does rotate after a host leaves the staging site and before it is connected back to the cluster? Does the user ship the host back to the staging site to be re-imaged?

Particularly for bare metal, as I mention at the end, I think the best solution is clearly TPM2 (e.g. in this case the "staging site" could even communicate the hardware identity of the machine to the target in advance), but that's much more involved.

Member Author

OK, I had missed that you're talking about IPI metal, not UPI. I think we probably need a whole section on this - the core problem here is that the "automatic IPI" case hinges on the MCO managing the user data, but it's not clear to me that applies in IPI/machineAPI metal (does it?).

In other words, forget about staging for a second - would this enhancement work as described in IPI metal or not?

Then once we have that part figured out...we can think about how we handle rotation.

It might be that for IPI metal, we don't rotate by default - or we add an install-config.yaml option to disable automatic rotation.

Member Author

In other words, forget about staging for a second - would this enhancement work as described in IPI metal or not?

To rephrase; does https://github.com/openshift/enhancements/blob/master/enhancements/machine-config/user-data-secret-managed.md apply in IPI metal today?

Contributor

Metal IPI uses the user-data secret to pass the ignition pointer configuration to the host via cloudinit. So, yes, I think the user-data-secret-managed enhancement applies to metal IPI as well.

Contributor

For metal-ipi we also need to consider the issue of how to access the MCO-managed config from the cluster-api-provider (ref openshift/machine-config-operator#1690) - today we only provide the pointer config via the config-drive, but that means a chicken/egg issue with advanced network configurations.

With this proposal, would we expect a privileged in-cluster controller to be able to access the MCS, or would we instead have to retrieve the underlying MachineConfig directly?

Member Author

I don't think this proposal makes hacking things absent openshift/machine-config-operator#1690 significantly more difficult - a privileged pod could get the token to talk to the MCS too.

(It doesn't help either of course)


## Summary

This enhancement proposes that for user-provisioned installations we optionally support generating a "provisioning token":

Contributor

Unless we have customers with the use case "As an administrator, I want unprivileged pods to be able to bootstrap themselves to cluster-administrator, so that untrusted users and outside attackers can totally pwn me more easily", this should not be optional.

If there are problems with this scheme such that we can't force everyone to use it then I don't think it solves the underlying problem.

Member Author

You're totally right, I have adjusted things to make clear this is on by default.

enhancements/machine-config/auth-token.md
@cgwalters force-pushed the mco-auth-token branch 2 times, most recently from 125a5b1 to 9b733e7 on August 19, 2020 15:53

@runcom (Member)

runcom commented Aug 24, 2020

cc @ericavonb

@derekwaynecarr (Member)

@csrwng please review.

as part of rhcos enablement in ibmcloud managed service, we have discussed putting an auth-proxy in front of ignition server.

@csrwng (Contributor)

csrwng commented Aug 24, 2020

Thanks @derekwaynecarr, this looks to do exactly what we want to do for ibmcloud.
@cgwalters What's the mechanism that would be used to pass the token to the MCS? Is it a header? url param? something else?

@cgwalters (Member Author)

@cgwalters What's the mechanism that would be used to pass the token to the MCS? Is it a header? url param? something else?

This proposal is suggesting a URL parameter; it doesn't require any changes to Ignition. I'd be fine with a header too if we could think of a good reason for that.

(Long term again I think we clearly want a better "node identity" system, so for this particular enhancement let's do the simple thing that will solve the immediate problem across the board)
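
For illustration, a minimal sketch of what checking such a URL-parameter token in the MCS could look like (the handler, the hard-coded secret, and the plain-HTTP listener are assumptions to keep the sketch short, not the actual machine-config-operator code):

```go
package main

import "net/http"

// expectedToken would be read from the in-cluster secret at startup and
// refreshed on rotation; it is hard-coded here only for the sketch.
var expectedToken = "1pY12C5VWKwqeyK4itt8iaFBnqSqwD/sW2lNDezTHPU="

func configHandler(w http.ResponseWriter, r *http.Request) {
	// Ignition would append the token as a query parameter, e.g.
	// GET /config/worker?token=<value>
	token := r.URL.Query().Get("token")
	if token != expectedToken { // later comments suggest a constant-time compare
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	// ...render and serve the Ignition config for the requested pool...
	w.Write([]byte(`{"ignition":{"version":"3.1.0"}}`))
}

func main() {
	http.HandleFunc("/config/", configHandler)
	// The real MCS listens on port 22623 and serves over TLS.
	http.ListenAndServe(":22623", nil)
}
```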

@derekwaynecarr (Member)

for the managed service context, i feel like we may want to manage the secret external to the cluster. we need to work through the details of that before proceeding too much further.

@cgwalters (Member Author)

for the managed service context, i feel like we may want to manage the secret external to the cluster. we need to work through the details of that before proceeding too much further.

Isn't this the same as the UPI case? I suspect externally managed clusters are also probably fine with not rotating the token to start - but if they choose to do so, it'd just be a matter of the external management doing oc -n openshift-config edit secret/machineconfig-auth (a kube resource) and also updating the user-data provided to nodes (however that happens; usually in UPI scenarios it's not a kube resource).
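
A hypothetical manual rotation flow along those lines (the secret name comes from the comment above; the key name `token` and the exact user-data update step are assumptions, not settled API):

```bash
# 1. Generate a new random token.
NEW_TOKEN=$(openssl rand -base64 32)

# 2. Update the in-cluster secret the MCS would read.
oc -n openshift-config create secret generic machineconfig-auth \
  --from-literal=token="${NEW_TOKEN}" \
  --dry-run=client -o yaml | oc apply -f -

# 3. Update the user-data / pointer Ignition served to new nodes so the
#    MCS URL carries ?token=${NEW_TOKEN} (how depends on the platform).
```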

@csrwng (Contributor)

csrwng commented Aug 24, 2020

for the managed service context, i feel like we may want to manage the secret external to the cluster. we need to work through the details of that before proceeding too much further.

@derekwaynecarr I would assume that if we're thinking of running the MCS on the control plane side, the secret would live there as well.

@danwinship (Contributor)

This proposal is suggesting a URL parameter

Hm... no, the original version said

When configured, requests to the Machine Config Server will require this token to be provided via e.g. GET /config/worker&token=<value>.

but that text got removed and now it doesn't discuss how the token is provided.

Anyway, does the MCS still support plain http? And if so, should it stop now to avoid exposing the token (and the secret data in the response)? (Although if an attacker is able to snoop traffic on the node network, then you're probably already well on your way to losing anyway...)

@derekwaynecarr (Member)

@csrwng

keeping the secret on the management plane (external to the user cluster) makes sense, i just wanted to make sure that we can keep that as an option.

@kikisdeliveryservice (Contributor)

@ericavonb (Contributor)

I'm coming around to this proposal. The main advantage is that it doesn't require changes to Ignition, which most of the more complete solutions would. It moves us a bit closer to a more secure design by at least making it harder to grab the config just by having network access.

With that in mind, I want to note some weaknesses so we're clear on what level of security we get with this:

  • Sending a token via a URL param in a GET request isn't usually recommended.
  • We don't have strong guarantees that the machines we supplied the token to are the ones using it at any given time. Some sort of challenge/response authentication method could solve this, but of course that means changes to Ignition.
  • It seems like rotating the token automatically may cause more trouble than it's worth in many cases. I think not rotating is OK; making the token per-cluster is still better than what we have. We can work on TPM2 and other machine attestation for part 2.

Some small suggestions to improve this design further:

  • We should add rate limiting to prevent brute force attacks.
  • HTTPS is a must or you lose most of the benefits.
  • Use a constant-time compare when validating the token (see the sketch after this list).
  • Even if the token needs to be stored in the cluster itself to be provided to new machines, could we use some RBAC to restrict it from even MCO components? Could the MCS store a salted and hashed token instead?
  • Let's be clear in docs etc that this does not bring it to the level necessary to securely supply user data or secrets via ignition. No secrets in ignition should still be the default stance.
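
A minimal sketch of those two suggestions together - the server keeps only a salted digest and compares digests in constant time. The names and the plain SHA-256 construction are illustrative assumptions; a real implementation might prefer a slow KDF:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// hashToken derives a digest from a per-cluster salt and a presented token.
func hashToken(salt, token []byte) []byte {
	h := sha256.New()
	h.Write(salt)
	h.Write(token)
	return h.Sum(nil)
}

// validToken compares digests without leaking information through timing.
func validToken(salt, storedDigest []byte, presented string) bool {
	return subtle.ConstantTimeCompare(hashToken(salt, []byte(presented)), storedDigest) == 1
}

func main() {
	salt := []byte("per-cluster-random-salt") // would be random in practice
	stored := hashToken(salt, []byte("example-token"))
	fmt.Println(validToken(salt, stored, "example-token")) // true
	fmt.Println(validToken(salt, stored, "wrong-token"))   // false
}
```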

@cgwalters (Member Author)

cgwalters commented Sep 1, 2020

Let's be clear in docs etc that this does not bring it to the level necessary to securely supply user data or secrets via ignition. No secrets in ignition should still be the default stance.

That's kind of a root issue because we are injecting secrets in Ignition today (pull secret etc.). I don't have a clear vision for how to get to the point where we're not doing that. Maybe...hmm, I could imagine that we switch to doing "pull through" for images until kubelet is up (and hence the node has joined the cluster and can sync secrets that way) - basically we:

  • inject pull secret into bootstrap node
  • ignition served to control plane doesn't include pull secret
  • instead we run a "pull through" registry on the bootstrap node that is whitelisted to only allow pulling images referenced in the release payload
  • control plane ignition includes an ICSP pointing to bootstrap's pull through
  • OS update happens, reboot
  • ICSP pointing to bootstrap is dropped
  • real pull secret is written
  • Similar flow happens for workers except maybe we just patch the main openshift-registry to be the "pull through" point for workers

So we can design something for the pull secret, but I'm uncertain whether we can require system administrators not to put any other secrets in Ignition.

@ericavonb (Contributor)

That's kind of a root issue because we are injecting secrets in Ignition today (pull secret etc.). I don't have a clear vision for how to get to the point where we're not doing that.

I think we should distinguish the pull secret from other types of secrets; it's not the same as, say, PII or business intelligence. Leaking it has real economic consequences for Red Hat, but we can limit and track any damage, starting with rotating it for that customer. It has a different threat model; the ability to launch pods, even unprivileged ones, gives you the same level of access as the pull secret. When we're talking about in-cluster threats such as here, protecting the pull secret isn't as important.

@cgwalters (Member Author)

It has a different threat model; the ability to launch pods, even unprivileged ones, gives you the same level of access as the pull secret.

I think unfortunately that's not quite true; the pull secret somehow acts as an identifier/login that's connected with cloud.redhat.com. Being able to launch containers pullable with the pull secret doesn't let you see the secret itself.

@romfreiman

romfreiman commented Sep 5, 2020

I'm trying to verify it will not affect the assisted installer flow (seems to me that it should not, assuming no rotation, but want to be sure):

  1. Installer generates the token
  2. It injects it into a secret within the control plane from bootstrap.sh + adds it to the master/worker pointer Ignition configs, e.g. {"source":"https://192.168.126.100:22623/config/worker?token=1pY12C5VWKwqeyK4itt8iaFBnqSqwD/sW2lNDezTHPU="} (a fuller pointer config sketch follows this list)
  3. MCS will verify this token once it gets the pointer request (similar to how it refuses to serve worker Ignition while in bootstrap mode today)

Am I missing anything?
BTW, rotation will break late binding (i.e. a node trying to join a previously established cluster) or will require the token to be updated in the AI service that serves the pointer Ignition.
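
For reference, a complete pointer Ignition config carrying such a token would look roughly like the following (Ignition spec 3 format; the CA value is a placeholder and the token parameter name is the one proposed above):

```json
{
  "ignition": {
    "version": "3.1.0",
    "config": {
      "merge": [
        {
          "source": "https://192.168.126.100:22623/config/worker?token=1pY12C5VWKwqeyK4itt8iaFBnqSqwD/sW2lNDezTHPU="
        }
      ]
    },
    "security": {
      "tls": {
        "certificateAuthorities": [
          { "source": "data:text/plain;charset=utf-8;base64,PLACEHOLDER_ROOT_CA" }
        ]
      }
    }
  }
}
```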

@cgwalters (Member Author)

I'm trying to verify it will not affect the assisted installer flow (seems to me that it should not, assuming no rotation, but want to be sure):

Yep, sounds right to me.

@cgwalters (Member Author)

I discovered GCP has a really nice system for proving instance identity: https://cloud.google.com/compute/docs/instances/verifying-instance-identity

@cgwalters (Member Author)

That said, I am increasingly thinking that we should try to avoid storing the pull secret in Ignition at all, via a design like the one above. It may not be that hard, and it would obviate much of the need for this.

cgwalters added a commit to cgwalters/installer that referenced this pull request Nov 12, 2020
Part of implementing: openshift/enhancements#443

The installer generates a random token (~password) and injects
it into the Ignition pointer configuration and as a secret into the
cluster.

The MCO will check it.
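
As a sketch, the random token generation described in the commit message could look like this (the length and encoding are assumptions, not necessarily what the installer PR does):

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newProvisioningToken returns a random token suitable for injecting into the
// pointer Ignition config and into a cluster secret.
func newProvisioningToken() (string, error) {
	buf := make([]byte, 32) // 256 bits of entropy
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf), nil
}

func main() {
	token, err := newProvisioningToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(token)
}
```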

@cgwalters (Member Author)

openshift/installer#4372

cgwalters added a commit to cgwalters/installer that referenced this pull request Nov 13, 2020
Part of implementing: openshift/enhancements#443

The installer generates a random token (~password) and injects
it into the Ignition pointer configuration and as a secret into the
cluster.

The MCO will check it.

@LorbusChris (Member)

This will also have implications for other parts of OpenShift that request and consume Ignition config from the MCS, such as the WMCO.

@cgwalters (Member Author)

This will also have implications for other parts of OpenShift that request and consume Ignition config from the MCS, such as the WMCO.

I think this is the same as openshift/machine-config-operator#1690 AKA #467

@LorbusChris (Member)

LorbusChris commented Nov 16, 2020

I'm not sure it's exactly that, although it certainly looks related.
Today the Windows machines download the entire Ign config from the MCS (like Ignition does on Linux machines), but they really only consume a small subset of it, like e.g. the kubelet config.

Copying from a comment I made on the internal chat:

Currently the WMCO itself is not aware of the contents of the Ignition config at all, as it is downloaded by the provisioned Windows machine itself. This proposed change might be a strong reason to funnel the Ignition config through the WMCO, making it the component that requests the Ignition config from the MCS and then passes it on to the Windows machines; that way it can react to changes both in the Ignition config contents and in the token that will be required to request the Ignition config in the first place.
Alternatively, the token could be written to the Windows machines by the WMCO, so they can request the Ign config themselves as they are doing now.

@cgwalters (Member Author)

Today the Windows machines download the entire Ign config from the MCS (like Ignition does on Linux machines), but they really only consume a small subset of it, like e.g. the kubelet config.

Right. I think the common thread across all of these, though, is that in-cluster components shouldn't need to hit the MCS - e.g. openshift-ansible switched to getting the rendered MC object. See https://github.com/openshift/openshift-ansible/blob/2cf2144dff1241702a0f2506f6dd66334341a919/roles/openshift_node/tasks/apply_machine_config.yml#L7 and below.


For IPI/machineAPI scenarios, this will be handled fully automatically.

For user-provisioned installations, openshift-install will generate a token internally via the equivalent of

Contributor

As much as possible, OpenShift generates secrets within the cluster so that it doesn't need to worry about securing that information at rest or over the wire. Adding this new functionality would impose a new constraint on administrators (if they want to enjoy the benefit of this mechanism): they would need to run the installation from a secure machine and they would need to securely store the pointer Ignition configs. Without a robust rotation mechanism, that's going to be a pretty tough sell.

Member Author

they would need to run the installation from a secure machine

What does this mean? Surely if the installation machine is compromised in some way then that undermines everything?

and they would need to securely store the pointer Ignition configs.

That is taken care of already in clouds - accessing the user data is a privileged operation and we block it from pods. The in-cluster pointer config is already a secret type.

In bare metal UPI yes; we'd need to document best practices here. In most bare metal UPI the simple fix here would be to just turn off the webserver hosting the pointer config until you need to do provisioning.

Contributor

Anywhere the token is stored, it needs to be secured - the chain is only as strong as its weakest link. If we are generating the token on someone's laptop, then they need to ensure that there is nothing snooping; otherwise, they cannot be sure that the token is safe. If there were a rotation mechanism, we could significantly reduce the risk for this particular vector, but it still doesn't address the other locations where this token is stored. Turning a web server on and off based on when the cluster needs to be scaled is not an option. Not only is that cumbersome and error-prone, but most customers' IT is not structured in a way that would even make this possible.

enhancements/machine-config/auth-token.md (outdated)

See https://github.com/openshift/machine-config-operator/pull/784

The Ignition configuration can contain secrets, and we want to avoid it being accessible both inside and outside the cluster.

Contributor

Can you elaborate on this a bit? It's long been the intention and recommendation that Ignition configs do not contain secrets. This proposes moving in the opposite direction (or at least making it very easy to do so), so I want to understand why you think we should change course.

Member Author

Our Ignition config has contained the pull secret and the kubeconfig from the start in 4.1. There's no course change here.
I've reworked the section to clarify that.

Contributor

There is a change in course though. As I said, the intention from the start has been to remove secrets from the configs. This is a reversal from that goal and I would like to see it justified since it is pretty foundational to the design.


#### Non-Ignition components fetching the Ignition config

As part of the Ignition spec 3 transition, we needed to deal with non-Ignition consumers of the Ignition config, such as [openshift-ansible](github.com/openshift/openshift-ansible/) and Windows nodes. These should generally instead fetch the rendered `MachineConfig` object from the cluster, and not access the MCS port via TCP.

Contributor

Is that how these components work today and if not, is it possible for them to do so? There's a relatively large amount of setup that needs to happen before a new client is able to fetch objects from the API server and I don't want to gloss over that without being sure it won't be a problem.

Member Author

Yes, that's how it works today, see the code around here:

There's a relatively large amount of setup that needs to happen before a new client is able to fetch objects from the API server

Looks like the ansible code accepts a kubeconfig. Things like rhel7 and Windows MCO should also just be able to run as a pod; what's the problem with that?

Contributor

Okay, it's good to see that the existing implementations are compatible. Can you update the wording to reflect that? Specifically, reading "should generally" doesn't give me confidence.


See: [Disaster recovery](https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/about-disaster-recovery.html)

This proposes that the secret for provisioning a node is stored in the cluster itself. But so is the configuration for the cluster, so this is not a new problem.

Contributor

This is not an adequate analysis of the potential failure modes for this design. I would expect this section to talk about the user experience in various failure scenarios (e.g. How would an admin troubleshoot auto-scaling failures due to a token mismatch? How would the auth token be recovered from a control plane which has lost quorum?).

Member Author

How would the auth token be recovered from a control plane which has lost quorum?

I added a section on this:

Note that the token is only necessary when Ignition is run - which is the first boot of a node before it joins the cluster. If a node is just shut down and restarted, access to the token isn't required. Hence there are no concerns if e.g. the whole control plane (or full cluster) is shut down and restarted.

In the case of e.g. trying to reprovision a control plane node, doing that today already requires the Machine Config Server to run, which requires a cluster functioning enough to run it. The auth token doesn't impose any additional burden to that.

Member Author

How would an admin troubleshoot auto-scaling failures due to a token mismatch?

This one was kind of covered below, I reworked it and moved it up.

Troubleshooting

Debugging nodes failing in Ignition today is painful; usually doing so requires looking at the console. There's no tooling or automation around this. We will need to discuss this in a "node provisioning" section in the documentation.

This scenario would be easier to debug if Ignition could report failures to the MCS.


That said, we will need to discuss this in a "node provisioning" section in the documentation, and also link to it from the disaster recovery documentation to ensure that an administrator trying to re-provision a failed control plane node has some documentation if they forgot to update the pointer configuration for the control plane.

This scenario would be easier to debug if [Ignition could report failures to the MCS](https://github.com/coreos/ignition/issues/585).

Contributor

I'm not sure that this is sufficient. In order to safely ingest the reports from Ignition, MCS would need to do authentication. Otherwise, an attacker could attempt to phish with a misleading report or flood the endpoint with bogus reports.

Member Author

Right, part of the idea here is that if we have this auth token we can use it to authenticate failure reports. That clearly doesn't help for token rotation problems, but it would for all the other cases at least (failed os update pivot, broken Ignition partitioning, etc) that we've hit over time.


### Upgrades

This will not apply by default on upgrades, even in IPI/machineAPI managed clusters (to start). However, an administrator absolutely could enable this "day 2". Particularly for "static" user provisioned infrastructure where the new nodes are mostly manually provisioned, the burden of adding the token into the process of provisioning would likely be quite low.

Contributor

Unless there is a solid reason, we should avoid fracturing the fleet of clusters by their initial version. This especially applies to security improvements. We don't need to commit to upgrading everyone into this new scheme in the first iteration, but I would like to see a plan for getting out of the hole before we jump into it.

Member Author

In UPI cases we don't control the pointer config, so there's no way we can automate this right?

We could add some sort of alerting and ask admins to do the switch I guess?

Contributor

so there's no way we can automate this right?

Right, not with this proposal. And that's my concern. I'd prefer a solution that is less hands-on and possible for us to roll out as an upgrade.


#### Move all secret data out of Ignition

This would then gate access on CSR approval, but it introduces bootstrapping problems for the kubelet, such as obtaining the pull secret.

Contributor

Given that this is the current best practice, can you expand on this a bit more? What sort of bootstrapping problems do you envision? Some folks on the installer team had looked into this in the past and had a working PoC, so I don't buy that this is an insurmountable challenge.

Member Author

Well for one thing we apply OS updates before starting kubelet - https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md
And the MCO has systemd units that pull other containers (e.g. coredns) on certain platforms etc.

Trying to move entirely away from pulling container images before kubelet would seem to require a model where we run all containers via kubelet - seems possible but needs design. The kubelet itself currently wants to pull a "pause" image although maybe only in non-hostNetwork: true cases?

A lot of unknowns down this path - I'd have no problem if someone else wanted to try driving this direction though.


#### Encrypt secrets in Ignition

Similar to above, we have a bootstrapping problem around fetching the key.

Contributor

Do we? As long as the secrets aren't required to gain access to the API server, we could protect the decryption key behind the CSR process. Can you expand on this alternative?

Member Author

It seems equivalent to the above - the pull secret is a secret so we'd need to solve all the problems there.


Similar to above, we have a bootstrapping problem around fetching the key.

#### Improved node identity

Contributor

I don't follow how this is an alternative to the proposal. I agree we should do more to attest, but I don't see how this protects the Ignition configs served by the MCS.

Member Author

Added:

In this model we'd change the Ignition process to require the node to prove its identity, and the MCS would verify that.

See https://cloud.google.com/compute/docs/instances/verifying-instance-identity#verifying
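
For a flavor of what that verification could look like on a cloud platform, here is a rough sketch using the linked GCP instance identity mechanism (the idtoken package usage and the audience value are assumptions for illustration, not part of this proposal):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/api/idtoken"
)

func main() {
	ctx := context.Background()
	// presentedToken would be the signed JWT the node fetched from its
	// metadata server and sent along with its Ignition request.
	presentedToken := "eyJhbGciOi..." // placeholder

	payload, err := idtoken.Validate(ctx, presentedToken, "https://mcs.example.internal")
	if err != nil {
		log.Fatalf("instance identity token rejected: %v", err)
	}
	// The verified claims include instance details (project, zone, instance ID)
	// that the verifier could check against the expected Machine object.
	fmt.Println(payload.Subject, payload.Claims["google"])
}
```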

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cgwalters
To complete the pull request process, please assign frobware after the PR has been reviewed.
You can assign the PR to them by writing /assign @frobware in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@crawford (Contributor)

This enhancement is too light on detail for how much trouble it could potentially cause. We need to think about this holistically, as it never ends well when security is bolted on after the fact.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2020

@cgwalters (Member Author)

This enhancement is too light on detail for how much trouble it could potentially cause. We need to think about this holistically, as it never ends well when security is bolted on after the fact.

OK, but we need some path forward here, right? Of these, which is closest to what you're saying?

  1. This isn't a big problem and we can continue to ignore it for a while
  2. I don't like this and want to do something else but I'm not sure what
  3. I don't like this and would prefer someone investigate the "kubelet only" path
  4. This enhancement mostly sounds like it will work but needs tweaks/clarifications

@cgwalters (Member Author)

cgwalters commented Nov 25, 2020

OK I thought about the "kubelet only" path a bit more and had an idea: What if we provided a "synthetic" pull secret via Ignition that only allowed pulling the default container images from the in-cluster registry? Then once kubelet joins, the "real" pull secret is pulled from the cluster and installed via say the MCD and kubelet live-updated to use it.

That addresses the main secret (pull secret); though we'd still eventually need to disallow anything that can access Ignition from being able to spam CSRs - though solving that problem I think would be best done with something like cloud node identity systems/TPMs.

We're already blazing the trail of live updates in the MCO, and another benefit is that this would force us to implement "pull through" in our registry, which I suspect would be a big win.

@crawford (Contributor)

What if we provided a "synthetic" pull secret via Ignition that only allowed pulling the default container images from the in-cluster registry? Then once kubelet joins, the "real" pull secret is pulled from the cluster and installed via say the MCD and kubelet live-updated to use it.

That was roughly the direction we were moving. The plan was to teach the kubelet how to do the CSR flow and fetch the pull secret before it attempted to pull any containers. If possible, I'd like to leave the Machine Config components out of this flow (fewer moving parts, less complexity).

That addresses the main secret (pull secret); though we'd still eventually need to disallow anything that can access Ignition from being able to spam CSRs - though solving that problem I think would be best done with something like cloud node identity systems/TPMs.

What's the concern around spamming the CSR endpoint? It's a given that we need a smarter approver (i.e. identity and attestation systems, like you mentioned), but is that not sufficient?

@cgwalters (Member Author)

Pre-enhancement: Move secrets from MCS Ignition to pointer config

This is an alternative to #443

Background

In order for CoreOS nodes to join the cluster, we need to provide Ignition that contains at least:

  • Stock OpenShift infrastructure (kubelet systemd unit file)
  • Per-cluster configuration (config files with domain names, etc.)
  • Pull secret to download container images
  • Bootstrap kubeconfig

Problem statement

This Ignition is provided on port 22623, which is currently blocked by the OpenShift SDN - but that blocking isn't implemented by 3rd-party SDN layers. It may be possible to access the MCS endpoint from outside the cluster as well; while our default cloud IPI deployments use firewalling to block ingress to it, not all UPI deployments do so.

We particularly want to avoid leaking the pull secret outside the cluster because it is tied to the cluster provisioner's Red Hat account.

It'd also be nice to avoid leaking the bootstrap kubeconfig; while the CSR approval mechanism is reasonably secure, it's clearly better to avoid scenarios where a malicious actor could even get to the point of creating CSRs.

Proposed solution

Move the pull secret and bootstrap kubeconfig into the "pointer Ignition". In all cloud environments that we care about, access to the "user data" in which the pointer Ignition is stored is secured. It can't be accessed from outside the cluster because it's served from a link-local IP address, and the OpenShift SDN blocks it from the pod network (see references).
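
A rough sketch of what such a pointer Ignition could look like (the hostname, the base64 contents, and the choice of /var/lib/kubelet/config.json as the pull secret path are illustrative assumptions):

```json
{
  "ignition": {
    "version": "3.1.0",
    "config": {
      "merge": [
        { "source": "https://api-int.example.com:22623/config/worker" }
      ]
    }
  },
  "storage": {
    "files": [
      {
        "path": "/var/lib/kubelet/config.json",
        "mode": 384,
        "contents": {
          "source": "data:text/plain;charset=utf-8;base64,PLACEHOLDER_PULL_SECRET"
        }
      }
    ]
  }
}
```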

Risks and mitigations

In UPI scenarios, particularly bare metal and vSphere-type UPI that often involve a manually set up webserver, we do not currently require confidentiality for the pointer configuration. This would need to change - it would require documentation.

Notes and references

  1. OpenShift SDN blocking link local addresses: openshift/origin@9a9f30f
    The latest version of which currently lives here https://github.com/openshift/sdn/blob/bba15e2d344a6729d5aa7ac7d1ec14d2022219ab/cmd/sdn-cni-plugin/openshift-sdn_linux.go#L129
  2. RHCOS IMDSv2: RHBZ#1899220

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 1, 2021

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 31, 2021

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
