
WRKLDS-1060: Prevent User Workloads from being scheduled on Control Plane nodes #1583

Open · wants to merge 21 commits into master from WRKLDS-1060-prevent-workloads-control-plane-nodes

Conversation


@knelasevero knelasevero commented Mar 5, 2024

This proposal introduces a mechanism to prevent user workloads from being scheduled on control plane nodes in OpenShift clusters, enhancing cluster security and stability. By default, control plane nodes are protected with a NoSchedule taint to avoid user workload allocation. However, there is currently no safeguard against users who explicitly tolerate this taint or who bypass the kube-scheduler by setting .spec.nodeName directly. This enhancement proposes an admission controller to enforce the specified policies, along with configurable policy management for exceptions.

As an alternative, the proposal also considers using the NoExecute taint for control plane nodes, which is recognized by the kubelet and would reject any pod that does not tolerate the taint (rejecting new pods and evicting pods already there). This approach requires no scheduler awareness and provides a stricter enforcement mechanism. However, it requires that all control plane pods be updated to tolerate the NoExecute taint before it can be applied to control plane nodes, to prevent unintended disruptions (this requires coordination with additional teams and maintaining artifacts across multiple repositories).
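
To make the gap concrete, here is a minimal sketch (the pod name, namespace, node name, and image are hypothetical) of a pod that slips past the default protection today, either by tolerating the NoSchedule taint or by pinning itself with `.spec.nodeName` and skipping the kube-scheduler entirely:

```yaml
# Hypothetical example of the bypass this proposal guards against.
apiVersion: v1
kind: Pod
metadata:
  name: user-workload          # hypothetical name
  namespace: user-project      # hypothetical namespace
spec:
  # Pinning the node skips the kube-scheduler, so the NoSchedule taint is never evaluated.
  nodeName: master-0           # hypothetical control plane node name
  # Alternatively, an explicit toleration lets the scheduler place the pod on a tainted node.
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical image
```

With the NoExecute alternative, the kubelet itself would refuse to run (or would evict) such a pod unless it also tolerated that taint, which is why all control plane pods would first need the toleration.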

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 5, 2024

openshift-ci-robot commented Mar 5, 2024

@knelasevero: This pull request references WRKLDS-1060 which is a valid jira issue.


@knelasevero (Author)

Gonna transfer some of the comments we had in our private doc here

@openshift-ci openshift-ci bot requested review from enxebre and tkashem March 5, 2024 14:07

@knelasevero knelasevero marked this pull request as draft March 5, 2024 14:13
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2024
Co-authored-by: Flavian Missi <fmissi@redhat.com>
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
@knelasevero knelasevero force-pushed the WRKLDS-1060-prevent-workloads-control-plane-nodes branch from 27e92a2 to 9c3deed Compare March 5, 2024 14:25

knelasevero and others added 2 commits March 6, 2024 08:12
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
@knelasevero knelasevero force-pushed the WRKLDS-1060-prevent-workloads-control-plane-nodes branch 2 times, most recently from bc0650b to b2dd240 Compare March 14, 2024 14:34
@ingvagabund ingvagabund (Member) left a comment

Mostly just nits

@knelasevero knelasevero force-pushed the WRKLDS-1060-prevent-workloads-control-plane-nodes branch from b2dd240 to 68d168b Compare March 27, 2024 09:48
@knelasevero knelasevero marked this pull request as ready for review March 27, 2024 12:27
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2024

The necessary feature for this solution (ValidatingAdmissionPolicy) is already available, even though it is in Tech Preview (it is in beta but disabled by default upstream). The evolution of this solution is tied to the evolution of [ValidatingAdmissionPolicy](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3488-cel-admission-control/README.md) and the decisions to graduate it downstream in OpenShift.

### Tech Preview -> GA
Contributor

There needs to be a discussion of how to know when all OCP payload namespaces have proper permissions for tolerations.

  1. A test to ensure all OCP payload namespaces have proper permissions.
  2. A test to ensure all future OCP payload namespaces have proper permissions.
  3. A test to ensure enablement works appropriately.
  4. A test to ensure enablement evicts pods without tolerations.


Contributor

Should include information about what information must-gather collects and a test that ensures it's there. Don't forget the case where a cluster-admin creates their own VAPB and you need to remember to collect it.


Contributor

Documentation (this would be better as a tool in oc) that can inspect an OCP cluster (or an offline audit log) and produce a list of the permissions needed to avoid disruption, with an option to auto-generate a directory of manifests that would apply those permissions.

Contributor

openshift-ci bot commented Jun 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Jun 19, 2024

@knelasevero: all tests passed!



- As an administrator, I want to restrict non control plane workloads from being scheduled on control plane nodes (even those using `.spec.nodeName` in their pods), so that the control plane components are not at risk of running out of resources.
- As an administrator, I want only certain workloads to be allowed to schedule pods on tainted nodes.
- As an administrator, I want to restrict workloads from being scheduled on special node groups (even those using `.spec.nodeName` in their pods), so that those node groups can only run specialized workloads (e.g., GPU-enabled nodes) without obstruction.
Member

I don't think that's a valid use-case for this particular problem. With this UC you're entering a territory of deciding which nodes are special, how to manage these, etc. It's rather something that should be a non-goal, since you're targeting only control plane nodes.

@knelasevero knelasevero (Author) Jul 2, 2024

We had this previously as a non-goal. We decided this is now a goal(ish), as the mechanism is flexible enough to allow admins to do whatever they want (and instructions on how to do this were explicitly requested by customers). We are providing the VAP focused on control-plane node protection (and I agree it is the focus of this proposal), but it is trivial to adapt this workflow to protect any type of special node, with the creation of VAP+VAPB, NoExecute taint and necessary RBAC permissions with custom verbs.

Member

Either way, I'd still advise against it, even if the mechanism will allow it. The main reason for starting small is that it's easier to ensure the core use cases are handled properly. You're mentioning this being goal(ish), which means you're not fully certain about all possible usage patterns, so you're opening a possible can of worms. Keep it simple, and start small, with the possibility to extend the mechanism with minor modifications if/when necessary.

@knelasevero knelasevero (Author) Jul 15, 2024

That's ok, I can do that

- As an OpenShift developer, I want to understand the impact of the new scheduling restrictions on existing and future workloads, so that I can design applications that comply with cluster policies and make informed decisions about resource requests and deployment strategies.
- As an end user, I want to receive informative feedback when my workloads are rejected due to being repelled from control plane nodes or special node groups, so that I can make the necessary adjustments without needing extensive support from cluster administrators.
- As a security professional, I want to ensure that this pod rejection mechanism is enforceable and auditable, so that I can verify compliance with internal and external regulations regarding resource access and control plane integrity.
- As an administrator, I want the ability to temporarily override scheduling restrictions for emergency or maintenance tasks without compromising the overall security posture of the cluster, ensuring that critical operations can be performed when necessary.
Member

Similarly, this particular UC is out-of-scope for this proposal. That's what oc adm cordon and oc adm drain are for.

@knelasevero knelasevero (Author) Jul 2, 2024

same as #1583 (comment)

Member

I'll quote here an excerpt from The Zen of Python: There should be one-- and preferably only one --obvious way to do it. - I'm a fan of that kind of approach. Having two distinct ways puts you in the position that you'll need to address differences between the two approaches. Again, simple is the key 😉

@knelasevero knelasevero (Author) Jul 15, 2024

I also like that excerpt, but I don't think it applies here. Maybe I need to make the text simpler, but not the mechanism we are providing. We are not providing anything similar to cordon or drain here.

I come back to that other comment of mine: maybe I can phrase this better. And I'm sure of it now. I'll do that.

My point is that

temporarily override scheduling restrictions for emergency or maintenance tasks without compromising the overall security posture of the cluster

is just saying that the admin can disable the enforcement of the VAP if they need to. So it has nothing to do with cordon or drain. The VAP + VAPB being enforced is the restriction being mentioned in this context. Overriding the approach here just means that folks can disable the VAP enforcement if needed.

I can phrase this better though.


- Protect Specialized Resources: Restrict scheduling of regular user workloads on nodes with specialized resources (e.g., GPU-enabled nodes) to ensure these costly resources are reserved for appropriate workloads.

- Emergency Override Capability: Allow administrators to temporarily override scheduling restrictions for emergency or maintenance tasks, ensuring critical operations can be performed when necessary without compromising the cluster's overall security posture.
Member

The last two are non-goals, and should rather be handled separately from this task.

@knelasevero knelasevero (Author) Jul 2, 2024

Maybe I can phrase this better, but I only mean admins can disable the very thing we are proposing here, in case they actually need to schedule stuff on nodes that would be protected if they had this enabled before. This is a goal.

The previous one is the same as #1583 (comment)

Member

See my other two answers again.


### Implementation Strategies

Update OpenShift Components for Taint Tolerance: Modify OpenShift operators and control plane components to include tolerations for the NoExecute taint, which admins can then apply to control plane nodes to achieve the goals described here.
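
As an illustration of that strategy, a hedged sketch of the pod-template fragment each control plane component would need before the taint could safely be applied (the taint key shown is the conventional control plane key; whether the proposal uses exactly this key is an assumption here):

```yaml
# Fragment of a control plane component's workload (e.g. a Deployment) pod template.
spec:
  template:
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoExecute
```
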
Member

How will adding a NoExecute taint be different from the NoSchedule taint, which you've mentioned earlier is already added when mastersSchedulable=true?

@knelasevero knelasevero (Author) Jul 2, 2024

See this discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1709049600594239?thread_ts=1709048784.350439&cid=CKJR6200N

TL;DR The kubelet already natively enforces "if no toleration, the no-execute taint kills pod processes" for us.

So we don't need a new admission plugin or a new controller. We can just rely on existing functionality from the scheduler, kubelet, and API server, together with ValidatingAdmissionPolicies, NoExecute taints, and RBAC rules with custom verbs.
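
For context, a minimal sketch of what such a ValidatingAdmissionPolicy could look like, assuming the admissionregistration.k8s.io/v1 API; the policy name, the exact CEL expression, and the custom RBAC verb consulted through the authorizer are illustrative assumptions, not the manifests this proposal would ship:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-control-plane-tolerations    # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  # Allow the pod if it does not tolerate the control plane taint, or if the
  # requester is authorized via a hypothetical custom verb on the namespace.
  # A real policy would also need to handle blanket tolerations (operator: Exists with no key).
  - expression: >-
      !(has(object.spec.tolerations) && object.spec.tolerations.exists(t,
      has(t.key) && t.key == 'node-role.kubernetes.io/master' &&
      (!has(t.effect) || t.effect == 'NoExecute'))) ||
      authorizer.group('').resource('namespaces').name(object.metadata.namespace).check('tolerate-control-plane-taint').allowed()
    message: Pods may not tolerate the control plane taint unless the requester is explicitly permitted to.
```

In this sketch the policy only stops pods from acquiring the toleration; the eviction of non-tolerating pods still comes from the kubelet honoring the NoExecute taint, as described above.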

Member

Can you add that explanation in the implementation details? I think that's an important detail which I was missing, hence my question.

Member

Don't forget to explain the difference between this new approach and the current one.

@knelasevero knelasevero (Author) Jul 15, 2024

I think it is clear with the alternatives section? We had to change the whole proposal from that (NoSchedule with a new admission plugin) to the new mechanism (NoExecute, VAP + RBAC) as the new one is simpler. I can add a note here that points to that and explains the difference and the justification for the discarded approach.

Would that be ok?


3. Validating Admission Policy Binding:

- As part of this approach a ValidatingAdmissionPolicyBinding is also provided. The Cluster Administrator updates the `validationActions` field to flip the switch and enable actual enforcement. This is the only field where we allow changes, but admins can also create their own VAP/VAPB process if they need to. The binding shown here affects all namespaces.
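
A hedged sketch of what that binding could look like (all names and the initial validationActions values are assumptions; the default shipped by OpenShift may differ):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: restrict-control-plane-tolerations-binding   # hypothetical name
spec:
  policyName: restrict-control-plane-tolerations     # hypothetical; must match the VAP
  # Could ship in a non-enforcing mode; the cluster administrator edits this field
  # (e.g. to ["Deny"]) to flip the switch and enable actual enforcement.
  validationActions: ["Audit", "Warn"]
  # No matchResources narrowing here, so the binding affects all namespaces.
```
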
Member

Why would an administrator fiddle with any of the resources created by the operator? Why not expose a high-level API which would allow the administrator to turn the feature on or off?

Author

See discussion: #1583 (comment)

We can revisit this, of course. The idea here was to keep it simple. Customers can actually already achieve everything described in this EP themselves out of the box, by doing the steps manually. We are just providing a standardized way of enabling this, to make supporting it easier.

Also this discussion is relevant: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1713811003793789?thread_ts=1713794008.195589&cid=CC3CZCQHM

Member

I've skimmed both places and neither explicitly moves the responsibility to the cluster admin. I think the intention is to have a high-level API to express these options, while the underlying resources are managed and configured by the operator. The majority of our API, if not all of it, is designed with that in mind, to hide the complexity from cluster-admins.

@knelasevero knelasevero (Author) Jul 15, 2024

It's not totally explicit, but we moved away from the previous approach to avoid API changes and a new controller or new admission plugin. With that in mind, we don't have a place to add a new field or a new API to control this, since the whole idea is to take advantage of existing APIs and existing functionality of the scheduler, API server, and kubelet (only providing VAP + VAPBs + RBAC to end users and admins).

That's why VAP with CEL expressions was created, right? So we can let admins implement their own rules with CEL expressions, and at most we can provide default or standard VAPs when needed (with no new controllers, plugins, or new APIs).
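
To illustrate the RBAC side mentioned here, a hedged sketch of how an exception could be granted with a custom verb; the verb, resource, and all names are hypothetical and only need to line up with whatever the VAP's authorizer check looks for:

```yaml
# Hypothetical ClusterRole granting a custom verb that a VAP authorizer check could consult.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tolerate-control-plane-taint          # hypothetical name
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  resourceNames: ["privileged-project"]       # hypothetical namespace to exempt
  verbs: ["tolerate-control-plane-taint"]     # hypothetical custom verb
---
# Bind it to whoever should be allowed to run workloads on the protected nodes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tolerate-control-plane-taint-ops      # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tolerate-control-plane-taint
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: cluster-operations                    # hypothetical group
```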

Author

@deads2k would you please chime in? Please let me know if I had the wrong interpretation here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2024