WRKLDS-1060: Prevent User Workloads from being scheduled on Control Plane nodes #1583
@knelasevero: This pull request references WRKLDS-1060 which is a valid jira issue. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Gonna transfer some of the comments we had in our private doc here.
Co-authored-by: Flavian Missi <fmissi@redhat.com>, Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
Force-pushed 27e92a2 to 9c3deed.
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
Force-pushed bc0650b to b2dd240.
Mostly just nits
Force-pushed b2dd240 to 68d168b.
The necessary feature for this solution (ValidatingAdmissionPolicy) is already available, even though it is in Tech Preview (it is in beta but disabled by default upstream). The evolution of this solution is tied to the evolution of [ValidatingAdmissionPolicy](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3488-cel-admission-control/README.md) and the decisions to graduate it downstream in OpenShift.

### Tech Preview -> GA
There needs to be a discussion of how to know when all OCP payload namespaces have proper permissions for tolerations.
- a test to ensure all OCP payload namespaces have proper permissions
- a test to ensure all future OCP payload namespaces have proper permissions
- a test to ensure enablement works appropriately
- a test to ensure enablement evicts pods without tolerations
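A sketch of what "proper permissions for tolerations" might look like as RBAC, assuming the custom-verb mechanism this EP mentions (the verb and role names here are illustrative assumptions, not the proposal's final API):

```yaml
# Sketch only: the custom verb and ClusterRole name are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: control-plane-toleration-user
rules:
- apiGroups: [""]
  resources: ["pods"]
  # Hypothetical custom verb that the VAP's CEL expression would check
  # via the authorizer before admitting a pod that tolerates the
  # control plane NoExecute taint.
  verbs: ["tolerate-control-plane-taint"]
```

A payload-namespace test could then assert that the service accounts in every `openshift-*` namespace that needs to run on control plane nodes are bound to such a role.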
This should include information about what must-gather collects, plus a test that ensures it's there. Don't forget the case where a cluster-admin creates their own VAPB and you need to remember to collect it.
Documentation (this would be better as a tool in `oc`) that can inspect an OCP cluster (or an offline audit log) and produce a list of the permissions needed to avoid disruption, with an option to auto-produce a directory of manifests that would apply those permissions.
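For illustration, a manifest such a tool might auto-produce (the namespace, subject, and role names are hypothetical placeholders, not values from the proposal):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-control-plane-toleration   # illustrative
  namespace: example-workload            # hypothetical namespace found in the audit log
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: control-plane-toleration-user    # assumed role granting the custom verb
subjects:
- kind: ServiceAccount
  name: example-sa                       # hypothetical subject seen tolerating the taint
  namespace: example-workload
```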
[APPROVALNOTIFIER] This PR is NOT APPROVED.
@knelasevero: all tests passed!
- As an administrator, I want to restrict non-control-plane workloads from being scheduled on control plane nodes (even those using `.spec.nodeName` in their pods), so that the control plane components are not at risk of running out of resources.
- As an administrator, I want only certain workloads to be allowed to schedule pods on tainted nodes.
- As an administrator, I want to restrict workloads from being scheduled on special node groups (even those using `.spec.nodeName` in their pods), so that those node groups can run only specialized workloads (e.g., GPU-enabled nodes) without obstruction.
I don't think that's a valid use case for this particular problem. With this UC you're entering the territory of deciding which nodes are special, how to manage them, etc. It's rather something that should be a non-goal, since you're targeting only control plane nodes.
We had this previously as a non-goal. We decided this is now a goal(ish), as the mechanism is flexible enough to allow admins to do whatever they want (and instructions on how to do this were explicitly requested by customers). We are providing the VAP focused on control plane node protection (and I agree it is the focus of this proposal), but it is trivial to adapt this workflow to protect any type of special node, with the creation of a VAP + VAPB, a NoExecute taint, and the necessary RBAC permissions with custom verbs.
Either way, I'd still advise against it, even if the mechanism will allow it. The main reason for starting small is that it's easier to ensure the core use cases are handled properly. You're describing this as a goal(ish), which means you're not fully certain about all possible usage patterns, so you're opening a possible can of worms. Keep it simple and start small, with the possibility to extend the mechanism with minor modifications if/when necessary.
That's ok, I can do that
- As an OpenShift developer, I want to understand the impact of the new scheduling restrictions on existing and future workloads, so that I can design applications that comply with cluster policies and make informed decisions about resource requests and deployment strategies.
- As an end user, I want to receive informative feedback when my workloads are rejected due to being repelled from control plane nodes or special node groups, so that I can make the necessary adjustments without needing extensive support from cluster administrators.
- As a security professional, I want to ensure that this pod rejection mechanism is enforceable and auditable, so that I can verify compliance with internal and external regulations regarding resource access and control plane integrity.
- As an administrator, I want the ability to temporarily override scheduling restrictions for emergency or maintenance tasks without compromising the overall security posture of the cluster, ensuring that critical operations can be performed when necessary.
Similarly, this particular UC is out of scope for this proposal. That's what `oc adm cordon` and `oc adm drain` are for.
same as #1583 (comment)
I'll quote here an excerpt from The Zen of Python: "There should be one-- and preferably only one --obvious way to do it." I'm a fan of that kind of approach. Having two distinct ways puts you in the position of needing to address the differences between the two approaches. Again, simple is the key 😉
I also like that excerpt, but I don't think it applies here. Maybe I need to make the text simpler, but not the mechanism we are providing. We are not providing anything similar to cordon or drain here.
I come back to that other comment of mine: maybe I can phrase this better. And I'm sure of it now. I'll do that.
My point is that "temporarily override scheduling restrictions for emergency or maintenance tasks without compromising the overall security posture of the cluster" is just saying that the admin can disable enforcement of the VAP if they need to. So it has nothing to do with cordon or drain. The VAP + VAPB being enforced is, in this context, the restriction being mentioned. Overriding it here just means that folks can disable the VAP enforcement if needed.
I can phrase this better, though.
- Protect Specialized Resources: Restrict scheduling of regular user workloads on nodes with specialized resources (e.g., GPU-enabled nodes) to ensure these costly resources are reserved for appropriate workloads.
- Emergency Override Capability: Allow administrators to temporarily override scheduling restrictions for emergency or maintenance tasks, ensuring critical operations can be performed when necessary without compromising the cluster's overall security posture.
The last two are non-goals, and should rather be handled separately from this task.
Maybe I can phrase this better, but I only mean that admins can disable the very thing we are proposing here, in case they actually need to schedule things on nodes that would be protected if they had enabled this before. This is a goal.
The previous one: same as #1583 (comment)
See my other two answers again.
### Implementation Strategies

Update OpenShift Components for Taint Tolerance: Modify OpenShift operators and control plane components to include tolerations for the NoExecute taint on control plane nodes, which admins can apply to achieve the goals described here.
How will adding a NoExecute taint be different from the NoSchedule taint, which you've mentioned earlier is already added when `mastersSchedulable=true`?
See this discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1709049600594239?thread_ts=1709048784.350439&cid=CKJR6200N
TL;DR: The kubelet already natively enforces "if no toleration, the NoExecute taint kills pod processes" for us.
So we don't need a new admission plugin or a new controller. We can just rely on existing functionality from the scheduler, kubelet, and API server, with ValidatingAdmissionPolicies, NoExecute taints, and RBAC rules with custom verbs.
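To make that kubelet-enforcement point concrete, a sketch of the taint and the toleration involved (the taint key mirrors the existing control plane NoSchedule taint; the exact key used by this EP is an assumption here):

```yaml
# Taint an admin would apply to control plane nodes, e.g.:
#   oc adm taint nodes <node> node-role.kubernetes.io/master=:NoExecute
# Toleration a control plane (or explicitly permitted) pod needs in its
# spec to avoid being rejected, or evicted if already running:
tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoExecute
```

Pods without this toleration are evicted by the kubelet itself once the taint lands, with no scheduler or admission involvement.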
Can you add that explanation to the implementation details? I think that's an important detail which I was missing, hence my question.
Don't forget to explain the difference between this new approach and the current.
I think it is clear with the alternatives section? We had to change the whole proposal from that (NoSchedule with a new admission plugin) to the new mechanism (NoExecute + VAP + RBAC) as the new one is simpler. I can add a note here that points to that section and explains the difference and the justification for the discarded approach.
Would that be ok?
3. Validating Admission Policy Binding:

- As part of this approach a ValidatingAdmissionPolicyBinding is also provided. The Cluster Administrator updates the `validationActions` field to flip the switch and enable actual enforcement. This is the only field where we allow changes, but admins can also create their own VAP/VAPB if they need to. The binding shown here affects all namespaces.
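A minimal sketch of such a binding (names are illustrative; `v1beta1` matches the Tech Preview status mentioned elsewhere in this EP):

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: control-plane-protection        # illustrative name
spec:
  policyName: control-plane-protection  # the VAP shipped with the feature
  # The one field admins flip: ship with ["Audit"] (or ["Warn"]),
  # switch to ["Deny"] to enable actual enforcement.
  validationActions: ["Deny"]
  # No namespaceSelector/objectSelector: the binding applies to all namespaces.
```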
Why would an administrator fiddle with any of the resources created by the operator? Why not expose a high-level API which would allow the administrator to turn the feature on or off?
See discussion: #1583 (comment)
We can revisit this, of course. The idea here was to keep it simple. Customers can already achieve everything described in this EP themselves out of the box by doing the steps manually. We are just providing a standardized way of enabling this, to make supporting it easier.
Also this discussion is relevant: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1713811003793789?thread_ts=1713794008.195589&cid=CC3CZCQHM
I've skimmed both places and neither explicitly moves the responsibility to the cluster admin. I think the intention is to have a high-level API to express these options, while the underlying resources are managed and configured by the operator. The majority of our API, if not all of it, is designed with that in mind: to hide the complexity from cluster-admins.
It's not totally explicit, but we moved from the previous approach to avoid having API changes and a new controller or a new admission plugin. With that in mind, we don't have a place to add a new field or a new API to control this, if the whole idea is to take advantage of existing APIs and existing functionality of the scheduler, API server, and kubelet (only providing VAPs + VAPBs + RBAC to end users and admins).
That's why VAP with CEL expressions was created, right? So we can let admins implement their own rules with CEL expressions, and at most we provide default or standard VAPs when needed (with no new controllers, plugins, or APIs).
@deads2k would you please chime in? Please let me know if I had the wrong interpretation here.
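To ground this thread, a sketch of the kind of CEL rule such a default VAP could carry. Everything below is an assumption for illustration (names, taint key, and expression); the real policy would presumably also consult the authorizer for the custom RBAC verb before denying:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: control-plane-protection  # illustrative
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
      operations: ["CREATE", "UPDATE"]
  validations:
  # Deny pods that tolerate the control plane NoExecute taint.
  - expression: >-
      !(has(object.spec.tolerations) &&
        object.spec.tolerations.exists(t,
          t.key == 'node-role.kubernetes.io/master' &&
          t.effect == 'NoExecute'))
    message: >-
      pods may not tolerate the control plane NoExecute taint
      unless explicitly permitted
```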
This proposal introduces a mechanism to prevent user workloads from being scheduled on control plane nodes in OpenShift clusters, enhancing cluster security and stability. By default, control plane nodes are protected with a NoSchedule taint to avoid user workload allocation. However, there's currently no safeguard against users who explicitly tolerate this taint or bypass the kube-scheduler by setting the .spec.nodeName directly. This enhancement proposes an admission controller to enforce specified policies and configurable policy management for exceptions.
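For context, the `.spec.nodeName` bypass mentioned above looks like this (all names are illustrative); such a pod never passes through the kube-scheduler, so a NoSchedule taint alone cannot stop it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: direct-bind-example        # illustrative
spec:
  nodeName: control-plane-node-1   # binds directly to the node, skipping the scheduler
  containers:
  - name: app
    image: registry.example.com/app:latest  # hypothetical image
```

A NoExecute taint, by contrast, is enforced by the kubelet on the node itself, so even a directly bound pod without the matching toleration is rejected or evicted.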
As an alternative, the proposal also considers using the `NoExecute` taint for control plane nodes, which is recognized by the kubelet and would reject any pod that does not tolerate the taint (rejecting new pods and evicting pods already there). This approach requires no scheduler awareness and ensures a stricter enforcement mechanism. However, it requires that all control plane pods be updated to tolerate the `NoExecute` taint before it is applied to control plane nodes, to prevent unintended disruptions. (This requires coordination with additional teams and maintaining artifacts across multiple repositories.)