
[proposal] add a schedule plugin that supports pod expansion and shrinking according to the order of defined logical node sets #475

Open
fjding opened this issue Jan 7, 2023 · 29 comments

Comments

@fjding

fjding commented Jan 7, 2023

Kubernetes has supported pod-deletion-cost since v1.21. In my cloud scenario, users have demands like these:
1. Define multiple logical node sets; a Deployment workload scales out pods according to the node set order and shrinks in the opposite order.
2. At the same time, it also supports a maximum number of schedulable pods per node set.
BTW, I have implemented this feature and want to contribute it to the community. I hope everyone can discuss it together.
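For context, here is a minimal sketch of the pod-deletion-cost mechanism mentioned above, which is what makes reverse-order shrinking possible: the ReplicaSet controller prefers to delete pods with a lower controller.kubernetes.io/pod-deletion-cost value first. The pod name, image, and cost value below are illustrative only.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod                # hypothetical name
  annotations:
    # Lower cost = deleted earlier when the owning ReplicaSet scales in
    # (available since Kubernetes v1.21). A plugin could set a higher cost
    # on pods in earlier node sets so they are removed last, i.e. shrink
    # in the reverse of the scale-out order.
    controller.kubernetes.io/pod-deletion-cost: "100"
spec:
  containers:
  - name: app
    image: nginx                   # placeholder image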

@fjding fjding changed the title add a schedule plugin that support pod expands and shrinks according to the order of the defined logical node set [proposal] add a schedule plugin that support pod expands and shrinks according to the order of the defined logical node set Jan 7, 2023
@KunWuLuan
Contributor

My company also has a similar plugin. We can find a time to discuss it.

@fjding
Author

fjding commented Apr 3, 2023

My company also has a similar plugin. We can find a time to discuss it.

Hi, we can collaborate on this proposal.

@fjding
Author

fjding commented Apr 3, 2023

@Huang-Wei @ffromani @seanmalloy @denkensk
Could you take a look at this proposal? We can discuss whether we need to create a KEP.

@ffromani
Contributor

ffromani commented Apr 3, 2023

@Huang-Wei @ffromani @seanmalloy @denkensk Could you take a look at this proposal? We can discuss whether we need to create a KEP.

I'll have a look later this week (the week beginning April 3, 2023).

@Huang-Wei
Contributor

It would help us understand the motivation(s) if you could elaborate on the real-world use cases.

1. Define multiple logical node sets; a Deployment workload scales out pods according to the node set order and shrinks in the opposite order.

What do you mean by "node set order"? Is that a priority field of the NodeSet CR?

How are a Deployment's replicas expected to be scheduled onto the matching NodeSets? And is the scheduling directive a hard or soft constraint?

2. At the same time, it also supports a maximum number of schedulable pods per node set.

Where is this maximum defined?

@fjding
Author

fjding commented Apr 3, 2023

It would help us understand the motivation(s) if you could elaborate on the real-world use cases.

1. Define multiple logical node sets; a Deployment workload scales out pods according to the node set order and shrinks in the opposite order.

What do you mean by "node set order"? Is that a priority field of the NodeSet CR?

How are a Deployment's replicas expected to be scheduled onto the matching NodeSets? And is the scheduling directive a hard or soft constraint?

2. At the same time, it also supports a maximum number of schedulable pods per node set.

Where is this maximum defined?

Hi, thank you for your attention!
The motivation:
In cloud scenarios, some users prefer to use ECS instances first. When ECS capacity is insufficient, they consider using elastic containers such as Alibaba Cloud's ECI, because the cost of using ECS is lower than the cost of ECI.
@KunWuLuan, can you add your usage scenarios?

We will define a CRD named ResourcePolicy; its CR instance is as follows:
[image: example ResourcePolicy CR with an ecs-pool unit and an eci-pool unit]

Because ecs-pool is ranked before eci-pool, pods will be scheduled to ecs-pool first. If the number of pods scheduled into ecs-pool would exceed 100, the extra pods will be scheduled to eci-pool.
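Since the screenshot above is not reproduced here, the following is only a hypothetical sketch of what such a ResourcePolicy CR might look like for the behaviour just described; the API group, unit names, and the maxReplicas field are assumptions based on this discussion, not the final spec.

apiVersion: scheduling.example.com/v1alpha1   # placeholder group/version
kind: ResourcePolicy
metadata:
  name: ecs-first
  namespace: demo
spec:
  selector:
    app: my-app            # assumed label on the Deployment's pods
  strategy: prefer
  units:
  - name: ecs-pool         # ranked first, so pods land here first
    maxReplicas: 100       # at most 100 pods in this unit
    nodeSelector:
      pool: ecs
  - name: eci-pool         # overflow once ecs-pool holds 100 pods
    nodeSelector:
      pool: eci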

@KunWuLuan
Contributor

KunWuLuan commented Apr 3, 2023

In our company's scenario, customers will deploy both spot instances and pay-as-you-go instances simultaneously. Customers want their business to run on spot instances first to save costs, and when spot instance resources are insufficient, they will run on pay-as-you-go instances. Moreover, during business peak periods, when neither type of instance has resources, the business Pod will be scheduled to ECI nodes.
In this case, they will deploy a ResourcePolicy as follows:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: xxx
  namespace: xxx
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      type: spot
  - resource: ecs
    nodeSelector:
      type: pay-as-you-go
  - resource: eci

@Huang-Wei
Contributor

It seems @KunWuLuan is talking about Alibaba Cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (The design of maxReplicas is a bit strange, though.)

I'm open to hosting an abstracted version in scheduler-plugins.

BTW, I'm not sure how you implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the SIG meetings: https://youtu.be/UhZBkFamoAg?t=1694

cc @denkensk

@denkensk
Member

denkensk commented Apr 4, 2023

It seems @KunWuLuan is talking about Alibaba Cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (The design of maxReplicas is a bit strange, though.)

Hmmm, I know it. Actually, I am the author of this feature in Alibaba Cloud 😄. It took me a long time to think of the name ResourcePolicy 😄 @fjding Did you reference this implementation before?

@denkensk
Member

denkensk commented Apr 4, 2023

If the number of pods scheduled into ecs-pool would exceed 100, the extra pods will be scheduled to eci-pool.

Can you introduce your scenario for this? And why do you need to schedule 100 pods to ecs-pool first? @fjding

@denkensk
Member

denkensk commented Apr 4, 2023

BTW, I'm not sure how you implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the SIG meetings: https://youtu.be/UhZBkFamoAg?t=1694

Your comment is very useful for a real production environment. I also care about the efficiency and memory usage if we need to memorize some history or state beforehand. @Huang-Wei

@fjding
Author

fjding commented Apr 4, 2023

It seems @KunWuLuan is talking about Alibaba Cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (The design of maxReplicas is a bit strange, though.)

I'm open to hosting an abstracted version in scheduler-plugins.

BTW, I'm not sure how you implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the SIG meetings: https://youtu.be/UhZBkFamoAg?t=1694

cc @denkensk

The proposal I provided is being used on ByteDance's Volcano Engine, and its design was inspired by Alibaba Cloud's implementation. However, I personally think that maxReplicas is very useful, as in the following scenario:

[image: a Deployment spread across multiple AZs, each served by a virtual kubelet]

A cluster has multiple AZs (Availability Zones), and each AZ has a VK (virtual kubelet). Users expect a Deployment's Pods to be distributed across the AZs in a certain proportion.
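To make the multi-AZ case concrete, a hypothetical ResourcePolicy (same caveats as above: names and values are illustrative) could cap each AZ unit so that a 30-replica Deployment ends up with 20 pods in one AZ and 10 in another, with any further replicas overflowing to a third unit:

apiVersion: scheduling.example.com/v1alpha1   # placeholder group/version
kind: ResourcePolicy
metadata:
  name: az-proportion
  namespace: demo
spec:
  selector:
    app: my-app
  strategy: prefer
  units:
  - name: az-a
    maxReplicas: 20        # first 20 pods
    nodeSelector:
      topology.kubernetes.io/zone: az-a
  - name: az-b
    maxReplicas: 10        # next 10 pods
    nodeSelector:
      topology.kubernetes.io/zone: az-b
  - name: az-c             # overflow beyond 30 pods
    nodeSelector:
      topology.kubernetes.io/zone: az-c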

@fjding
Author

fjding commented Apr 4, 2023

It seems @KunWuLuan is talking about Alibaba Cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (The design of maxReplicas is a bit strange, though.)

Hmmm, I know it. Actually, I am the author of this feature in Alibaba Cloud 😄. It took me a long time to think of the name ResourcePolicy 😄 @fjding Did you reference this implementation before?

@denkensk
Yes, the design was inspired by Alibaba Cloud's implementation; at the same time, some other functions were added.

@fjding
Author

fjding commented Apr 4, 2023

If the number of pods scheduled into ecs-pool would exceed 100, the extra pods will be scheduled to eci-pool.

Can you introduce your scenario for this? And why do you need to schedule 100 pods to ecs-pool first? @fjding

As in the example I gave above, multi-AZ deployment is a good case; OpenKruise also provides some cases (link).

@denkensk
Member

denkensk commented Apr 4, 2023

A cluster has multiple AZs (Availability Zones), and each AZ has a VK (virtual kubelet). Users expect a Deployment's Pods to be distributed across the AZs in a certain proportion.
https://www.volcengine.com/docs/6460/177068

Thanks for your explanation, @fjding. I'm glad that these ideas can be applied to your scenario, and that scheduler-plugins can be used in ByteDance's Volcano Engine.

@denkensk
Member

denkensk commented Apr 4, 2023

I think we also need to clarify the core requirements. If you want to deploy the pods across different AZs, why use Max rather than Must? In my experience, users always want the proportion to be required rather than preferred. @fjding

@KunWuLuan Do you have feedback from other users or more requirements for a "resource policy"? We can discuss it here and make a more generic design together.

@fjding
Author

fjding commented Apr 4, 2023

@denkensk Users often use multi-AZ scenarios for disaster recovery purposes. In elastic container scenarios, such as ByteDance's VCI, users cannot accurately predict the upper limit of VCI capacity. Therefore, they cannot refuse to launch a pod just because resources in one AZ are unavailable.

@fjding
Author

fjding commented Apr 4, 2023

I think we also need to clarify the core requirements. If you want to deploy the pods across different AZs, why use Max rather than Must? In my experience, users always want the proportion to be required rather than preferred. @fjding

@KunWuLuan Do you have feedback from other users or more requirements for a "resource policy"? We can discuss it here and make a more generic design together.

BTW, with the strategy set to required, maxReplicas can meet the "Must" scenario you mentioned.
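For illustration, a hedged sketch (same assumed field names as the earlier examples, not the final spec) of how the "Must" case might be expressed by combining a hard strategy with per-unit caps:

apiVersion: scheduling.example.com/v1alpha1   # placeholder group/version
kind: ResourcePolicy
metadata:
  name: must-proportion
  namespace: demo
spec:
  selector:
    app: my-app
  strategy: required       # hard constraint: pods must fit within the units and caps below
  units:
  - name: az-a
    maxReplicas: 20
    nodeSelector:
      topology.kubernetes.io/zone: az-a
  - name: az-b
    maxReplicas: 10
    nodeSelector:
      topology.kubernetes.io/zone: az-b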

@KunWuLuan
Contributor

Do you have feedback from other users or more requirements for a "resource policy"?

In my cloud scenario, our users use ResourcePolicy to run a fixed number of Pods on ECS nodes (like maxReplicas in this design) and schedule the Pods that are scaled out during peak periods to Spot instances or ECI.

@fjding
Author

fjding commented Apr 7, 2023

@Huang-Wei @denkensk @ffromani
After the above discussion, do you have any other questions? Can we now propose a complete KEP?
cc @KunWuLuan

@fjding
Author

fjding commented Apr 20, 2023

@Huang-Wei
Hi, are there any other issues with this proposal? If not, can we proceed with writing a KEP document?

@Huang-Wei
Contributor

Sure, please go ahead and raise a KEP. We can continue the discussion there. Just keep in mind that this repo focuses more on the scheduling portion and may leave the discussion of CRD spec details outside.

@fjding
Author

fjding commented Apr 21, 2023

Sure, please go ahead and raise a KEP. We can continue the discussion there. Just keep in mind that this repo focuses more on the scheduling portion and may leave the discussion of CRD spec details outside.

Thanks. @KunWuLuan, we can do it together now.

@KunWuLuan
Contributor

@fjding Hi, I have submitted a draft for this feature.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2024
@KunWuLuan
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@KunWuLuan
Contributor

This CRD is widely used in both my company and fjding's; in the proposal we have selected the features we both need for our customers. So we think the CRD described in the proposal is a stable version and it will not be updated frequently.
Maybe we can host this CRD in scheduler-plugins instead of elsewhere. WDYT? @fjding
cc @ffromani @Huang-Wei

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2024
@KunWuLuan
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2024