
resourceModels supports extended resources #4050

Closed · wengyao04 opened this issue Sep 11, 2023 · 21 comments · Fixed by #4307
Labels: kind/feature (Categorizes issue or PR as related to a new feature), kind/question (Indicates an issue that is a support question)
Milestone: v1.8

@wengyao04 (Contributor)

Please provide an in-depth description of the question you have:
I am able to register clusters in push mode and use the default resourceModel, which only supports cpu, memory, ephemeral-storage, and storage.
When I add extended resources like GPUs to the cluster's resourceModels, like

  resourceModels:
  - grade: 0
    ranges:
    - max: "72"
      min: "0"
      name: cpu
    - max: 560Gi
      min: "0"
      name: memory
    - max: "0"
      min: "0"
      name: nvidia.com/gpu
    - max: "0"
      min: "0"
      name: myexample.com/gpu-v100
  - grade: 1
    ranges:
    - max: "96"
      min: "72"
      name: cpu
    - max: 1.6Ti
      min: 560Gi
      name: memory
    - max: "4"
      min: "0"
      name: nvidia.com/gpu
    - max: "4"
      min: "0"
      name: myexample.com/gpu-v100

I get

Unsupported value: "nvidia.com/gpu": supported values: "cpu", "ephemeral-storage", "memory", "storage"

My understanding is that General Cluster Modeling uses the resourceSummary to check allocatable/allocated resources when scheduling pods.
But we also want to have GPUs in Customized Cluster Modeling. I don't think GPUs will have the fragmentation issues that cpu/memory have under general cluster modeling, since people cannot claim a partial GPU, but it would still be preferable to have extended resources in the customized cluster modeling, just to keep it consistent with the cluster's resources.
I also run the Karmada dashboard; it shows cpu, memory, and storage. It would be nice to show extended resources there as well.
What do you think about this question?:

Environment:

  • Karmada version: latest
  • Kubernetes version: 1.24
  • Others:
@wengyao04 added the kind/question label (Indicates an issue that is a support question) on Sep 11, 2023
@RainbowMango added this to the v1.8 milestone on Sep 12, 2023
@RainbowMango (Member)

> Unsupported value: "nvidia.com/gpu": supported values: "cpu", "ephemeral-storage", "memory", "storage"

That's because the validation rules restrict the supported resources here. I might need to ask several questions while thinking about whether we can extend the validation to introduce another resource, like nvidia.com/gpu.
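
To make the constraint concrete, here is a minimal sketch (hypothetical Go, not Karmada's actual validation code) of the kind of allow-list check behind that error:

package main

import "fmt"

// supportedResourceNames mirrors the error message above: before this feature,
// resource model names outside this fixed set were rejected at validation time.
var supportedResourceNames = map[string]bool{
    "cpu":               true,
    "memory":            true,
    "ephemeral-storage": true,
    "storage":           true,
}

// validateModelName is a hypothetical stand-in for the real validation rule.
func validateModelName(name string) error {
    if !supportedResourceNames[name] {
        return fmt.Errorf("Unsupported value: %q: supported values: \"cpu\", \"ephemeral-storage\", \"memory\", \"storage\"", name)
    }
    return nil
}

func main() {
    fmt.Println(validateModelName("nvidia.com/gpu")) // rejected before this feature
}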

> I don't think GPUs will have the fragmentation issues that cpu/memory have under general cluster modeling, since people cannot claim a partial GPU, but it would still be preferable to have extended resources in the customized cluster modeling, just to keep it consistent with the cluster's resources.

I think GPUs might also have the fragmentation issue if you mean General Cluster Modeling. For example, if we have 3 nodes and each node has 1 GPU left, Karmada would think the cluster has 3 GPUs and thus could assign a job that requires 2 or 3 GPUs. Am I right?

@wengyao04 (Contributor, Author)

Hi @RainbowMango:
You are right, we also have GPU fragmentation. In the example, if we have 4 nodes and each node has 1 GPU left, and we require one master and one worker, each of which needs 2 GPUs, it won't work. It would be helpful to support extended resources in the resourceModel.

@RainbowMango (Member)

Hi @wengyao04 I believe it's a reasonable feature, and I've asked @chaosi-zju to help with this. He will sync the progress here.

@chaosi-zju (Member)

/assign

@wengyao04 (Contributor, Author)

Hi @chaosi-zju thank you for your demo in the community meetup. Is there a nightly build of the Karmada scheduler image so we can try it out on our platform?

@RainbowMango (Member)

I just talked to @chaosi-zju this morning; he will send the PR this week. Hopefully it can be included in the coming v1.8 release by the end of this month.

@chaosi-zju (Member)

> Hi @chaosi-zju thank you for your demo in the community meetup. Is there a nightly build of the Karmada scheduler image so we can try it out on our platform?

Hi @wengyao04, sorry for the delay, I will submit the PR as soon as possible~

@RainbowMango (Member)

/kind feature

@karmada-bot added the kind/feature label (Categorizes issue or PR as related to a new feature) on Nov 24, 2023
@RainbowMango (Member)

Hi @wengyao04 This feature (#4307) has been merged; you can test it with the latest image now.
Thanks for spotting this, your feedback means a lot to the community.

This feature will be released in release-1.8 by the end of this month. If you want a preview build before the release, please feel free to let me know.

@wengyao04 (Contributor, Author)

Hi @RainbowMango and @chaosi-zju Thank you very much for providing this feature! We will sync the latest images and test it out.

@wengyao04 (Contributor, Author)

Hi @RainbowMango and @chaosi-zju we tested the latest image and it categorizes our GPU nodes correctly. But we find that the resourceModel causes potential resource waste (underutilization), because a node's grade is determined by its lowest-graded resource. This underutilization is even worse when our cluster mixes GPU and CPU boxes.

Let me give a simple example. Suppose there are 3 nodes in my cluster:

  • node 1: 72 CPUs, 512Gi Memory, 4 GPUs
  • node 2: 72 CPUs, 512Gi Memory, 4 GPUs
  • node 3: 72 CPUs, 512Gi Memory, 0 GPUs

If we define the resourceModels like the following:

resourceModels:
- grade: 0
  ranges:
  - min: "0"
    max: "4"
    name: cpu
  - min: "0"
    max: 16Gi
    name: memory
  - min: "0"
    max: "1"
    name: nvidia.com/gpu
- grade: 1
  ranges:
  - min: "4"
    max: "16"
    name: cpu
  - min: 16Gi
    max: 128Gi
    name: memory
  - min: "1"
    max: "2"
    name: nvidia.com/gpu
- grade: 2
  ranges:
  - min: "16"
    max: "32"
    name: cpu
  - min: 128Gi
    max: 256Gi
    name: memory
  - min: "2"
    max: "3"
    name: nvidia.com/gpu
- grade: 3
  ranges:
  - min: "32"
    max: "48"
    name: cpu
  - min: 256Gi
    max: 384Gi
    name: memory
  - min: "3"
    max: "4"
    name: nvidia.com/gpu
- grade: 4
  ranges:
  - min: "48"
    max: "9223372036854775807"
    name: cpu
  - min: 384Gi
    max: "9223372036854775807"
    name: memory
  - min: "4"
    max: "9223372036854775807"
    name: nvidia.com/gpu

Then the two GPU nodes are in grade 4 and the CPU node is in grade 0. If our two GPU nodes are fully occupied and a user submits a CPU workload that requires 10 CPUs and 100Gi of memory, the workload cannot be scheduled, because the Karmada scheduler thinks the CPU node is in grade 0 and doesn't have enough resources, even though the cluster summary shows enough allocatable resources.

I can put more cpu/memory in grade 0, but some resource underutilization always exists. Could the community suggest how to properly set the resourceModels?

Thank you !

@RainbowMango (Member)

> Then the two GPU nodes are in grade 4 and the CPU node is in grade 0.

I'm surprised by that: given 72 CPUs on each node, I'd expect the CPU node to be in grade 4 ([48, ∞)).
Can you share the status of the testing cluster, including the resource model configuration in .spec.resourceModels and the resourceSummary in .status.resourceSummary?

@wengyao04 (Contributor, Author)

wengyao04 commented Nov 28, 2023

Hi @RainbowMango In our real cluster we have 6 nodes in total: 2 GPU and 4 CPU boxes. This is the summary from the cluster status:

  kubernetesVersion: v1.24.13
  nodeSummary:
    readyNum: 6
    totalNum: 6
  resourceSummary:
    allocatable:
      cpu: "432"
      ephemeral-storage: "16767979331679"
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 3236824812Ki
      nvidia.com/gpu: "8"
      pods: "660"
    allocatableModelings:
    - count: 4
      grade: 0
    - count: 0
      grade: 1
    - count: 0
      grade: 2
    - count: 0
      grade: 3
    - count: 2
      grade: 4
    allocated:
      cpu: 51835m
      memory: 36420050Ki
      pods: "145"

The 4 CPU nodes are categorized at grade 0. I see you have this logic https://github.com/karmada-io/karmada/blob/master/pkg/modeling/modeling.go#L123 but a CPU node always has GPU quantity 0 and so can never be categorized into a higher grade.
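
To make what I'm seeing concrete, here is a simplified sketch of the grading behavior as I understand it (illustrative only, not the actual modeling.go code): each resource gets the highest grade whose minimum it still meets, and the node takes the minimum grade across all modeled resources, so a CPU-only node is pinned to grade 0 by its 0 GPUs.

package main

import "fmt"

// Grade minimums per resource, taken from the example model earlier in this
// thread (slice index = grade). Illustrative data layout, not Karmada's.
var gradeMins = map[string][]float64{
    "cpu":            {0, 4, 16, 32, 48},
    "memory-gi":      {0, 16, 128, 256, 384},
    "nvidia.com/gpu": {0, 1, 2, 3, 4},
}

// resourceGrade returns the highest grade whose minimum qty still meets.
func resourceGrade(qty float64, mins []float64) int {
    grade := 0
    for i, min := range mins {
        if qty >= min {
            grade = i
        }
    }
    return grade
}

// nodeGrade takes the minimum grade across all modeled resources: one scarce
// resource (here, GPUs) drags the whole node down to grade 0.
func nodeGrade(node map[string]float64) int {
    grade := len(gradeMins["cpu"]) - 1
    for name, qty := range node {
        if g := resourceGrade(qty, gradeMins[name]); g < grade {
            grade = g
        }
    }
    return grade
}

func main() {
    cpuNode := map[string]float64{"cpu": 72, "memory-gi": 512, "nvidia.com/gpu": 0}
    fmt.Println(nodeGrade(cpuNode)) // prints 0: pinned by the GPU dimension
}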

I can increase the grade 0 cpu/memory bounds, but that causes GPU resource waste.
For a simple example, if my resourceModels are like the following:

resourceModels:
- grade: 0
  ranges:
  - min: "0"
    max: "32"
    name: cpu
  - min: "0"
    max: 256Gi
    name: memory
  - min: "0"
    max: "1"
    name: nvidia.com/gpu
- grade: 1
  ranges:
  - min: "32"
    max: "40"
    name: cpu
  - min: 256Gi
    max: 320Gi
    name: memory
  - min: "1"
    max: "2"
    name: nvidia.com/gpu
- grade: 2
  ranges:
  - min: "40"
    max: "48"
    name: cpu
  - min: 320Gi
    max: 384Gi
    name: memory
  - min: "2"
    max: "3"
    name: nvidia.com/gpu
- grade: 3
  ranges:
  - min: "48"
    max: "56"
    name: cpu
  - min: 384Gi
    max: 464Gi
    name: memory
  - min: "3"
    max: "4"
    name: nvidia.com/gpu
- grade: 4
  ranges:
  - min: "56"
    max: "9223372036854775807"
    name: cpu
  - min: 464Gi
    max: "9223372036854775807"
    name: memory
  - min: "4"
    max: "9223372036854775807"
    name: nvidia.com/gpu

Then I will have 4 CPU nodes in grade 0 and 2 GPU nodes in grade 4.
If I have two GPU workloads, each using 1 GPU but 40 CPUs and 260Gi of memory, then although each GPU node still has 3 GPUs, 32 CPUs, and 252Gi of memory left, the GPU nodes are categorized as grade 0 (the remaining memory falls below the 256Gi minimum of grade 1), causing GPU underutilization.

@wengyao04 (Contributor, Author)

I think in our current situation we will probably disable the resourceModel feature gate and just use the resource summary during scheduling. For the resource fragmentation issue, we will probably enable the Volcano gang scheduler on our member clusters to avoid partially running jobs and to surface a clear message to the clients.
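
For anyone else hitting this: the gate in question is CustomizedClusterResourceModeling, which (as far as I can tell, please verify against your version) can be turned off on karmada-controller-manager and karmada-agent via the standard feature-gate flag:

    --feature-gates=CustomizedClusterResourceModeling=false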

I think there is always a tradeoff if we cannot cache all member clusters' nodes in the scheduler cache.

@chaosi-zju (Member)

@wengyao04 please give me some time to think about it; I will get back to you as soon as possible~

@chaosi-zju (Member)

chaosi-zju commented Nov 28, 2023

Hi @wengyao04

For your scenario, using ResourceModel may really not be the most suitable choice, because ResourceModel is meant to be a rough estimation rather than a precise one. And in your scenario the shortcomings of ResourceModel are exposed especially clearly, since:

  1. You have a large range of CPU and memory (for example, CPU 0~72C, memory 0~560Gi) but a small range of GPU (only 0~4), which makes it really difficult to divide the ResourceModel into grades.
  2. ResourceModel assumes the amounts of required resources are positively correlated: for example, the more CPU a workload requires, the more memory it tends to require, so CPU and memory are positively correlated. However, GPUs are not so strongly correlated with them.

However, it doesn't mean Karmada cannot support your scenario; there are other ways in Karmada if you need a more accurate scheduler!

Another option is to use the karmada-scheduler-estimator. The downside is that you need to deploy an additional component (which costs more resources); you can refer to Cluster Accurate Scheduler Estimator For Rescheduling for more information.

This component lists/watches the node objects of every member cluster and maintains an accurate overview of the remaining resources on each node.
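
(Note: the karmada-scheduler only consults the estimators when it is started with the --enable-scheduler-estimator=true flag, and you deploy one estimator per member cluster; please double-check the flag against your version.)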

I'll write another demo using the karmada-scheduler-estimator specifically for your scenario in a following comment~

Besides, I want to know which installation method you used. If you have any problems installing the karmada-scheduler-estimator component, feel free to ask me~

@chaosi-zju (Member)

@wengyao04 the demo of karmada-scheduler-estimator: https://h3ld32xlpo.feishu.cn/wiki/V7shw9Q3kiGkCak4ELocSzDen4g

@wengyao04 (Contributor, Author)

Hi @chaosi-zju Thank you very much. I disabled CustomizedClusterResourceModeling and deployed the cluster estimator. It meets our requirements.

One small issue is that the helm chart only supports one member cluster https://github.com/karmada-io/karmada/blob/master/charts/karmada/values.yaml#L742 so I manually installed another estimator for the other member cluster. Could we make them a list?
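
For illustration, something like this hypothetical shape is what I have in mind (field names here are invented; the actual format is of course up to the maintainers):

memberClusters:
- clusterName: member1
  kubeconfig: member1-kubeconfig-secret
- clusterName: member2
  kubeconfig: member2-kubeconfig-secret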

@chaosi-zju (Member)

> Could we make them a list?

Got it, it makes sense. I will support it~

@chaosi-zju (Member)

Hi @wengyao04

I have evaluated it, and it's feasible to change the estimator configuration in the helm chart to a list format. Can you create an issue in the Karmada repo for me? I will try to submit a related PR this week~

Meanwhile, since you are using the helm installation method, I would like to ask whether there is anything you found troublesome in the helm installation process, or anything that could be improved in the installation experience. Can you provide us with suggestions to improve it? You could create another issue describing what your ideal installation would look like.

@wengyao04 (Contributor, Author)

Hi @chaosi-zju Thank you very much. I submitted a separate issue: #4368
