resourceModels supports extended resources #4050
That's because the validation rules restrict the supported resources here. I might need to ask several questions when considering whether we can extend it to introduce another resource, like
I think GPU might also have the fragmentation issue under General Cluster Modeling. For example, if we have 3 nodes and each node has 1 GPU left, Karmada would think the cluster has 3 GPUs and thus could assign a job that requires 2 or 3 GPUs. Am I right?
Hi @RainbowMango:
Hi @wengyao04 I believe it's a reasonable feature. And I asked @chaosi-zju to help with this. He will sync the progress here.
/assign
Hi @chaosi-zju thank you for your demo at the community meetup. Do we have a nightly build of the Karmada scheduler image that we can try out on our platform?
I just talked to @chaosi-zju this morning; he will send the PR this week. Hopefully it can be included in the coming v1.8 release by the end of this month.
Hi @wengyao04, sorry for the delay, I will submit the PR as soon as possible~
/kind feature
Hi @wengyao04 This feature (#4307) has been merged, so you can test it out. It will be released in release-1.8 by the end of this month. If you want a preview release before then, please feel free to let me know.
Hi @RainbowMango and @chaosi-zju Thank you very much for providing this feature! We will sync the latest images and test it out.
Hi @RainbowMango and @chaosi-zju we tested the latest image and it categorizes our GPU nodes correctly. But we find that resourceModels causes potential resource waste (underutilization), because a node is categorized by its lowest resource. This underutilization is even worse when our cluster mixes GPU and CPU boxes. Let me give a simple example. Suppose there are 3 nodes in my cluster:
If we define resourceModels like the following:
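(The actual manifest was not captured in this thread; a hypothetical resourceModels definition consistent with the grades discussed below might look like the following. The grade boundaries and quantities here are illustrative assumptions, not the reporter's real values.)

```yaml
# Hypothetical sketch of a Cluster with resourceModels; values are made up.
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member1
spec:
  resourceModels:
  - grade: 0
    ranges:
    - name: cpu
      min: "0"
      max: "16"
    - name: memory
      min: "0"
      max: 64Gi
  - grade: 4
    ranges:
    - name: cpu
      min: "64"
      max: "128"
    - name: memory
      min: 256Gi
      max: 1Ti
    - name: nvidia.com/gpu
      min: "1"
      max: "8"
```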
Then the two GPU nodes are in grade 4 and the CPU node is in grade 0. If our two GPU nodes are fully occupied and a user submits a CPU workload requiring 10 CPUs and 100 Gi of memory, the workload cannot be scheduled, because the Karmada scheduler thinks the CPU node is in grade 0 and doesn't have enough resources, even though the cluster summary shows there are enough allocatable resources. I can put more cpu/memory in grade 0, but the resource underutilization always exists. Could we have a suggestion from the community on how to properly set resourceModels? Thank you!
I'm surprised by that. Given 72 CPUs on each node, I suppose the CPU node should be in grade
Hi @RainbowMango In our real clusters we have 6 nodes in total, 4 GPU and 2 CPU boxes; this is the summary from the cluster status:
The 4 CPU nodes are categorized at grade 0. I see you have this logic: https://github.com/karmada-io/karmada/blob/master/pkg/modeling/modeling.go#L123, but a CPU node always has GPU quantity 0 and so cannot be categorized into any other grade. I can increase grade 0's cpu/memory, but that will cause GPU resource waste.
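The behavior being described can be sketched roughly as follows. This is a simplified illustration, not Karmada's actual implementation; the grade names, thresholds, and quantities are made up. The idea is that a node lands in the highest grade whose minimums it meets for every modeled resource, so a node that lacks a resource entirely (GPU quantity 0) is pinned to grade 0 no matter how much CPU it has.

```go
package main

import "fmt"

// Grade holds minimum quantities a node must have, per resource name,
// to qualify for that grade. (Illustrative shape, not Karmada's types.)
type Grade struct {
	Name string
	Min  map[string]int64
}

// classify returns the highest grade whose minimums the node satisfies
// for every resource; a resource missing from the node counts as 0.
func classify(node map[string]int64, grades []Grade) string {
	best := grades[0].Name
	for _, g := range grades {
		ok := true
		for res, min := range g.Min {
			if node[res] < min {
				ok = false
				break
			}
		}
		if ok {
			best = g.Name
		}
	}
	return best
}

func main() {
	grades := []Grade{
		{"grade0", map[string]int64{}},
		{"grade4", map[string]int64{"cpu": 16, "nvidia.com/gpu": 1}},
	}
	cpuNode := map[string]int64{"cpu": 72}                      // plenty of CPU, no GPU
	gpuNode := map[string]int64{"cpu": 72, "nvidia.com/gpu": 8} // has GPUs

	fmt.Println(classify(cpuNode, grades)) // grade0: gpu quantity 0 blocks grade4
	fmt.Println(classify(gpuNode, grades)) // grade4
}
```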
Then I will have 4 CPU nodes in grade 0 and 2 GPU nodes in grade 4.
I think in our current situation we will probably disable the resourceModels feature gate and just use the resource summary during scheduling. For the resource fragmentation issue, we will probably enable the Volcano gang scheduler on our member clusters to avoid partially running jobs and surface a clear message to clients. I think there is always a tradeoff if we cannot cache all member clusters' nodes in the scheduler cache.
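(For context, disabling the modeling behavior is done through the Karmada feature gates. Assuming the gate is `CustomizedClusterResourceModeling` — verify the exact name against your Karmada version's documentation — the karmada-scheduler container args would look roughly like this sketch:)

```yaml
# Sketch only; confirm the feature-gate name for your Karmada release.
containers:
- name: karmada-scheduler
  image: docker.io/karmada/karmada-scheduler:v1.8.0
  command:
  - /bin/karmada-scheduler
  - --kubeconfig=/etc/kubeconfig
  - --feature-gates=CustomizedClusterResourceModeling=false
```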
@wengyao04 please give me some time to think about it, I will follow up as soon as possible~
Hi @wengyao04 For your scenario, ResourceModel may really not be the most suitable choice, because ResourceModel is meant to be a rough estimation, not a precise one. And in your scenario, the shortcomings of ResourceModel are clearly exposed, since:
However, that doesn't mean Karmada can't support your scenario; there are other options in Karmada if you need a more accurate scheduler! Another option is to use
I'll write another demo using it. Besides, I want to know which installation method you used. If you have problems installing
@wengyao04 the demo of
Hi @chaosi-zju Thank you very much. I disabled the feature. One small issue is that the helm chart only supports one cluster: https://github.com/karmada-io/karmada/blob/master/charts/karmada/values.yaml#L742. I manually installed one for another member cluster. Could we make them a list?
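(As a sketch of what the requested list form might look like in the chart's `values.yaml` — this is a hypothetical shape for illustration, not the chart's actual schema, and the field names are made up:)

```yaml
# Hypothetical values.yaml shape: one scheduler-estimator per member cluster.
schedulerEstimator:
  memberClusters:
  - name: member1
    kubeconfig: member1-kubeconfig
    replicas: 2
  - name: member2
    kubeconfig: member2-kubeconfig
    replicas: 2
```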
Got it, it makes sense. I will support it~
Hi @wengyao04 I have evaluated that it's feasible to change the estimator in the helm chart to a list format, can you create a
Meanwhile, since you are using the helm installation method, I would like to ask whether there was anything you found troublesome during the helm installation, or anything that could be improved in the installation experience. Can you give us suggestions to improve it? You could create another issue describing what your ideal installation would look like.
Hi @chaosi-zju Thank you very much. I submitted a separate issue: #4368
Please provide an in-depth description of the question you have:
I am able to register clusters in push mode and use the default resourceModel, which only supports cpu, memory, ephemeral-storage, and storage.
When I add an extended resource like gpu to the resourceModels of the cluster, like
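(The manifest itself was not captured; a configuration of the kind described, adding an extended resource name to `resourceModels`, would look roughly like the following. The quantities are illustrative, not the reporter's actual values.)

```yaml
# Illustrative fragment: an extended resource added to resourceModels.
spec:
  resourceModels:
  - grade: 0
    ranges:
    - name: cpu
      min: "0"
      max: "2"
    - name: nvidia.com/gpu
      min: "0"
      max: "1"
```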
I get
My understanding is that General Cluster Modeling uses the resourceSummary to check allocatable/allocated resources when scheduling pods.
But we also want to have GPU in Customized Cluster Modeling. I don't think gpu will have fragmentation issues like cpu/memory do when using general cluster modeling, as people cannot claim a partial GPU. But it would still be preferable to have extended resources in customized cluster modeling, just to keep it consistent with the cluster resources.
I also run the Karmada dashboard; it shows cpu, memory, and storage. It would be nice to show extended resources too.
What do you think about this question?:
Environment: