
[proposal] improve ElasticQuota #1837

Closed
eahydra opened this issue Jan 13, 2024 · 2 comments
Labels
area/koord-scheduler enhancement New feature or request kind/proposal Create a report to help us improve lifecycle/stale

Comments

@eahydra
Member

eahydra commented Jan 13, 2024

What is your proposal:

ElasticQuota feature enhancements overview

ElasticQuota is a key feature of the Koordinator project and has been supported since its early days. It is compatible with the original ElasticQuota CRD and has also undergone the following enhancements:

  • Tree Structure Management: Allows resources to be divided by organizational structure or workload type.
  • Weight support: Allocates runtime quota in proportion to weight; quotas with higher weights receive a larger share.
  • Fairness Guarantee: Implements a resource quota allocation mechanism that is as fair as possible.
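As an illustration of the weight-based sharing described above, here is a minimal Go sketch. The function name and the round-based redistribution policy are hypothetical simplifications for illustration, not Koordinator's actual implementation:

```go
package main

import "fmt"

// divideByWeight is a minimal sketch of weight-proportional quota sharing
// (a hypothetical simplification, not Koordinator's actual algorithm).
// Each round, the remaining free resource is split among active children in
// proportion to their weights; a child that reaches its max is capped and
// dropped, and its leftover share is redistributed in later rounds.
func divideByWeight(total int64, weights, maxes []int64) []int64 {
	shares := make([]int64, len(weights))
	active := make(map[int]bool, len(weights))
	for i := range weights {
		active[i] = true
	}
	remaining := total
	for remaining > 0 && len(active) > 0 {
		var wsum int64
		for i := range active {
			wsum += weights[i]
		}
		if wsum == 0 {
			break
		}
		var granted int64
		var capped []int
		for i := range active {
			grant := remaining * weights[i] / wsum
			if shares[i]+grant >= maxes[i] {
				grant = maxes[i] - shares[i]
				capped = append(capped, i)
			}
			shares[i] += grant
			granted += grant
		}
		if granted == 0 {
			break // integer rounding: nothing left to hand out
		}
		remaining -= granted
		for _, i := range capped {
			delete(active, i)
		}
	}
	return shares
}

func main() {
	// quota A (weight 1) vs quota B (weight 3): B gets 3x the share
	fmt.Println(divideByWeight(100, []int64{1, 3}, []int64{1000, 1000})) // → [25 75]
	// quota A is capped at max=30; its surplus flows to B
	fmt.Println(divideByWeight(100, []int64{1, 1}, []int64{30, 1000})) // → [30 70]
}
```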

Recently added features

  • NonPreemptible mechanism: Users can mark Pods as non-preemptible; the resource usage of such Pods must not exceed the guaranteed minimum (min).
  • Multi Quota Tree: Introduced via ElasticQuotaProfile, allowing the construction of new Quota Trees and defining the maximum resource usage for each tree.
  • New statistical dimensions: Adds Guarantee and Allocated accounting dimensions to better support elastic resource requirements.

Problems to be solved and optimization directions

  1. Configurability of the fairness mechanism ([proposal] supports disable runtime quota #1780; implemented by scheduler: supports disable runtime quota #1839 and scheduler: ElasticQuota runtime is no longer calculated when not needed #1855)

    • The fairness mechanism is enabled by default and currently cannot be turned off; options need to be provided to adapt to special needs such as Job scenarios. For example, under the fairness mechanism, the quota that was originally used by a single Job may be split across multiple Jobs, so that none of them obtains enough quota to run.
  2. Quota Tree integration

    • Integrate Multi Quota Tree with the global default Quota Tree to ensure consistency and simplify management. The global default Quota Tree is just a special case of Multi Quota Tree, one that contains the resources of all nodes in the cluster.
  3. Clarify resource request verification

    • Clarify the verification logic between Pod resource requests and ElasticQuota boundaries across the different scenarios. In the early design, it was only necessary to check whether the sum of the Pod's Request and the quota's Used stayed below runtime and max (runtime may be greater than or equal to min). The NonPreemptible mechanism was introduced later: the upper bound of the available quota for non-preemptible Pods is min, while the upper bound for preemptible Pods is runtime and max. Some scenarios are more complex, for example whether non-preemptible and preemptible Pods should be counted in Used together or accounted separately. Multi Quota Tree introduced two further dimensions, Guarantee and Allocated; they solve the resource reservation problem in on-demand elastic scenarios, but they also affect the upper bound of a Pod's available resources. We therefore need to clarify the verification method for each of these scenarios to ensure they are interpretable.
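The boundary rules above can be sketched as a single admission check. This is an illustrative reading of the described semantics, not Koordinator's actual code; in particular, it assumes a single Used counter rather than separate accounting for non-preemptible and preemptible Pods:

```go
package main

import "fmt"

// canAdmit sketches the quota boundary check described above (assumed
// semantics): a non-preemptible pod's usage is bounded by min, while a
// preemptible pod's usage is bounded by the smaller of runtime and max.
func canAdmit(request, used, min, runtime, max int64, nonPreemptible bool) bool {
	if nonPreemptible {
		return used+request <= min
	}
	bound := runtime
	if max < bound {
		bound = max
	}
	return used+request <= bound
}

func main() {
	// preemptible pod: admitted as long as Used+Request stays within runtime/max
	fmt.Println(canAdmit(4, 6, 8, 12, 16, false)) // true: 10 <= min(12, 16)
	// non-preemptible pod: rejected once usage would exceed min
	fmt.Println(canAdmit(4, 6, 8, 12, 16, true)) // false: 10 > 8
}
```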
  4. Remove Special Quota

    • Remove Default Quota and System Quota, replace them with APIs and common capabilities, and automatically inject a QuotaName for Pods that do not specify one. The current design and implementation contain special logic for these two Quotas, and on the whole this special-casing is unreasonable. For example, the scheduler specifically creates these two ElasticQuota objects at startup, both objects must be considered separately when calculating fairness, Pods that do not declare an associated Quota fall back to DefaultQuota by default, and dedicated code (such as migrateDefaultQuotaGroupsPod) is responsible for revising the status of DefaultQuota. From another perspective, these special cases can be expressed as APIs and provided as a general capability: for Pods that do not declare a QuotaName, we can inject a quota name based on the ClusterColocationProfile mechanism, with the injected Quota pointing to what is effectively the DefaultQuota.
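A minimal sketch of the proposed injection, assuming a mutating webhook rewrites Pod labels. The label key and default quota name used here are assumptions for illustration, not confirmed API:

```go
package main

import "fmt"

// Hypothetical label key and default quota name; the real keys used by
// Koordinator's ClusterColocationProfile mechanism may differ.
const (
	quotaNameLabel   = "quota.scheduling.koordinator.sh/name"
	defaultQuotaName = "default"
)

// injectQuotaName mimics the proposed behavior: a Pod that does not declare
// a quota is associated with the default quota; an existing declaration is
// left untouched.
func injectQuotaName(labels map[string]string) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	if _, ok := labels[quotaNameLabel]; !ok {
		labels[quotaNameLabel] = defaultQuotaName
	}
	return labels
}

func main() {
	// Pod without any labels gets the default quota name
	fmt.Println(injectQuotaName(nil)[quotaNameLabel]) // default
	// Pod with an explicit quota keeps it
	withQuota := map[string]string{quotaNameLabel: "team-a"}
	fmt.Println(injectQuotaName(withQuota)[quotaNameLabel]) // team-a
}
```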
  5. Preemption strategy enhancement ([proposal] elastic quota plugin support job-level preemption #1840; proposal: support Job-level preemption #1879)

    • Support a Job-granularity preemption mechanism to optimize resource allocation.

Code quality improvements

  • Improve the readability and maintainability of ElasticQuota's core code and related components to support the project's continued healthy development.
@eahydra eahydra added kind/proposal Create a report to help us improve enhancement New feature or request area/koord-scheduler labels Jan 13, 2024

stale bot commented Apr 30, 2024

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

Thank you for your contributions.


stale bot commented May 30, 2024

This issue has been automatically closed because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Reopen this PR with /reopen

Thank you for your contributions.

@stale stale bot closed this as completed May 30, 2024