[scheduling] Expand slot scheduler to resource scheduler #2846

Merged: 56 commits merged on Mar 28, 2022

@zhongchun zhongchun commented Mar 22, 2022

What do these changes do?

Currently, Mars uses slots for resource management and band allocation. Slots account only for CPU and GPU, not memory, and Mars always allocates one slot, representing one CPU core or one GPU, to each subtask. This works well most of the time, but it has some shortcomings:

  • Subtasks that need less CPU are assigned more than they need, which results in low CPU utilization and long execution times.
  • Subtasks that need more memory and less CPU can cause the node to run out of memory (OOM).

So we could develop more granular resource management and allocation to increase resource utilization, improve scheduling efficiency, and avoid OOM.
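
To make the difference between slot-based and resource-based allocation concrete, here is a minimal sketch. It is not the code in this PR: the `Resource` fields, the `fits_in` and `pick_band` helpers, and the band names are all hypothetical and chosen only for illustration. It shows a subtask request expressed as CPU, GPU, and memory, and a band chosen only if it can cover all three.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector; field names are illustrative only."""

    num_cpus: float = 0.0
    num_gpus: float = 0.0
    mem_bytes: int = 0

    def __sub__(self, other: "Resource") -> "Resource":
        # Subtract a request from the free resources of a band.
        return Resource(
            self.num_cpus - other.num_cpus,
            self.num_gpus - other.num_gpus,
            self.mem_bytes - other.mem_bytes,
        )

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every dimension is covered, memory included.
        return (
            self.num_cpus <= available.num_cpus
            and self.num_gpus <= available.num_gpus
            and self.mem_bytes <= available.mem_bytes
        )


def pick_band(request: Resource, free: Dict[str, Resource]) -> Optional[str]:
    """Return the first band whose free resources cover the request, else None."""
    for band, available in free.items():
        if request.fits_in(available):
            return band
    return None


# Assumed cluster state: free resources per band (names are made up).
free = {
    "worker-1": Resource(num_cpus=4, mem_bytes=2 * 1024**3),
    "worker-2": Resource(num_cpus=2, mem_bytes=16 * 1024**3),
}

# A memory-heavy subtask: half a core and 8 GiB of memory.
request = Resource(num_cpus=0.5, mem_bytes=8 * 1024**3)

chosen = pick_band(request, free)
if chosen is not None:
    # Reserve the requested resources on the chosen band.
    free[chosen] = free[chosen] - request
```

A scheduler that looks only at CPU slots would place this subtask on `worker-1`, which has spare cores but too little memory. Checking every dimension skips it in favor of `worker-2`, and the fractional CPU request leaves the remaining capacity available to other subtasks.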

Related issue number

Closes #2787

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them


@wjsi wjsi left a comment


The attribute num_mem_bytes in the Resource class should also be renamed to mem_bytes.


qinxuye commented Mar 26, 2022

I think the failure of the asv benchmark can be ignored: because of the modification, the benchmark can no longer run on the master branch. The time in #2875 is 1.09±0.02s, so I think this PR does not make the time worse.

@zhongchun

> I think the failure of the asv benchmark can be ignored: because of the modification, the benchmark can no longer run on the master branch. The time in #2875 is 1.09±0.02s, so I think this PR does not make the time worse.

I think so. Thanks for your reminder.


@wjsi wjsi left a comment


LGTM


@qinxuye qinxuye left a comment


LGTM

@qinxuye qinxuye merged commit 0509a44 into mars-project:master Mar 28, 2022
@hekaisheng hekaisheng added the "to be backported" label (indicates that the PR needs to be backported to the stable branch) Mar 31, 2022
@qinxuye qinxuye removed the "to be backported" label Mar 31, 2022
Development

Successfully merging this pull request may close these issues.

[Proposal] Expand slot management to resource management
5 participants