Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray] Support basic subtask retry and lineage reconstruction #2969

Merged
merged 7 commits into from
May 20, 2022

Conversation

chaokunyang
Copy link
Contributor

@chaokunyang chaokunyang commented Apr 26, 2022

What do these changes do?

This PR implements basic subtask retry and reconstruction based on ray object lineage reconstruction.

  • reuse subtask_retry_times in scheduling config for ray task retry config.
  • Use max_retries in ray task to enable subtask retry and reconstruction.
  • Support fetch tileable chunks meta without accessing storage api.

With this PR, if task lineage are not evicted, the mars tasks will succeed even some ray nodes has failed.

Related issue number

Closes #3029
#2972

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@chaokunyang chaokunyang changed the title [WIP] set subtask_max_retries for ray task [WIP] Support ray task retry Apr 26, 2022
@chaokunyang
Copy link
Contributor Author

Tests for ray task retry need to be added.

@chaokunyang
Copy link
Contributor Author

chaokunyang commented Apr 26, 2022

subtask_max_retries in task_executor_config is duplicate with scheduling.subtask_max_reschedules, should we unify both? @fyrestone @qinxuye @wjsi @hekaisheng @zhongchun

@chaokunyang
Copy link
Contributor Author

Wait #2968 to get merged, then move part of scheduling config to executor

@chaokunyang chaokunyang force-pushed the support_ray_task_retry branch from 3e073d9 to 3163a46 Compare May 12, 2022 11:25
@chaokunyang chaokunyang changed the title [WIP] Support ray task retry [Ray] Support basic subtask retry and lineage reconstruction May 12, 2022
@chaokunyang chaokunyang force-pushed the support_ray_task_retry branch from 3163a46 to ded20a3 Compare May 12, 2022 12:08
@chaokunyang chaokunyang force-pushed the support_ray_task_retry branch from ded20a3 to 028d965 Compare May 12, 2022 12:19
@chaokunyang chaokunyang force-pushed the support_ray_task_retry branch from 02fa85b to 7f1da23 Compare May 12, 2022 15:10
@chaokunyang chaokunyang requested a review from zhongchun May 13, 2022 02:48
mars/services/task/execution/ray/executor.py Outdated Show resolved Hide resolved
mars/deploy/oscar/session.py Outdated Show resolved Hide resolved
@chaokunyang chaokunyang requested a review from a team as a code owner May 13, 2022 07:53
mars/services/task/execution/ray/config.py Outdated Show resolved Hide resolved
mars/deploy/oscar/session.py Outdated Show resolved Hide resolved
mars/deploy/oscar/session.py Outdated Show resolved Hide resolved
@qinxuye qinxuye added this to the v0.10.0a1 milestone May 17, 2022
Copy link
Collaborator

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.

mars/deploy/oscar/session.py Outdated Show resolved Hide resolved
mars/deploy/oscar/session.py Show resolved Hide resolved
@chaokunyang chaokunyang force-pushed the support_ray_task_retry branch from 54749fd to 55ef24e Compare May 17, 2022 09:30
@zhongchun
Copy link
Contributor

subtask_max_retries in task_executor_config is duplicate with scheduling.subtask_max_reschedules, should we unify both? @fyrestone @qinxuye @wjsi @hekaisheng @zhongchun

Yes, there are some configurations like subtask_max_retries, subtask_cancel_timeout which are in scheduling config while there is no scheduling of ray backend. Maybe we can reuse them by adding them to ray task executor configs when running on ray.

Copy link
Contributor

@fyrestone fyrestone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@qinxuye qinxuye merged commit 61c0c51 into mars-project:master May 20, 2022
@qinxuye qinxuye added the to be backported Indicate that the PR need to be backported to stable branch label May 30, 2022
qinxuye pushed a commit to qinxuye/mars that referenced this pull request May 30, 2022
@wjsi wjsi added backported already PR has been backported and removed to be backported Indicate that the PR need to be backported to stable branch labels Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement basic subtask retry and reconstruction
5 participants