Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray] Ray execution state #3002

Merged
merged 9 commits into from
May 7, 2022

Conversation

fyrestone
Copy link
Contributor

@fyrestone fyrestone commented May 6, 2022

What do these changes do?

Mars execution context provides remote object to sync states between multiple running operands. Ray execution backends use a task state actor for this feature.

With this PR, ray backend supports incremental index.

  • Move the logic of initializing context (the ThreadedServiceContext used in the supervisor) to the execution backend.
  • Add a task state actor to Ray execution backend.
  • Ray execution context supports remote object.
  • Not to fetch the chunk meta when tiling HeadOptimizedDataSource.

The drawback is that,

  • Poor performance to sync states among operands, maybe we can avoid using remote object when executing operands in the future.
  • Hard to fault tolerance because the remote object makes the operands stateful.

Related issue number

#2893

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@chaokunyang
Copy link
Contributor

Will this ray actor influence the lineage reconstruction in #2972:

  1. If this actor died, will lineage reconstruction still succeed?
  2. If this actor restarted by ray, how the state auto recovered?
  3. If the lineage reconstruction call this actor multiple times, how the idempotence be ensured?

@fyrestone
Copy link
Contributor Author

Will this ray actor influence the lineage reconstruction in #2972:

  1. If this actor died, will lineage reconstruction still succeed?
  2. If this actor restarted by ray, how the state auto recovered?
  3. If the lineage reconstruction call this actor multiple times, how the idempotence be ensured?

Currently, some operands use a remote object to sync states. If the state actor is reconstructed, the simplest way to recover the compute is,

  1. If call remote object can't find the remote object and the state actor is reconstructed, then raises a RecoveryFailed exception.
  2. The fault recovery logic catch this exception to recompute the predecessors.

We should make the operands stateless to avoid above complex recovery.

This PR is to maximize compatibility with existing Mars execution logic, the fault recovery is not included in this PR.

@fyrestone fyrestone marked this pull request as ready for review May 7, 2022 01:34
Copy link
Collaborator

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@qinxuye
Copy link
Collaborator

qinxuye commented May 7, 2022

@chaokunyang could you please review this PR?

@qinxuye qinxuye added this to the v0.9.0rc3 milestone May 7, 2022
@fyrestone fyrestone requested a review from zhongchun May 7, 2022 09:30
Copy link
Contributor

@zhongchun zhongchun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left two comments where i am a little confused.

mars/services/task/execution/ray/context.py Show resolved Hide resolved
mars/dataframe/datasource/core.py Show resolved Hide resolved
Copy link
Contributor

@zhongchun zhongchun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fyrestone fyrestone merged commit 03ed810 into mars-project:master May 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants