
Basic Ray execution backend #2921

Merged (40 commits) on Apr 24, 2022

Conversation

@fyrestone fyrestone (Contributor) commented Apr 15, 2022

What do these changes do?

  • Add a basic Ray executor backend that converts the Mars subtask graph into a Ray DAG for execution.
  • Unify the get-chunk-params logic into mars.utils.get_chunk_params.
  • Rename the Execution API get_available_band_slots to get_available_band_resources.
  • Change the return type of the Execution API execute_subtask_graph from List[ExecutionChunkResult] to Dict[Chunk, ExecutionChunkResult] (see the sketch after this list).
  • Refine the Execution API to support the meta-updating optimization of Store complete meta on worker and update supervisor meta via fetching from workers #2912.
    • The Execution API returns a collection of ExecutionChunkResult objects containing the meta to be updated.
    • The task supervisor updates the meta from the ExecutionChunkResult.
  • Add a new Execution API Fetcher to fetch data from different execution backends.
  • Refine the session backend.
    • Before - the backend is bound to the session cls:
      • oscar - _IsolatedSession
      • test - CheckedSession
    • After - the backend names the execution backend:
      • mars - Mars, Mars on Ray
      • ray - Mars on Ray DAG
  • Pin xgboost < 1.6.0 because the latest xgboost breaks the API.
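
To make the Execution API change concrete, here is a minimal sketch, not the actual Mars code: apart from execute_subtask_graph and ExecutionChunkResult, the names (run_stage, executor, meta_api.set_chunk_meta, the field layout) are simplified placeholders that only illustrate how a task supervisor could consume the Dict[Chunk, ExecutionChunkResult] result to update chunk meta.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ExecutionChunkResult:
    meta: Dict[str, Any]  # chunk meta produced by the execution backend
    context: Any          # backend-specific handle, e.g. a Ray ObjectRef


async def run_stage(executor, stage_id, subtask_graph, chunk_graph, meta_api):
    # execute_subtask_graph now returns results keyed by result chunk instead of
    # a plain list, so the supervisor can update meta per chunk directly.
    chunk_to_result: Dict[Any, ExecutionChunkResult] = await executor.execute_subtask_graph(
        stage_id, subtask_graph, chunk_graph
    )
    for chunk, result in chunk_to_result.items():
        # set_chunk_meta stands in for the meta service update call.
        await meta_api.set_chunk_meta(chunk, **result.meta)
    return chunk_to_result
```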

Related issue number

#2893

Fixes #xxxx

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@fyrestone fyrestone self-assigned this Apr 15, 2022
@fyrestone fyrestone marked this pull request as ready for review April 20, 2022 04:32
# The client will go to the ray server first, then the ray server will make ray calls to other
# actors, so the ray server needs to register ray serializers.
# TODO: Need a way to check whether the current process is a ray server.
register_ray_serializers()
Contributor

If the ray client server doesn't register_ray_serializers, Mars object serialization may get errors?

Contributor

Especially for Ray ObjectRef and ActorHandle, since they are serialized by the pickler.

Contributor Author
@fyrestone fyrestone Apr 20, 2022

> If the ray client server doesn't register_ray_serializers, Mars object serialization may get errors?

Currently, no test case covers this code. I believe these lines fix some issues, but the current implementation is too rough.

I think we can find a better solution after this PR. For example,

  • Call register_ray_serializers() only when necessary (a possible shape is sketched below).
  • Add some test cases to cover this fix.
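
A possible shape for the first point, sketched purely as an illustration: the import path of register_ray_serializers and the "only when necessary" condition are assumptions, and there is still no documented way to detect "this process is the ray client server", hence the TODO above.

```python
import ray

# Assumed import path for the helper referenced above.
from mars.serialization.ray import register_ray_serializers

_registered = False


def ensure_ray_serializers():
    """Register the Ray serializers at most once, and only if Ray is initialized."""
    global _registered
    if not _registered and ray.is_initialized():
        register_ray_serializers()
        _registered = True
```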

Contributor
@chaokunyang chaokunyang Apr 20, 2022

If the ray client server is started in a different process from the driver, skipping register_ray_serializers in the client will cause serialization issues. The current test case test_ray.py::test_ray_client doesn't start a ray client server in another process, which is why removing it didn't throw an error.

Contributor

I think we can skip the register_ray_serializers issue in this PR and fix it in the next PR.

Contributor Author

Thanks. I will create a new PR with fixes and test cases.

Collaborator
@qinxuye qinxuye left a comment

I only have a doubt about backend and execution_backend: is it possible to unify them? Other parts look good to me.

 ) -> ClientType:
     backend = backend or "oscar"
     session = await _new_session(
-        cluster.external_address, backend=backend, default=True, timeout=timeout
+        cluster.external_address,
+        backend=backend,
Collaborator

Now it's weird to see that the backend here actually takes no effect; both Mars and Ray pass backend with the value oscar. I wonder if we can unify backend and execution_backend?

Contributor Author

So we can reuse backend for the Ray execution backend? e.g.

  • backend == 'oscar' for Mars & Mars on Ray
  • backend == 'ray' for Mars on Ray DAG

Collaborator

Maybe just name it mars for Mars and ray for Mars on Ray?

Contributor Author

But the backend is bound to the session cls, while the execution backend is only for the execution implementation. I am not sure it is a good idea to merge them into one.

Currently,

  • backend
    • oscar - _IsolatedSession
    • test - CheckedSession
  • execution backend
    • mars - Mars, Mars on Ray
    • ray - Mars on Ray DAG

I can merge them into one:

  • backend
    • mars - Mars, Mars on Ray
    • ray - Mars on Ray DAG

The backend will no longer be bound to the session cls.
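
For illustration, assuming the unified naming lands as described and new_session keeps its backend argument, picking the execution path would look roughly like this (all other new_session arguments are elided):

```python
import mars

# Mars, or Mars on Ray: the existing oscar-based execution
session = mars.new_session(backend="mars")

# Mars on Ray DAG: subtask graphs are converted to Ray DAGs for execution
session = mars.new_session(backend="ray")
```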

Collaborator

I think that's OK: backend is for the different execution backends, not for the session cls.

Contributor Author

I will push a commit to fix this. Thanks.

Contributor
@hekaisheng hekaisheng left a comment

I left two questions.

mars/deploy/oscar/session.py (resolved)
mars/services/task/supervisor/processor.py (resolved)
Collaborator
@qinxuye qinxuye left a comment

LGTM

Contributor
@hekaisheng hekaisheng left a comment

LGTM

@hekaisheng hekaisheng merged commit 9cfcc94 into mars-project:master Apr 24, 2022