[ray] new worker create model #2782

Closed
chaokunyang opened this issue Mar 3, 2022 · 0 comments · Fixed by #2783

Comments

Contributor

chaokunyang commented Mar 3, 2022

Is your feature request related to a problem? Please describe.
Currently, the initialization process for a Mars-on-Ray cluster is as follows:

  • create RayMainPool actor for supervisor
  • start oscar service for supervisor
  • create workers
    • create a RayMainPool Ray actor for each worker
    • the RayMainPool actor creates the RaySubPool actors
  • start oscar service for workers

This process has some issues:

  • When a main pool fails and is restarted by Ray, it creates new RaySubPool actors rather than restarting the previous RaySubPool actors. This makes actor management tricky and makes failed processes hard to track.
  • Ray currently creates the main pool first, and the main pool then creates the sub pools. Because `import mars` takes 2~3 s and the two steps run one after the other, creating a worker takes roughly twice as long.
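
The second issue can be illustrated with a small cost model. This is only a sketch: the 2.5 s figure is an assumed midpoint of the 2~3 s quoted above, the function names are hypothetical, and it assumes the sub pools of one worker start in parallel with each other.

```python
# Illustrative cost model only; the numbers are assumptions, not measurements.
IMPORT_MARS_SECONDS = 2.5  # assumed midpoint of the 2~3 s quoted for `import mars`


def nested_creation_time(import_cost: float) -> float:
    """Main pool starts first, then spawns its sub pools.

    Sub pools cannot begin importing until the main pool has finished,
    so the two import costs add up on the critical path.
    """
    main_pool = import_cost
    sub_pools = import_cost  # assumption: sub pools start in parallel with each other
    return main_pool + sub_pools


def simultaneous_creation_time(import_cost: float) -> float:
    """All main and sub pool actors are created at the same time.

    Every process imports concurrently, so the critical path is one import.
    """
    main_pool = import_cost
    sub_pools = import_cost
    return max(main_pool, sub_pools)


nested = nested_creation_time(IMPORT_MARS_SECONDS)
flat = simultaneous_creation_time(IMPORT_MARS_SECONDS)
print(f"nested: {nested:.1f}s, simultaneous: {flat:.1f}s")
# prints "nested: 5.0s, simultaneous: 2.5s"
```

Under these assumptions the nested flow pays the import cost twice in sequence, which matches the "double" worker creation time described above.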

Describe the solution you'd like
It would be better to create all Ray main/sub pool actors simultaneously and then initialize all Mars services inside those actors. This way, Mars cluster initialization time would be reduced to about half of what it is now.
In addition, if a Ray main pool fails, the restarted main pool and sub pools will be the same actors as before.
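
To make the identity argument concrete, here is a toy simulation in plain Python. It does not use Ray or Mars: `ToyActorTable`, `start_nested`, `start_flat`, and the `worker-0/sub-0` naming scheme are all hypothetical, and fixed names are only one possible way the "same actors as before" property could be obtained.

```python
import itertools


class ToyActorTable:
    """Hypothetical stand-in for a cluster's actor table (not a Ray API)."""

    def __init__(self):
        self.live = set()
        self._counter = itertools.count()

    def create_anonymous(self, prefix):
        # A new identity on every call, like a freshly spawned actor.
        actor_id = f"{prefix}-{next(self._counter)}"
        self.live.add(actor_id)
        return actor_id

    def get_or_create(self, name):
        # Deterministic identity: the same name always maps to the same entry.
        self.live.add(name)
        return name


def start_nested(table, n_sub):
    """Current flow: the main pool spawns its own sub pools whenever it starts."""
    return [table.create_anonymous("sub") for _ in range(n_sub)]


def start_flat(table, worker, n_sub):
    """Proposed flow: all pools are created up front under fixed names."""
    names = [f"{worker}/main"] + [f"{worker}/sub-{i}" for i in range(n_sub)]
    return [table.get_or_create(name) for name in names]


nested_table = ToyActorTable()
before = start_nested(nested_table, 2)
after = start_nested(nested_table, 2)  # simulates a main pool restart
print(before, after)  # prints "['sub-0', 'sub-1'] ['sub-2', 'sub-3']"

flat_table = ToyActorTable()
first = start_flat(flat_table, "worker-0", 2)
second = start_flat(flat_table, "worker-0", 2)  # a restart resolves the same names
print(first == second)  # prints "True"
```

In the nested flow each restart leaves a new set of sub pool identities to track, while in the flat flow the set of actors stays stable across restarts.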

@qinxuye qinxuye added this to the v0.9.0b2 milestone Mar 3, 2022
@qinxuye qinxuye removed this from the v0.9.0b2 milestone Mar 4, 2022
@wjsi wjsi closed this as completed in #2783 Mar 8, 2022