Is your feature request related to a problem? Please describe.
Currently, the `mars on ray` cluster initialization process is as follows:
- Create a `RayMainPool` actor for the supervisor.
- Start the oscar service for the supervisor.
- Create the workers:
  - Create a `RayMainPool` actor for each worker.
  - Each `RayMainPool` actor creates its `RaySubPool` actors.
  - Start the oscar service for the workers.
This process has some issues:
- When a main pool fails and is restarted by Ray, it creates new `RaySubPool` actors instead of restarting the previous `RaySubPool` actors, which makes actor management tricky and failed processes hard to track.
- Ray currently creates the main pool first, and the main pool then creates its sub pools. Because `import mars` takes 2~3 s and the two steps run one after the other, creating a worker takes about twice that long.
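To make the second issue concrete, here is a minimal simulation of the nested startup order, using only the standard library. It is not Mars or Ray code: `IMPORT_COST`, `start_sub_pool`, `start_main_pool_nested` and `measure_nested` are hypothetical names, and `asyncio.sleep` stands in for the `import mars` cost, scaled down so the example runs quickly.

```python
import asyncio
import time

IMPORT_COST = 0.05  # stand-in for the 2~3 s `import mars` cost, scaled down


async def start_sub_pool(name: str) -> str:
    # Each new actor process pays the import cost once.
    await asyncio.sleep(IMPORT_COST)
    return name


async def start_main_pool_nested(name: str, n_sub: int) -> list[str]:
    # The main pool has to finish its own import before it can spawn sub pools.
    await asyncio.sleep(IMPORT_COST)
    # Sub pools start in parallel with each other, but only after the main pool.
    subs = await asyncio.gather(
        *(start_sub_pool(f"{name}/sub-{i}") for i in range(n_sub))
    )
    return [name, *subs]


def measure_nested(n_sub: int = 4) -> tuple[list[str], float]:
    # Returns the started pool names and the wall-clock startup time in seconds.
    start = time.perf_counter()
    pools = asyncio.run(start_main_pool_nested("worker-0", n_sub))
    return pools, time.perf_counter() - start


if __name__ == "__main__":
    pools, elapsed = measure_nested()
    print(f"{len(pools)} pools ready after {elapsed / IMPORT_COST:.1f}x import cost")
```

Because the sub pools cannot start until the main pool is up, the critical path contains two import delays, no matter how many sub pools there are.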
Describe the solution you'd like
It would be better to create all Ray main and sub pool actors simultaneously, and then initialize all Mars services in those actors. This way, Mars cluster initialization time would be reduced to roughly half of what it is now.
In addition, if a Ray main pool fails and is restarted, the main pool and sub pools will still be the same actors as before.
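A rough sketch of the proposed flow, again as a standard-library simulation rather than actual Mars or Ray code. It assumes pool names can be derived up front, so every actor can be created at once and looked up again after a restart, roughly analogous to named actors in Ray. `REGISTRY`, `pool_names`, `start_actor`, `create_cluster` and `restart_main_pool` are all hypothetical names introduced for illustration.

```python
import asyncio
import time

IMPORT_COST = 0.05  # stand-in for the 2~3 s `import mars` cost, scaled down

# Hypothetical registry standing in for a named-actor table:
# actor name -> number of times the actor was (re)started.
REGISTRY: dict[str, int] = {}


def pool_names(worker: str, n_sub: int) -> list[str]:
    # Names are fixed up front, so they do not depend on who creates the actor.
    return [worker] + [f"{worker}/sub-{i}" for i in range(n_sub)]


async def start_actor(name: str) -> str:
    await asyncio.sleep(IMPORT_COST)  # all actors pay the import cost in parallel
    REGISTRY[name] = REGISTRY.get(name, 0) + 1
    return name


async def create_cluster(worker: str, n_sub: int) -> list[str]:
    # Step 1: create the main pool and all sub pools at the same time.
    names = await asyncio.gather(*(start_actor(n) for n in pool_names(worker, n_sub)))
    # Step 2: service initialization would run here, once every actor exists.
    return list(names)


async def restart_main_pool(worker: str, n_sub: int) -> list[str]:
    # Only the failed main pool is restarted; sub pools are found by name.
    await start_actor(worker)
    return [n for n in pool_names(worker, n_sub) if n in REGISTRY]


if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(create_cluster("worker-0", 4))
    elapsed = time.perf_counter() - start
    print(f"cluster ready after {elapsed / IMPORT_COST:.1f}x import cost")
```

In this model the import delays overlap instead of adding up, and a restarted main pool sees the same sub pool names as before rather than creating new ones.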