[Ray] Fix ray worker failover #3080

chaokunyang
Contributor

What do these changes do?

On the Ray master branch, if an actor created with max_restarts=-1 is restarting, calling an actor method raises an exception instead of pending in the caller. This PR fixes that by specifying max_retries=-1 when querying actor state.
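
For readers unfamiliar with the convention, here is a minimal, Ray-free sketch of what "retry forever" versus "fail immediately" means for a state query against a restarting actor. It is only an illustration: `FlakyActor`, `ActorRestarting`, and `call_with_retries` are hypothetical names that do not appear in Mars or Ray, and the only thing taken from the description above is the idea that `-1` means "no retry limit".

```python
import itertools


class ActorRestarting(Exception):
    """Hypothetical error raised while the simulated actor is restarting."""


class FlakyActor:
    """Simulated actor that is unavailable for its first `down_calls` calls."""

    def __init__(self, down_calls):
        self.down_calls = down_calls
        self.calls = 0

    def state(self):
        self.calls += 1
        if self.calls <= self.down_calls:
            raise ActorRestarting("actor is restarting")
        return "ALIVE"


def call_with_retries(fn, max_retries=0):
    """Call fn, retrying whenever it raises ActorRestarting.

    max_retries=-1 retries without limit (the convention assumed here);
    max_retries=0 makes a single attempt and propagates the first error.
    """
    attempts = itertools.count() if max_retries < 0 else range(max_retries + 1)
    last_error = None
    for _ in attempts:
        try:
            return fn()
        except ActorRestarting as exc:
            last_error = exc
    raise last_error
```

With `max_retries=0` the caller sees the exception as soon as the actor is restarting; with `max_retries=-1` the call keeps going until the actor is back. A real implementation would wait between attempts rather than spin.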

Related issue number

Fixes #3079

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@chaokunyang chaokunyang requested a review from a team as a code owner May 24, 2022 08:19
@chaokunyang chaokunyang force-pushed the support_worker_failover_for_ray_master branch 12 times, most recently from 25acb25 to e7ecf8b Compare May 28, 2022 03:06
@chaokunyang chaokunyang changed the title [Ray] support ray worker failover for ray master [Ray] Fix ray worker failover May 28, 2022
@chaokunyang chaokunyang force-pushed the support_worker_failover_for_ray_master branch from 787dec6 to f6f33ca Compare May 28, 2022 05:34
Collaborator

@qinxuye qinxuye left a comment

LGTM

@qinxuye qinxuye added type: bug Something isn't working mod: actor to be backported Indicate that the PR need to be backported to stable branch mod: ray integration labels May 28, 2022
@qinxuye qinxuye added this to the v0.10.0a1 milestone May 28, 2022
Contributor

@zhongchun zhongchun left a comment

LGTM.

@chaokunyang chaokunyang merged commit 0263954 into mars-project:master May 28, 2022
qinxuye pushed a commit to qinxuye/mars that referenced this pull request Jun 6, 2022
* make failover work with latest ray master

* fix max_task_retries

* fix _get_actor

* fix compatibility

* fix retry actor state task

* fix subpool restart

* skip test_ownership_when_scale_in

* revert alive check interval

* lint

* lint

(cherry picked from commit 0263954)
@qinxuye qinxuye added backported already PR has been backported and removed to be backported Indicate that the PR need to be backported to stable branch labels Jun 6, 2022
Labels
backported already PR has been backported mod: actor mod: ray integration type: bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] Mars worker not recovered on ray master
3 participants