[Core] Reconstruct actor to run lineage reconstruction triggered actor task #47396

jjyao · 2024-08-29T02:04:00Z

Why are these changes needed?

Currently, if we need to rerun an actor task to recover a lost object but the actor is dead, the actor task fails immediately. This PR allows the actor to be restarted (as long as doing so doesn't violate max_restarts) so that the actor task can run and recover the lost objects.

In terms of the state machine, we add a state transition from DEAD to RESTARTING.
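
As a rough illustration of the transition described above, here is a minimal toy model in plain Python. It is not Ray's implementation: `ToyActor`, `ActorState`, and all method names are hypothetical, and the sketch assumes that a restart triggered by lineage reconstruction is counted against the same `max_restarts` budget, with `-1` meaning unlimited restarts.

```python
from enum import Enum, auto


class ActorState(Enum):
    ALIVE = auto()
    RESTARTING = auto()
    DEAD = auto()


class ToyActor:
    """Hypothetical model of an actor lifecycle; not Ray's actual code."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts  # -1 is treated as "unlimited"
        self.num_restarts = 0
        self.state = ActorState.ALIVE

    def mark_dead(self) -> None:
        self.state = ActorState.DEAD

    def restart_for_lineage_reconstruction(self) -> bool:
        """Attempt DEAD -> RESTARTING; return False if it is not allowed."""
        if self.state is not ActorState.DEAD:
            return False
        # Assumption: the restart budget is shared with ordinary restarts.
        if self.max_restarts != -1 and self.num_restarts >= self.max_restarts:
            return False
        self.num_restarts += 1
        self.state = ActorState.RESTARTING
        return True

    def mark_alive(self) -> None:
        # Only a restarting actor can come back alive in this toy model.
        if self.state is ActorState.RESTARTING:
            self.state = ActorState.ALIVE
```

In this sketch, a dead actor with remaining budget moves to RESTARTING so that the resubmitted task has somewhere to run; once the budget is exhausted the actor stays DEAD and the task would fail as before.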

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

rkooo567

Entire PR was reviewed privately

With #47396, Ray Core can now automatically restart an actor when it needs to resubmit tasks. This doesn't work with `ray.kill`-ed actors. This PR removes ray.kill and lets ref counting garbage-collect the actors.

Signed-off-by: Hao Chen <chenh1024@gmail.com>

…r task (ray-project#47396) Currently if we need to rerun an actor task to recover a lost object but the actor is dead, the actor task will fail immediately. This PR allows the actor to be restarted (if it doesn't violate max_restarts) so that the actor task can run to recover lost objects. In terms of the state machine, we add a state transition from DEAD to RESTARTING. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>

With ray-project#47396, Ray Core can now automatically restart an actor when it needs to resubmit tasks. This doesn't work with `ray.kill`-ed actors. This PR removes ray.kill and lets ref counting garbage-collect the actors.

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>

Change WaitForActorOutOfScope from long polling to push

da00fc5

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao added the `go` label (add ONLY when ready to merge, run all tests) Aug 29, 2024

jjyao added 5 commits September 9, 2024 04:31

Revert "Change WaitForActorOutOfScope from long polling to push"

a71d0ca

This reverts commit da00fc5.

Merge branch 'master' of github.com:ray-project/ray into jjyao/pollling

0b5396b

Reconstruct actor to run lineage reconstruction triggered actor task

8d751c5

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

Revert "Reconstruct actor to run lineage reconstruction triggered act…

cc7ab4c

…or task" This reverts commit 8d751c5.

up

06bee5e

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao changed the title ~~[Core] Change WaitForActorOutOfScope from long polling to push~~ [Core] Reconstruct actor to run lineage reconstruction triggered actor task Sep 9, 2024

jjyao marked this pull request as ready for review September 9, 2024 17:02

jjyao requested review from a team, ericl, pcmoritz and raulchen as code owners September 9, 2024 17:02

jjyao assigned rkooo567 Sep 9, 2024

jjyao added 2 commits September 9, 2024 10:25

Merge branch 'master' of github.com:ray-project/ray into jjyao/pollling

74b3cca

Fix merge conflicts

9bcf56f

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

rkooo567 approved these changes Sep 9, 2024

jjyao merged commit 0773760 into ray-project:master Sep 10, 2024
4 of 5 checks passed

jjyao deleted the jjyao/pollling branch September 10, 2024 00:39

raulchen mentioned this pull request Sep 19, 2024

[data] Remove ray.kill in ActorPoolMapOperator #47752

Merged

8 tasks
