
Fault tolerance for actor creation #3422

Merged: 6 commits into ray-project:master from actor-creation-hanging on Nov 29, 2018

Conversation

stephanie-wang (Contributor)

What do these changes do?

Ray used to hang in the following scenario:

  1. Node N1 forwards an actor creation task to node N2.
  2. N2 dies.
  3. N1 submits an actor task. The actor's location is unknown, so the task gets queued. The actor creation task never gets scheduled, so the actor task remains queued forever.

The job hangs because reconstruction is never triggered for the actor creation task. This PR fixes the issue by notifying the backend that tasks for actors whose locations are unknown depend on the actor creation task. This will trigger reconstruction if the actor creation task failed.

This PR does not handle suppression for actor creation, which can happen if task lease or actor table notifications are delayed significantly.
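
To make the failure mode concrete from the application's point of view, here is a minimal, hypothetical sketch (the Counter class and single method call are illustrative, not the test added in this PR); it assumes a multi-node cluster in which the node that received the actor creation task dies before the actor starts:

import ray

ray.init()  # in the failing scenario, this connects to a multi-node cluster

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# The actor creation task may be forwarded to another node (N2). Before this
# PR, if N2 died before running it, the method call below stayed queued
# forever, because reconstruction of the creation task was never triggered.
# With this PR, the queued method is subscribed to the creation task, so the
# failure is detected and the actor creation task is reconstructed.
counter = Counter.remote()
print(ray.get(counter.increment.remote()))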

@robertnishihara (Collaborator) left a comment

Looks good to me other than a couple minor comments.

"num_heartbeats_timeout": 10
})
}
# Start with 4 workers and 4 cores.
Collaborator:

Should this say "Start cluster with 4 worker nodes, each with 8 cores."?
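
For context, a rough sketch of the "start cluster with 4 worker nodes, each with 8 cores" setup under discussion; the Cluster import path and keyword arguments below are best-effort assumptions about the Ray test utilities of this era, not lines taken from the diff:

import json

from ray.test.cluster_utils import Cluster

# Assumed API: start a head node, then add 4 worker nodes with 8 CPUs each,
# using a short heartbeat timeout so node failures are detected quickly.
internal_config = json.dumps({"num_heartbeats_timeout": 10})
cluster = Cluster(initialize_head=True, connect=True)
for _ in range(4):
    cluster.add_node(num_cpus=8, _internal_config=internal_config)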

@@ -541,6 +541,13 @@ void NodeManager::HandleActorStateTransition(const ActorID &actor_id,
<< " already removed from the lineage cache. This is most "
"likely due to reconstruction.";
}
// Maintain the invariant that if a task is in the
// MethodsWaitingForActorCreation queue, then it is subscribed to its
// respective actor creation task. Since the actor location is now known,
Collaborator:

The invariant is not just that it is subscribed to its actor creation task, but also that the ONLY task it is subscribed to is its actor creation task, right? Can you add that to the comment?

Contributor Author:

Yes, thanks!
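
To restate the agreed-upon invariant in executable form, here is a hypothetical Python sketch (the real bookkeeping lives in the C++ NodeManager; the maps and names below are illustrative only):

def check_waiting_methods_invariant(waiting_methods, subscriptions,
                                    creation_task_of_actor):
    # waiting_methods: iterable of (task_id, actor_id) pairs for tasks in the
    #     MethodsWaitingForActorCreation queue.
    # subscriptions: dict mapping a task_id to the set of task ids it is
    #     subscribed to for failure/reconstruction notifications.
    # creation_task_of_actor: dict mapping an actor_id to the id of its
    #     actor creation task.
    for task_id, actor_id in waiting_methods:
        # Each queued method is subscribed to its actor creation task and to
        # nothing else.
        assert subscriptions[task_id] == {creation_task_of_actor[actor_id]}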

// The actor has not yet been created and may have failed. To make sure
// that the actor is eventually recreated, we maintain the invariant that
// if a task is in the MethodsWaitingForActorCreation queue, then it is
// subscribed to its respective actor creation task. Once the actor has
Collaborator:

Double space -> single space

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9635/

# to nodes that then failed.
ready, _ = ray.wait(
    children_out, num_returns=len(children_out), timeout=30000)
assert len(ready) == len(children_out)
Collaborator:

I'm seeing this fail with

>               assert len(ready) == len(children_out)
E               assert 75 == 100

Maybe the timeout is too short? How about we do it without a timeout?

@stephanie-wang (Contributor Author), Nov 28, 2018:

Hmm, I prefer to keep the timeout, since this test will hang if it doesn't pass, and if it hangs we won't be able to get the stderr. I can increase the timeout to something much longer, though.
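
A hedged illustration of the trade-off discussed here (not the change that was committed): a longer timeout still surfaces a genuine failure as a readable assertion error with the leftover object IDs, whereas dropping the timeout would let the test hang with no output until CI kills it. The timeout unit follows the snippet above, which appears to take milliseconds in the Ray version of this PR (timeout=30000 for 30 seconds):

# Wait up to 60 seconds (60000 ms here) for every child task to finish.
ready, unready = ray.wait(
    children_out, num_returns=len(children_out), timeout=60000)
# Fail loudly with the unfinished IDs so the stderr is useful, instead of
# hanging until CI terminates the build.
assert len(unready) == 0, "Unfinished tasks: {}".format(unready)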

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9636/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9638/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9662/

@ericl (Contributor) commented Nov 29, 2018

Both Travis 2.7 builds are hung at

test/actor_test.py::test_actor_multiple_gpus_from_multiple_tasks 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

?

@stephanie-wang (Contributor Author) commented Nov 29, 2018 via email

@ericl ericl merged commit 48a5935 into ray-project:master Nov 29, 2018
@robertnishihara robertnishihara deleted the actor-creation-hanging branch November 29, 2018 21:26
@robertnishihara (Collaborator)

@stephanie-wang @ericl this was merged, but it's failing the Java tests. Those tests never fail, so it's probably related to this PR. Did you look into this?
