
Fault tolerance for actor creation #3422

Merged: 6 commits into ray-project:master from actor-creation-hanging on Nov 29, 2018

Conversation

stephanie-wang (Contributor)

What do these changes do?

Ray used to hang in the following scenario:

  1. Node N1 forwards an actor creation task to node N2.
  2. N2 dies.
  3. N1 submits an actor task. The actor's location is unknown, so the task gets queued. The actor creation task never gets scheduled, so the actor task remains queued forever.

The job hangs because reconstruction is never triggered for the actor creation task. This PR fixes the issue by notifying the backend that tasks for actors whose locations are unknown depend on the actor creation task. This will trigger reconstruction if the actor creation task failed.

This PR does not handle suppression for actor creation, which can happen if task lease or actor table notifications are delayed significantly.
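
To make the failure mode concrete from the application's point of view, here is a minimal, hypothetical sketch (the Counter class and single method call are illustrative, not the test added in this PR); it assumes a multi-node cluster in which the node that received the actor creation task dies before the actor starts:

import ray

ray.init()  # in the failing scenario, this connects to a multi-node cluster

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# The actor creation task may be forwarded to another node (N2). Before this
# PR, if N2 died before running it, the method call below stayed queued
# forever, because reconstruction of the creation task was never triggered.
# With this PR, the queued method is subscribed to the creation task, so the
# failure is detected and the actor creation task is reconstructed.
counter = Counter.remote()
print(ray.get(counter.increment.remote()))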

@robertnishihara (Collaborator) left a comment

Looks good to me other than a couple minor comments.

"num_heartbeats_timeout": 10
})
}
# Start with 4 workers and 4 cores.
Collaborator:

Should this say "Start cluster with 4 worker nodes, each with 8 cores."?
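
For context, a rough sketch of the "start cluster with 4 worker nodes, each with 8 cores" setup under discussion; the Cluster import path and keyword arguments below are best-effort assumptions about the Ray test utilities of this era, not lines taken from the diff:

import json

from ray.test.cluster_utils import Cluster

# Assumed API: start a head node, then add 4 worker nodes with 8 CPUs each,
# using a short heartbeat timeout so node failures are detected quickly.
internal_config = json.dumps({"num_heartbeats_timeout": 10})
cluster = Cluster(initialize_head=True, connect=True)
for _ in range(4):
    cluster.add_node(num_cpus=8, _internal_config=internal_config)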

@@ -541,6 +541,13 @@ void NodeManager::HandleActorStateTransition(const ActorID &actor_id,
<< " already removed from the lineage cache. This is most "
"likely due to reconstruction.";
}
// Maintain the invariant that if a task is in the
// MethodsWaitingForActorCreation queue, then it is subscribed to its
// respective actor creation task. Since the actor location is now known,
Collaborator:

The invariant is not just that it is subscribed to its actor creation task, but also that the ONLY task it is subscribed to is its actor creation task, right? Can you add that to the comment?

Contributor Author:

Yes, thanks!
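
To restate the agreed-upon invariant in executable form, here is a hypothetical Python sketch (the real bookkeeping lives in the C++ NodeManager; the maps and names below are illustrative only):

def check_waiting_methods_invariant(waiting_methods, subscriptions,
                                    creation_task_of_actor):
    # waiting_methods: iterable of (task_id, actor_id) pairs for tasks in the
    #     MethodsWaitingForActorCreation queue.
    # subscriptions: dict mapping a task_id to the set of task ids it is
    #     subscribed to for failure/reconstruction notifications.
    # creation_task_of_actor: dict mapping an actor_id to the id of its
    #     actor creation task.
    for task_id, actor_id in waiting_methods:
        # Each queued method is subscribed to its actor creation task and to
        # nothing else.
        assert subscriptions[task_id] == {creation_task_of_actor[actor_id]}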

// The actor has not yet been created and may have failed. To make sure
// that the actor is eventually recreated, we maintain the invariant that
// if a task is in the MethodsWaitingForActorCreation queue, then it is
// subscribed to its respective actor creation task. Once the actor has
Collaborator:

Double space -> single space

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9635/

# to nodes that then failed.
ready, _ = ray.wait(
    children_out, num_returns=len(children_out), timeout=30000)
assert len(ready) == len(children_out)
Collaborator:

I'm seeing this fail with

>               assert len(ready) == len(children_out)
E               assert 75 == 100

Maybe the timeout is too short? How about we do it without a timeout?

@stephanie-wang (Contributor Author), Nov 28, 2018:

Hmm, I prefer to keep the timeout, since this test will hang if it doesn't pass, and if it hangs we won't be able to get the stderr. I can increase the timeout to something much longer, though.
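
A hedged illustration of the trade-off discussed here (not the change that was committed): a longer timeout still surfaces a genuine failure as a readable assertion error with the leftover object IDs, whereas dropping the timeout would let the test hang with no output until CI kills it. The timeout unit follows the snippet above, which appears to take milliseconds in the Ray version of this PR (timeout=30000 for 30 seconds):

# Wait up to 60 seconds (60000 ms here) for every child task to finish.
ready, unready = ray.wait(
    children_out, num_returns=len(children_out), timeout=60000)
# Fail loudly with the unfinished IDs so the stderr is useful, instead of
# hanging until CI terminates the build.
assert len(unready) == 0, "Unfinished tasks: {}".format(unready)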

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9636/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9638/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9662/

@ericl (Contributor) commented Nov 29, 2018

Both Travis 2.7 builds are hung at

test/actor_test.py::test_actor_multiple_gpus_from_multiple_tasks 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

?

@stephanie-wang (Contributor Author) commented Nov 29, 2018 via email

@ericl ericl merged commit 48a5935 into ray-project:master Nov 29, 2018
@robertnishihara robertnishihara deleted the actor-creation-hanging branch November 29, 2018 21:26
@robertnishihara (Collaborator)

@stephanie-wang @ericl this was merged, but it's failing the Java tests. Those tests never fail, so it's probably related to this PR. Did you look into this?
