Fault tolerance for actor creation #3422
Conversation
Looks good to me other than a couple minor comments.
test/component_failures_test.py
"num_heartbeats_timeout": 10 | ||
}) | ||
} | ||
# Start with 4 workers and 4 cores. |
Should this say "Start cluster with 4 worker nodes, each with 8 cores."?
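For reference, a minimal sketch of the setup this comment describes, using `ray.cluster_utils.Cluster`. The helper and keyword names here are assumptions; the actual test may start the cluster through a different internal API and pass the heartbeat config differently.

```python
# Illustrative sketch only (assumed API): a head node plus 4 worker nodes,
# each with 8 cores, with a short heartbeat timeout (as in the snippet above)
# so that node failures are detected quickly.
import json

import ray
from ray.cluster_utils import Cluster

internal_config = json.dumps({"num_heartbeats_timeout": 10})

cluster = Cluster()
head = cluster.add_node(num_cpus=0, _internal_config=internal_config)
for _ in range(4):
    cluster.add_node(num_cpus=8, _internal_config=internal_config)

# "address" vs. "redis_address" depends on the Ray version.
ray.init(address=cluster.address)
```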
src/ray/raylet/node_manager.cc
@@ -541,6 +541,13 @@ void NodeManager::HandleActorStateTransition(const ActorID &actor_id,
<< " already removed from the lineage cache. This is most "
"likely due to reconstruction.";
}
// Maintain the invariant that if a task is in the | |||
// MethodsWaitingForActorCreation queue, then it is subscribed to its | |||
// respective actor creation task. Since the actor location is now known, |
The invariant is not just that it is subscribed to its actor creation task, but also that the ONLY task it is subscribed to is its actor creation task, right? Can you add that to the comment?
Yes, thanks!
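To make the invariant concrete, here is a small, purely illustrative Python model; this is not Ray's actual C++ implementation in node_manager.cc, and all names below are made up. While a method task sits in the MethodsWaitingForActorCreation queue, the only task it is subscribed to is its actor's creation task, and that subscription is dropped once the actor's location becomes known.

```python
# Hypothetical model of the invariant discussed above; not Ray code.
class WaitingMethodQueueModel:
    def __init__(self):
        # task_id -> actor creation task it is waiting on
        self.waiting_for_actor_creation = {}
        # task_id -> set of tasks it is subscribed to
        self.subscriptions = {}

    def queue_method(self, task_id, actor_creation_task_id):
        # On entry to the queue, subscribe to exactly the creation task.
        self.waiting_for_actor_creation[task_id] = actor_creation_task_id
        self.subscriptions[task_id] = {actor_creation_task_id}

    def on_actor_location_known(self, task_id):
        # The actor location is now known, so unsubscribe from the creation
        # task and remove the method from the waiting queue.
        creation_task = self.waiting_for_actor_creation.pop(task_id)
        self.subscriptions[task_id].discard(creation_task)

    def invariant_holds(self, task_id):
        # Waiting methods are subscribed to their creation task and nothing else.
        if task_id in self.waiting_for_actor_creation:
            return self.subscriptions[task_id] == {
                self.waiting_for_actor_creation[task_id]
            }
        return True
```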
src/ray/raylet/node_manager.cc
// The actor has not yet been created and may have failed. To make sure
// that the actor is eventually recreated, we maintain the invariant that
// if a task is in the MethodsWaitingForActorCreation queue, then it is
// subscribed to its respective actor creation task. Once the actor has
Double space -> single space
Test FAILed.
# to nodes that then failed.
ready, _ = ray.wait(
    children_out, num_returns=len(children_out), timeout=30000)
assert len(ready) == len(children_out)
I'm seeing this fail with
> assert len(ready) == len(children_out)
E assert 75 == 100
Maybe the timeout is too short? How about we do it without a timeout?
Hmm, I prefer to keep it with a timeout, since this test will hang if it doesn't pass. If it hangs, we won't be able to get the stderr. I can increase the timeout to something much longer, though.
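A hedged sketch of the two options being weighed. Note that in the Ray version under review the `ray.wait` timeout appears to be in milliseconds (newer versions take seconds), so the numbers below are assumptions.

```python
import ray

def wait_for_all(children_out, timeout_ms=5 * 60 * 1000):
    # Option discussed above: keep a generous timeout so that a failing run
    # still reaches the assertion (and prints stderr) instead of hanging.
    ready, _ = ray.wait(
        children_out, num_returns=len(children_out), timeout=timeout_ms)
    assert len(ready) == len(children_out)

# Alternative raised in the review: drop the timeout entirely, which blocks
# until every object is ready but can hang the whole test run on failure.
# ready, _ = ray.wait(children_out, num_returns=len(children_out))
```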
Both travis 2.7 builds are hung at test/actor_test.py::test_actor_multiple_gpus_from_multiple_tasks?
I believe those have been hanging on master but I haven't seen the Java test hang before.
On Thu, Nov 29, 2018, 10:25 AM Eric Liang wrote:
Both travis 2.7 builds are hung at
test/actor_test.py::test_actor_multiple_gpus_from_multiple_tasks
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated
?
@stephanie-wang @ericl this was merged, but it's failing the Java tests. Those tests never fail, so it's probably related to this PR. Did you look into this?
What do these changes do?
Ray used to hang in the following scenario: an actor's creation task is submitted to a node that then fails, so the actor's location never becomes known to the tasks waiting on it.
The job hangs because reconstruction is never triggered for the actor creation task. This PR fixes the issue by notifying the backend that tasks for actors whose locations are unknown depend on the actor creation task. This will trigger reconstruction if the actor creation task failed.
This PR does not handle suppression for actor creation, which can happen if task lease or actor table notifications are delayed significantly.
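For context, a hedged reproduction sketch of the failure mode described above, based on the test discussed in this PR. The cluster helper, parameter names, and class names are illustrative assumptions, not the PR's actual test code.

```python
import ray
from ray.cluster_utils import Cluster

# Head node plus one worker node that will be killed.
cluster = Cluster()
head = cluster.add_node(num_cpus=0)
ray.init(address=cluster.address)  # older Ray versions use redis_address instead
worker = cluster.add_node(num_cpus=1)

@ray.remote
class Child:
    def ping(self):
        return 1

child = Child.remote()        # actor creation task lands on the worker node
cluster.remove_node(worker)   # node dies before the actor's location is known
cluster.add_node(num_cpus=1)  # bring back capacity so the actor can be recreated

# Before this PR, reconstruction of the actor creation task was never
# triggered and this call could hang; with the fix, the waiting method task
# subscribes to the actor creation task, which triggers reconstruction.
ray.get(child.ping.remote())
```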