This repository has been archived by the owner on Feb 3, 2021. It is now read-only.
I'm using the SDK (v0.8.0) to spin up an AZTK cluster. I'm also using a custom docker image, and on one occasion I forgot to pass the docker registry credentials, which led to all node start tasks failing.
I would expect that in this case, wait_until_cluster_is_ready should either time out after failing to bring up a master node within WAIT_FOR_MASTER_TIMEOUT seconds, or notice that the master start task failed. Unfortunately, neither happens, and cluster spin-up hangs indefinitely.
Presumably this is because this loop never terminates, as this line is always run. Maybe if the master start task fails, a master_node_id is never given to the cluster, so it gets stuck there?
Any idea if this is the case? Thank you for the help.
You are correct: if all start tasks fail early enough, a master will never be elected (so no master_node_id will be set), and that loop will hang. I think the best solution here might be to check whether all nodes have entered StartTaskFailed and, if so, exit. Adding a timeout is another good option.
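For illustration, here is a minimal sketch of how the two options could be combined in a wait loop. It is not the AZTK implementation: wait_for_master, MasterElectionError, and the two callables are hypothetical names, and the "starttaskfailed" state string is an assumption about how the node state would be reported.

```python
import time


class MasterElectionError(RuntimeError):
    """Raised when no master can ever be elected (hypothetical name)."""


def wait_for_master(get_master_node_id, get_node_states, timeout_s,
                    poll_interval_s=5.0, failed_state="starttaskfailed"):
    """Poll until a master node id appears, every node has failed, or time runs out.

    get_master_node_id: callable returning the master node id, or None if unset.
    get_node_states: callable returning an iterable of node state strings.
    Both callables are stand-ins for whatever the SDK exposes (assumption).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        master_node_id = get_master_node_id()
        if master_node_id is not None:
            return master_node_id

        # Exit early: if every node's start task failed, no master will be elected.
        states = list(get_node_states())
        if states and all(state == failed_state for state in states):
            raise MasterElectionError(
                "all node start tasks failed; no master can be elected")

        # Safety net for any other situation in which a master never appears.
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no master node elected within {timeout_s} seconds")

        time.sleep(poll_interval_s)
```

A caller could pass WAIT_FOR_MASTER_TIMEOUT as timeout_s and surface either exception to the user instead of hanging.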