[autoscaler] Allow more than 5s from node creation to first heartbeat #3385
Test FAILed.
python/ray/autoscaler/autoscaler.py
self.provider.internal_ip(node_id), 0)
key = self.provider.internal_ip(node_id)
if key not in self.load_metrics.last_heartbeat_time_by_ip:
    self.load_metrics.last_heartbeat_by_ip = time.time()
This line should read `self.load_metrics.last_heartbeat_time_by_ip[key] = time.time()` (the attribute name is wrong, and you need the `[key]` subscript).
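To illustrate why the original line fails silently, here is a minimal sketch. `LoadMetrics` below is a hypothetical stand-in that holds only a dict, not the real Ray class, and the IP address is made up. Assigning to a misspelled attribute does not raise; Python simply creates a new attribute, so the dict is never populated and the lookup on the following line would raise a `KeyError`.

```python
import time


class LoadMetrics:
    """Hypothetical stand-in for the real class; it only holds the dict."""

    def __init__(self):
        self.last_heartbeat_time_by_ip = {}


metrics = LoadMetrics()
key = "10.0.0.1"  # made-up IP for illustration

# Line as originally written: wrong attribute name and no subscript.
# Python accepts this and creates a brand-new float attribute.
metrics.last_heartbeat_by_ip = time.time()
dict_still_empty = key not in metrics.last_heartbeat_time_by_ip

# Line as suggested in the review: the entry lands in the dict.
metrics.last_heartbeat_time_by_ip[key] = time.time()
dict_has_key = key in metrics.last_heartbeat_time_by_ip
```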
key = self.provider.internal_ip(node_id)
if key not in self.load_metrics.last_heartbeat_time_by_ip:
    self.load_metrics.last_heartbeat_time_by_ip[key] = time.time()
last_heartbeat_time = self.load_metrics.last_heartbeat_time_by_ip[key]
So this used to default to `0`, which means that `delta` (below) would be enormous and then the autoscaler would attempt to restart the node, right?
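A rough sketch of the difference being asked about, under stated assumptions: `seconds_since_heartbeat`, `HEARTBEAT_TIMEOUT_S`, and the IP address are hypothetical names and values chosen for illustration, not taken from the autoscaler. With a default of `0`, the delta for a node that has not yet sent a heartbeat is the entire Unix timestamp, which exceeds any sensible timeout. Seeding the entry with the current time starts the clock at zero instead.

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # hypothetical threshold, not taken from the PR


def seconds_since_heartbeat(last_heartbeat_time_by_ip, ip, now, seed_missing):
    """Return how long ago `ip` last sent a heartbeat, as of `now`."""
    if seed_missing:
        # New behaviour: a node with no recorded heartbeat is treated as if
        # it had just sent one, so its timeout clock starts now.
        last_heartbeat_time_by_ip.setdefault(ip, now)
        last = last_heartbeat_time_by_ip[ip]
    else:
        # Old behaviour: fall back to 0 (the Unix epoch).
        last = last_heartbeat_time_by_ip.get(ip, 0)
    return now - last


now = time.time()
old_delta = seconds_since_heartbeat({}, "10.0.0.1", now, seed_missing=False)
new_delta = seconds_since_heartbeat({}, "10.0.0.1", now, seed_missing=True)

# A freshly created node looks dead under the old default, alive under the new one.
old_would_restart = old_delta > HEARTBEAT_TIMEOUT_S
new_would_restart = new_delta > HEARTBEAT_TIMEOUT_S
```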
@ericl
Sure, if those can be surfaced to Python.

On Mon, Nov 26, 2018, 5:26 PM Robert Nishihara wrote:
> @ericl monitor.cc already makes decisions about when a node should be considered dead. Wouldn't it make sense to just use those decisions? Instead of also having the autoscaler make that decision?
What do these changes do?
I think the spurious restarts are due to a race condition between the point where we first mark a node as active and the point where we check for restarts. A node's state can change in the background between these two checks, which would result in a spurious restart.
Related issue number
Closes #3361