client: wait for batched driver updates before registering nodes #5585

notnoop · 2019-04-19T13:24:02Z

Here we retain the 0.8.7 behavior of waiting for driver fingerprints
before registering a node, with some timeout. This is needed for system
jobs: system jobs are scheduled for a node at node registration, so the
race can mean that a system job does not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but I'm raising it to 1 minute, as
that is closer to blocking indefinitely than 1 second is. We need to
keep the value high enough to capture as many drivers/devices as
possible, but low enough that it doesn't risk blocking too long due to a
misbehaving plugin.

Fixes #5579
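
To make the intent concrete, here is a minimal Go sketch of waiting for initial fingerprints with a bounded timeout. It is an illustration, not the code in `client/client.go`: the function name `waitForInitialFingerprints` and the idea of one "ready" channel per plugin manager are assumptions made for the example.

```go
package main

import "time"

// waitForInitialFingerprints is a hypothetical helper. Each channel in ready
// is assumed to be closed by a plugin manager once its first batch of
// fingerprints has been applied. The function returns true if every channel
// was closed before the shared timeout elapsed, and false otherwise.
func waitForInitialFingerprints(ready []<-chan struct{}, timeout time.Duration) bool {
	// One timer bounds the total wait across all managers.
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for _, ch := range ready {
		select {
		case <-ch:
			// This manager has reported; move on to the next one.
		case <-timer.C:
			// A manager is slow or misbehaving; stop waiting.
			return false
		}
	}
	return true
}
```

A caller would invoke this with something like `time.Minute` just before the first registration and register the node regardless of the result, so a misbehaving plugin can delay registration but never block it forever.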

While digging here, I made a few changes that would have helped me debug the problem quicker:

- Removed the unnecessary `retryRegisterNode()` call that `watchNodeUpdates()` made almost immediately after `registerAndHeartbeat()` registered the node (well, after 5 seconds).
- Ensured that the detected-drivers log line shows each driver's status, so that non-fingerprinted drivers are not mistaken for available ones.


I noticed that `watchNodeUpdates()` calls `retryRegisterNode()` almost immediately after `registerAndHeartbeat()` does (well, after 5 seconds). This call is unnecessary and made debugging a bit harder. So here, we ensure that we only re-register the node for new node events, not for the initial registration.
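
The behavior described above can be sketched roughly as follows. This is a simplified stand-in and not the real `watchNodeUpdates()`: the channel-based signature, the `batch` parameter, and the `register` callback are all assumptions made for illustration.

```go
package main

import "time"

// watchNodeUpdates is a simplified, hypothetical stand-in for the client's
// update watcher. It never registers on startup. It calls register only
// after at least one update signal has arrived, and it batches signals that
// arrive within the batch window into a single call.
// The updates channel is expected to stay open; use stop to end the loop.
func watchNodeUpdates(updates, stop <-chan struct{}, batch time.Duration, register func()) {
	// A nil channel blocks forever in select, so no re-registration can
	// fire until an update has armed the timer.
	var batchTimer <-chan time.Time

	for {
		select {
		case <-updates:
			if batchTimer == nil {
				batchTimer = time.After(batch)
			}
		case <-batchTimer:
			batchTimer = nil
			register()
		case <-stop:
			return
		}
	}
}
```

The point of the nil timer channel is that the initial registration stays the sole responsibility of the registration path, while this loop only reacts to later node events.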

Noticed that the `detected drivers` log line was misleading: when a driver doesn't fingerprint before the timeout, its health status is the empty string `""`, which we would mark as detected. Now, we log all drivers along with their state to ease driver fingerprint debugging.
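
As a rough sketch of the logging idea, the helper below formats every driver together with its state instead of listing only names. The function name `summarizeDrivers`, the `name=state` output format, and the `unknown` placeholder for an empty status are assumptions for this example, not what the client actually emits.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarizeDrivers is a hypothetical helper that renders each driver with its
// health state. A driver that has not fingerprinted yet has an empty state,
// which is shown explicitly as "unknown" rather than being treated as
// detected.
func summarizeDrivers(health map[string]string) string {
	names := make([]string, 0, len(health))
	for name := range health {
		names = append(names, name)
	}
	// Sort so the log line is stable across runs.
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		state := health[name]
		if state == "" {
			state = "unknown" // placeholder label chosen for this sketch
		}
		parts = append(parts, fmt.Sprintf("%s=%s", name, state))
	}
	return strings.Join(parts, ", ")
}
```

With a line like this in the logs, a driver that missed the fingerprint timeout is visibly distinct from one that reported a real health state.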

nickethier

LGTM

client/client.go

github-actions · 2023-02-12T02:17:33Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

Mahmood Ali added 3 commits April 19, 2019 09:00

nickethier approved these changes Apr 19, 2019

clarify cryptic log line

8041b0c

notnoop merged commit 9050f5f into master Apr 19, 2019

notnoop deleted the b-drivers-node-registration branch April 19, 2019 13:47

github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2023
