Don't ignore nil devices in plugin fingerprint #9311

Merged
merged 1 commit into from
Nov 11, 2020

Conversation

jeromegn
Contributor

Even if a plugin sends back an empty []*device.DeviceGroup, it's transformed to nil during the RPC between Nomad and the plugin. The f.Devices == nil conditional block then prevented the manager from triggering the logic for "first fingerprint received". This adds a lot of latency when starting Nomad.

Our custom device plugin returns an empty FingerprintResponse.Devices slice very often. Our temporary fix is to send a []*DeviceGroup containing a "dummy" DeviceGroup if the slice is empty.
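For illustration, the temporary workaround described above might look roughly like this on the plugin side. It is a minimal sketch, not our plugin's actual code: the `DeviceGroup` struct is a simplified local stand-in for the plugin API type, and `padEmptyGroups` and the placeholder field values are hypothetical names.

```go
package main

import "fmt"

// DeviceGroup is a simplified, hypothetical stand-in for the device plugin
// API's DeviceGroup type; only the fields needed for the sketch are included.
type DeviceGroup struct {
	Vendor string
	Type   string
	Name   string
}

// padEmptyGroups (hypothetical helper) returns groups unchanged when it has
// at least one entry. When it is nil or empty, it returns a slice holding a
// single placeholder group, so the receiving side never sees nil.
func padEmptyGroups(groups []*DeviceGroup) []*DeviceGroup {
	if len(groups) > 0 {
		return groups
	}
	return []*DeviceGroup{{Vendor: "example", Type: "dummy", Name: "placeholder"}}
}

func main() {
	padded := padEmptyGroups([]*DeviceGroup{})
	fmt.Println(len(padded), padded[0].Name)
}
```

The obvious downside is that the client then sees a device that does not exist, which is why this is only a stopgap.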

Not triggering the "first fingerprint received" logic adds 50s to Nomad's startup time (because of the batch fingerprint timeout of 50s). In turn, this made our node exceed its heartbeat grace period with our leader when restarting it, which revoked all Vault tokens for its allocations and caused all of our allocations to restart because their tokens couldn't be renewed.

Removing the logic for f.Devices == nil does not appear to affect the functionality of the function.
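To make the effect of the guard concrete, here is a toy model of the behavior described above. It is a sketch under stated assumptions, not Nomad's actual code: the `manager` type, its fields, and `firstFingerprintSeen` are hypothetical, the types are simplified stand-ins, and the 50s timeout is shortened to 50ms so the example runs quickly.

```go
package main

import (
	"fmt"
	"time"
)

// DeviceGroup and FingerprintResponse are simplified, hypothetical stand-ins
// for the device plugin types named in this PR.
type DeviceGroup struct{ Name string }

type FingerprintResponse struct{ Devices []*DeviceGroup }

// manager is a toy model of the client-side manager, not Nomad's real type.
type manager struct {
	skipNilDevices bool          // true models the behavior before this PR
	firstCh        chan struct{} // closed once the first fingerprint is handled
	seenFirst      bool
}

func newManager(skipNilDevices bool) *manager {
	return &manager{skipNilDevices: skipNilDevices, firstCh: make(chan struct{})}
}

// handleFingerprint records a response received from the plugin.
func (m *manager) handleFingerprint(f *FingerprintResponse) {
	// The old guard: a nil Devices slice returns early, so the
	// "first fingerprint received" signal below is never sent.
	if m.skipNilDevices && f.Devices == nil {
		return
	}
	if !m.seenFirst {
		m.seenFirst = true
		close(m.firstCh)
	}
}

// waitForFirst blocks until the first fingerprint is signaled or the timeout
// elapses, and reports whether the signal arrived in time.
func (m *manager) waitForFirst(timeout time.Duration) bool {
	select {
	case <-m.firstCh:
		return true
	case <-time.After(timeout):
		return false
	}
}

// firstFingerprintSeen feeds one response through a fresh manager and reports
// whether startup would proceed without waiting out the timeout.
func firstFingerprintSeen(skipNilDevices bool, devices []*DeviceGroup) bool {
	m := newManager(skipNilDevices)
	m.handleFingerprint(&FingerprintResponse{Devices: devices})
	return m.waitForFirst(50 * time.Millisecond) // stands in for the 50s timeout
}

func main() {
	fmt.Println("with guard:", firstFingerprintSeen(true, nil))
	fmt.Println("without guard:", firstFingerprintSeen(false, nil))
}
```

With the guard in place, a nil device list is dropped and the wait runs until the timeout; without it, the first response signals immediately even when it carries no devices.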

Even if a plugin sends back an empty `[]*device.DeviceGroup`, it's transformed to `nil` during the RPC. This has the effect of never triggering the "first fingerprint" and therefore timing out after 50s. Our custom device plugin is returning empty `FingerprintResponse.Devices` very often. Our temporary fix is to send a dummy `*DeviceGroup` if the slice is empty.

In turn, this made our node exceed its heartbeat grace period when restarting it, revoking all Vault tokens for its allocations and causing a restart of all our allocations because the tokens couldn't be renewed.

Removing the logic for `f.Devices == nil` does not appear to affect the functionality of the function.
@jeromegn
Contributor Author

I don't know why that check is failing, I can't replicate it locally.

@cgbaker cgbaker added this to the 1.0 milestone Nov 11, 2020
Contributor

@cgbaker cgbaker left a comment

@jeromegn, wonderful PR. I personally ran into this problem while doing some device driver development.

@cgbaker
Contributor

cgbaker commented Nov 11, 2020

Have manually tested this with a device plugin that initially returns an empty list, then adds other devices on later fingerprinting. Everything works as expected, no surprises.

@cgbaker cgbaker merged commit 1df408d into hashicorp:master Nov 11, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2022