Add FailoverHeartbeatTTL to config #11127

Merged: 3 commits, Oct 6, 2021

@mukerjee (Contributor) commented Sep 2, 2021

FailoverHeartbeatTTL is the amount of time to wait after a server leader failure
before considering reallocating client tasks. This TTL should be fairly long as
the new server leader needs to rebuild the entire heartbeat map for the
cluster. In deployments with a small number of machines, the default TTL (5m)
may be unnecessarily long. Let's allow operators to configure this value in their
config files.

In our use case, we have a small number of machines (e.g., 7) in the same physical rack, connected with redundant networking (multiple NICs, multiple switches). It is prohibitively expensive for us to dedicate machines to being only Nomad servers (which would make FailoverHeartbeatTTL less impactful). In this use case, if heartbeats haven't been responded to within, e.g., 30s, the machine has almost certainly failed in some way. There is no need to wait for 5m.

This relates to #1747, which requested that MinHeartbeatTTL and FailoverHeartbeatTTL be made configurable. Since then, MinHeartbeatTTL has been made configurable. This PR makes FailoverHeartbeatTTL configurable as well.
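
For illustration, the small-cluster setup described above might look like the following in a server's agent configuration. This is only a sketch: it assumes the new option is exposed as `failover_heartbeat_ttl` in the `server` block, alongside the existing `min_heartbeat_ttl`, and the `30s` value is taken from the example above rather than being a recommendation.

```hcl
server {
  enabled = true

  # Assumed key name for FailoverHeartbeatTTL; the default is 5m.
  # A shorter value lets a new leader consider reallocating client
  # tasks sooner after a leader failure, but leaves it less time to
  # rebuild the heartbeat map for the cluster.
  failover_heartbeat_ttl = "30s"
}
```

A value this short only makes sense for clusters like the one described here, with few machines and redundant networking; larger clusters likely want to stay close to the default.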

@hashicorp-cla commented Sep 2, 2021

CLA assistant check
All committers have signed the CLA.

@mukerjee (Contributor, Author) commented

Bump. Any chance this can be looked at? This would be very useful for my use case.

@lgfa29 (Contributor) left a comment

Thank you for this work @mukerjee! And apologies for the delay in getting it reviewed.

I pushed a commit that adds a changelog entry and highlights the potential risks of modifying this configuration.

lgfa29 added this to the 1.2.0 milestone Oct 6, 2021
lgfa29 merged commit 0881b94 into hashicorp:main Oct 6, 2021
@mukerjee (Contributor, Author) commented Oct 6, 2021

Excellent! Thank you @lgfa29 ! No worries about the delay.

I see this is marked for v1.2.0. Any timeframe for that release yet?

@lgfa29 (Contributor) commented Oct 7, 2021

> Excellent! Thank you @lgfa29 ! No worries about the delay.
>
> I see this is marked for v1.2.0. Any timeframe for that release yet?

No hard dates yet, but soon 🙂

@github-actions commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022
mukerjee deleted the failover-heartbeat-ttl-config branch November 14, 2022 06:44