Backport 9672 to versions 0.11 & 0.12 #10177

Closed
luckymike opened this issue Mar 13, 2021 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. type/enhancement

Comments

@luckymike

Proposal

#9672 fixed a bug, introduced in 0.11, that caused nodes with no bootstrap_expect or a bootstrap_expect value of 0 to bootstrap as standalone nodes rather than joining a cluster.

In our experience, in 0.11 new instances would successfully join an existing cluster about 30% of the time. In 0.12, our experience was that every node would fail to auto-join a cluster.

Given the potential severity of this and the lack of a correct workaround, I think this fix should be backported to 0.11 & 0.12 if these versions are considered at all viable to run in a production setting.

Use-cases

Operators should be able to safely auto-join nodes to an existing cluster by setting bootstrap_expect to 0.

Setting bootstrap_expect to 0 is recommended to avoid potential split-brain scenarios where multiple Nomad clusters register in Consul. As I'm sure you know, this can cause anything from confusion to a major outage. The only workaround for the bug that is fixed in #9672 is to set bootstrap_expect to a higher value, which introduces the risk of a split-brain.

In addition to making a common, intended operational mode unusable, this bug is extremely hard to identify, because Nomad starts without errors on the affected node and registers as healthy in Consul. The only indication of a problem appears on nodes that have ACLs enabled, where any interaction with the agent receives a 403, even though Nomad still registers as completely healthy.
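Since the Consul health check does not catch this, one way to spot an affected server is to compare its Raft peer list against the cluster size you expect. The following is only a rough sketch, not an official check: it assumes the agent's HTTP API is reachable on the default `http://127.0.0.1:4646`, and the function names and the `expected_min` threshold are made up for illustration.

```python
import json
import urllib.request


def looks_standalone(peers, expected_min=2):
    """Return True when the Raft peer list is smaller than expected.

    `peers` is the list of "host:port" strings returned by the
    /v1/status/peers endpoint. `expected_min` is a hypothetical
    threshold: the smallest peer count considered a real cluster.
    """
    return len(peers) < expected_min


def fetch_peers(base_url="http://127.0.0.1:4646", token=None):
    """Fetch the Raft peers known to this agent.

    Raises urllib.error.HTTPError on a non-2xx response. On a node
    hit by this bug with ACLs enabled, a 403 for a token that works
    elsewhere in the cluster is itself a strong hint.
    """
    req = urllib.request.Request(f"{base_url}/v1/status/peers")
    if token:
        req.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

Running `looks_standalone(fetch_peers())` on a freshly launched server should flag a node that bootstrapped on its own, assuming the real cluster has at least two servers.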

Attempted Solutions

The only solutions are to increase bootstrap_expect or to manually join nodes to clusters, both of which degrade (or eliminate) the use of auto-joining nodes to a running cluster.
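For reference, the manual join mentioned above can also be scripted against the agent HTTP API. This is a sketch under assumptions rather than a recommended procedure: the helper name, the example addresses, and the default Serf port 4648 are illustrative, and the token needs whatever ACL permission your cluster requires for joins.

```python
import urllib.parse
import urllib.request


def build_join_request(base_url, addresses, token=None):
    """Build a POST request for /v1/agent/join.

    Each server to join is passed as a repeated `address` query
    parameter. `addresses` here are hypothetical "host:port" values.
    """
    query = urllib.parse.urlencode([("address", a) for a in addresses])
    req = urllib.request.Request(
        f"{base_url}/v1/agent/join?{query}", method="POST"
    )
    if token:
        req.add_header("X-Nomad-Token", token)
    return req


if __name__ == "__main__":
    # Example only: replace with the addresses of real servers.
    request = build_join_request(
        "http://127.0.0.1:4646", ["10.0.0.1:4648", "10.0.0.2:4648"]
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        print(resp.read().decode())
```

This still has to be run per node, so it does not restore auto-join; it only removes the manual step from the operator.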

@notnoop
Contributor

notnoop commented Mar 17, 2021

Hi @luckymike! Thanks for raising it. I agree that the issue is severe and warrants a backport. I'll cut a 0.12 release that backports 9672 later this week.

@notnoop notnoop added the stage/accepted Confirmed, and intend to work on. No timeline commitment though. label Mar 17, 2021
@notnoop
Contributor

notnoop commented Mar 18, 2021

We just shipped 0.12.11 with the fix: https://releases.hashicorp.com/nomad/0.12.11/ and https://github.com/hashicorp/nomad/commits/v0.12.11. Thank you so much for bringing it to our attention, and I'm glad we could help.

@notnoop notnoop closed this as completed Mar 18, 2021
@luckymike
Author

Thanks @notnoop!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 21, 2022