Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate autopilot implementation to raft-autopilot #14441

Merged
merged 3 commits into from
Sep 1, 2022
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Sep 1, 2022

Nomad's original autopilot was importing from a private package in Consul. It
has been moved out to a shared library. Switch Nomad to use this library so that
we can eliminate the import of Consul, which is necessary to build Nomad ENT
with the current version of the Consul SDK. This also will let us pick up
autopilot improvements shared with Consul more easily.

Fixes #9570 (associated with https://github.com/hashicorp/nomad-enterprise/pull/836)

Nomad's original autopilot was importing from a private package in Consul. It
has been moved out to a shared library. Switch Nomad to use this library so that
we can eliminate the import of Consul, which is necessary to build Nomad ENT
with the current version of the Consul SDK. This also will let us pick up
autopilot improvements shared with Consul more easily.
@tgross
Copy link
Member Author

tgross commented Sep 1, 2022

I was hoping this would fix the test flakes in #13931 and #13930 but given the behavior isn't supposed to have changed I guess I see why it doesn't. I'll pick those up as part of improving our E2E story for autopilot before 1.4.0-GA.

Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! It may be good to test (if you haven't yet) interoperability between servers using the old Consul autopilot and this one, like it would happen during a rolling cluster upgrade.

@@ -40,7 +40,6 @@ require (
github.com/gorilla/websocket v1.5.0
github.com/gosuri/uilive v0.0.4
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0
github.com/hashicorp/consul v1.7.8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

nomad/autopilot.go Outdated Show resolved Hide resolved
// immediately so we'll spawn a goroutine for it.)
func (d *AutopilotDelegate) RemoveFailedServer(failedSrv *autopilot.Server) {
go func() {
err := d.server.RemoveFailedNode(failedSrv.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Name here the agent's name? Would it be possible to remove by ID instead in case the cluster has servers with duplicate names (which I think is allow, though not a good idea 😅)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the serf.Member.Name, which I think is the same thing. I originally had this by ID (which is what Consul does) but our existing RemoveFailedNode code only accepts the name and not the ID. I didn't want to change the behavior here.

Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -40,7 +40,6 @@ require (
github.com/gorilla/websocket v1.5.0
github.com/gosuri/uilive v0.0.4
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0
github.com/hashicorp/consul v1.7.8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

nomad/autopilot.go Outdated Show resolved Hide resolved
Comment on lines +8 to 9
// TODO: replace this with our own helper
"github.com/hashicorp/consul/sdk/testutil/retry"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's on my list 😞

@vercel
Copy link

vercel bot commented Sep 1, 2022

Deployment failed with the following error:

Invalid website/vercel.json file provided

@vercel
Copy link

vercel bot commented Sep 1, 2022

Deployment failed with the following error:

Invalid ui/vercel.json file provided

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adopt raft-autopilot library
3 participants