
Nomad server 1.3.5 isn't able to join 1.4.0 #14819

Closed
freef4ll opened this issue Oct 6, 2022 · 8 comments · Fixed by #14821
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/core theme/crash type/bug
Milestone

Comments

@freef4ll

freef4ll commented Oct 6, 2022

Nomad version

One Nomad server runs 1.4.0 on Linux; the other runs 1.3.5 and panics when trying to join:

Operating system and Environment details

1.4.0:

# uname -a
Linux test 5.4.0-126-generic #142-Ubuntu SMP Fri Aug 26 12:15:55 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

1.3.5 from brew:

$ uname -a
Darwin laptop.local 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000 arm64

Issue

Trying to bring up a cluster causes the 1.3.5 server to panic:

    2022-10-06T12:05:56.238+0300 [INFO]  nomad.raft: updating configuration: command=AddVoter server-id=259e58a9-7df4-9e68-6217-c0602b5784cf server-addr=192.168.66.7:4647 servers="[{Suffrage:Voter ID:a63aa225-36ec-2644-98b7-dd902f5ccc72 Address:192.168.66.1:4647} {Suffrage:Voter ID:259e58a9-7df4-9e68-6217-c0602b5784cf Address:192.168.66.7:4647}]"
    2022-10-06T12:10:19.447+0300 [WARN]  nomad.raft: failed to contact quorum of nodes, stepping down
    2022-10-06T12:10:19.480+0300 [INFO]  nomad.raft: entering follower state: follower="Node at 192.168.66.1:4647 [Follower]" leader-address= leader-id=
    2022-10-06T12:10:19.480+0300 [INFO]  nomad.raft: aborting pipeline replication: peer="{Nonvoter 259e58a9-7df4-9e68-6217-c0602b5784cf 192.168.66.7:4647}"
    2022-10-06T12:10:19.628+0300 [INFO]  nomad: cluster leadership lost
    2022-10-06T12:10:19.628+0300 [ERROR] worker: failed to dequeue evaluation: worker_id=2f5d738b-bff9-0e16-8a62-9adac20068ac error="eval broker disabled"
    2022-10-06T12:10:19.628+0300 [ERROR] worker: failed to dequeue evaluation: worker_id=69995186-c93e-57c7-7ef3-9934aac44bb0 error="eval broker disabled"
    2022-10-06T12:10:21.570+0300 [INFO]  nomad.raft: duplicate requestVote for same term: term=1562
    2022-10-06T12:10:21.570+0300 [WARN]  nomad.raft: duplicate requestVote from: candidate=192.168.66.7:4647
panic: failed to apply request: []byte{0x33, 0x87, 0xab, 0x52, 0x6f, 0x6f, 0x74, 0x4b, 0x65, 0x79, 0x4d, 0x65, 0x74, 0x61, 0x86, 0xa5, 0x4b, 0x65, 0x79, 0x49, 0x44, 0xda, 0x0, 0x24, 0x37, 0x34, 0x32, 0x33, 0x65, 0x66, 0x63, 0x38, 0x2d, 0x64, 0x35, 0x38, 0x30, 0x2d, 0x65, 0x62, 0x62, 0x65, 0x2d, 0x62, 0x36, 0x33, 0x34, 0x2d, 0x31, 0x31, 0x31, 0x34, 0x32, 0x35, 0x33, 0x31, 0x35, 0x33, 0x34, 0x33, 0xa9, 0x41, 0x6c, 0x67, 0x6f, 0x72, 0x69, 0x74, 0x68, 0x6d, 0xaa, 0x61, 0x65, 0x73, 0x32, 0x35, 0x36, 0x2d, 0x67, 0x63, 0x6d, 0xaa, 0x43, 0x72, 0x65, 0x61, 0x74, 0x65, 0x54, 0x69, 0x6d, 0x65, 0xd3, 0x17, 0x1b, 0x6f, 0xcf, 0x26, 0xb1, 0x64, 0xd2, 0xab, 0x43, 0x72, 0x65, 0x61, 0x74, 0x65, 0x49, 0x6e, 0x64, 0x65, 0x78, 0x0, 0xab, 0x4d, 0x6f, 0x64, 0x69, 0x66, 0x79, 0x49, 0x6e, 0x64, 0x65, 0x78, 0x0, 0xa5, 0x53, 0x74, 0x61, 0x74, 0x65, 0xa6, 0x61, 0x63, 0x74, 0x69, 0x76, 0x65, 0xa5, 0x52, 0x65, 0x6b, 0x65, 0x79, 0xc2, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa0, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xa0, 0xb0, 0x49, 0x64, 0x65, 0x6d, 0x70, 0x6f, 0x74, 0x65, 0x6e, 0x63, 0x79, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xa0, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc2}

goroutine 78 [running]:
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0x140001c6cb0, 0x140016042a0)
        github.com/hashicorp/nomad/nomad/fsm.go:330 +0xe8c
github.com/hashicorp/raft.(*Raft).runFSM.func1(0x14000533930)
        github.com/hashicorp/raft@v1.3.9/fsm.go:98 +0x1c8
github.com/hashicorp/raft.(*Raft).runFSM.func2({0x14000ab2a00, 0x1, 0x40?})
        github.com/hashicorp/raft@v1.3.9/fsm.go:121 +0x40c
github.com/hashicorp/raft.(*Raft).runFSM(0x14000165600)
        github.com/hashicorp/raft@v1.3.9/fsm.go:237 +0x2d0
github.com/hashicorp/raft.(*raftState).goFunc.func1()
        github.com/hashicorp/raft@v1.3.9/state.go:146 +0x5c
created by github.com/hashicorp/raft.(*raftState).goFunc
        github.com/hashicorp/raft@v1.3.9/state.go:144 +0x8c

If 1.3.6 is used, this comes up fine.

@jrasell
Member

jrasell commented Oct 6, 2022

Hi @freef4ll and thanks for raising this.

We will look into this straight away, but we could benefit from a little more detail. Could you share the upgrade steps you are taking that lead to this panic, noting which server is the leader at each stage?

If 1.3.6 is used, this comes up fine.

Could you clarify whether this means 1.3.5 and 1.3.6 together is running fine, or whether this is 1.3.6 and 1.4.0?

@freef4ll
Author

freef4ll commented Oct 6, 2022

The steps are for a new cluster:

  1. Start 1.4.0, single node, bootstrap_expect=1
  2. Start 1.3.5, with start_join towards the 1.4.0. The panic will be experienced.

Could you clarify whether this means 1.3.5 and 1.3.6 together is running fine
Yes, this variation works fine.

The 1.3.5 runs on a Mac, and is installed via brew.

@tgross
Member

tgross commented Oct 6, 2022

Hi @freef4ll! Adding an "old" server to a current cluster isn't supported. But I was able to reproduce a crash with a more realistic upgrade scenario:

  • start with a 3-node 1.3.5 cluster
  • upgrade leader to 1.4.0
  • upgrade new leader to 1.4.0
  • stop the remaining 1.3.5 server to force a leader election to one of the 1.4.0 machines (simulating a netsplit or other glitch)
  • restart the 1.3.5 server without upgrading; it will panic

I think we're often at risk of this scenario whenever we add new raft entry types, but we're seeing it here because the leader transition itself writes a new kind of raft entry in Nomad 1.4.0. We're having a chat internally about how we can fix this and we'll circle back shortly.
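To illustrate the failure mode: the FSM on each server dispatches on the message type byte of every raft log entry it applies, and an entry type it has no case for falls through to the panic in nomad/fsm.go seen above. A minimal sketch (the type names and byte values below are illustrative, not Nomad's actual constants):

```go
package main

import "fmt"

// MessageType stands in for the leading message type byte of a raft log entry.
type MessageType uint8

const (
	// A long-standing entry type that every server version understands.
	NodeRegisterRequestType MessageType = 0x00
	// A 1.4.0-only entry type, such as the keyring (RootKeyMeta) upsert.
	KeyringUpsertRequestType MessageType = 0x33
)

// fsmApply sketches how an FSM dispatches a raft entry: a type with no
// registered handler is treated as fatal, which is what the panic in
// nomad/fsm.go amounts to on a 1.3.5 server.
func fsmApply(known map[MessageType]string, msgType MessageType) (string, error) {
	handler, ok := known[msgType]
	if !ok {
		return "", fmt.Errorf("failed to apply request: unknown message type %#x", msgType)
	}
	return handler, nil
}

func main() {
	// A pre-1.4.0 FSM only knows the older entry types.
	known135 := map[MessageType]string{NodeRegisterRequestType: "applyUpsertNode"}
	if _, err := fsmApply(known135, KeyringUpsertRequestType); err != nil {
		fmt.Println(err) // the old server cannot proceed and panics here
	}
}
```

This is why the fix has to keep the leader from emitting the new entry type until every server in the cluster is on a version that can decode it.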

@jrasell jrasell pinned this issue Oct 6, 2022
@jrasell jrasell added theme/core stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Oct 6, 2022
@tgross
Member

tgross commented Oct 6, 2022

Fix is in #14821
We've added a warning to the upgrade guide in #14825

@freef4ll
Author

freef4ll commented Oct 6, 2022

@tgross , great thank you!

@freef4ll freef4ll closed this as completed Oct 6, 2022
@tgross
Member

tgross commented Oct 6, 2022

We'll leave this issue pinned so that other folks see it, at least until 1.4.1 comes out (which should be quick), and perhaps for some time after that as well.

@tgross
Member

tgross commented Oct 6, 2022

Nomad 1.4.1 has been released with a fix for this issue.

@tgross tgross added this to the 1.4.0 milestone Oct 6, 2022
@nickwales nickwales unpinned this issue Oct 19, 2022
@nickwales nickwales pinned this issue Oct 19, 2022
@nickwales nickwales unpinned this issue Oct 27, 2022
@github-actions

github-actions bot commented Feb 4, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 4, 2023