Not upgrading servers across all regions to 1.4 causes "keyring has not been initialized yet" #14896

Closed
optiz0r opened this issue Oct 15, 2022 · 4 comments · Fixed by #14901

optiz0r commented Oct 15, 2022

Nomad version

Main region: 1.4.1
Alternate region: 1.3.3

Issue

After upgrading the three servers in the main region (including the leader node) from 1.3.5 to 1.4.1, it was impossible to schedule any new jobs until a second region, containing a single server, was also upgraded from 1.3.3 to 1.4.1. All evals failed with the error "keyring has not been initialized yet". As soon as the alternate region's server was upgraded, jobs in the main region started to be allocated again.

Reproduction steps

  • Upgrade the region containing the leader from 1.3.5 to 1.4.1, leaving a server in a second region at 1.3.3.
  • Attempt to schedule any new jobs

Expected Result

Jobs continue to be scheduled in the main region once all of its servers have been upgraded to 1.4.1.

Actual Result

The cluster was inoperable until all servers across all regions were upgraded.

Possible remedies

The upgrade-specific instructions hint that there could be problems if the leader is allowed to roll back to an older cluster version, but not that servers across all regions need to be upgraded simultaneously in order to maintain cluster function.
Either the upgrade notes should call out this requirement, or the leader should be able to handle servers in another region that are still running an older version. Typically I would only upgrade one region at a time.

Nomad Server logs

Oct 14 21:58:31 leader.example.com nomad[276724]:     2022-10-14T21:58:31.265Z [ERROR] worker: error invoking scheduler: worker_id=4ba20503-dafb-ee08-c327-7e0d1779e4ae error="failed to process evaluation: rpc error: keyring has not been initialized yet"
Oct 14 21:58:31 leader.example.com nomad[276724]:     2022-10-14T21:58:31.725Z [ERROR] worker: failed to submit plan for evaluation: worker_id=4ba20503-dafb-ee08-c327-7e0d1779e4ae eval_id=6c8ce736-2a39-836b-1d83-e51714187792 error="rpc error: keyring has not been initialized yet"
Oct 14 21:58:31 leader.example.com nomad[276724]:     2022-10-14T21:58:31.726Z [ERROR] worker: error invoking scheduler: worker_id=4ba20503-dafb-ee08-c327-7e0d1779e4ae error="failed to process evaluation: rpc error: keyring has not been initialized yet"
Oct 14 21:58:32 leader.example.com nomad[276724]:     2022-10-14T21:58:32.732Z [ERROR] worker: failed to submit plan for evaluation: worker_id=4ba20503-dafb-ee08-c327-7e0d1779e4ae eval_id=6c8ce736-2a39-836b-1d83-e51714187792 error="rpc error: keyring has not been initialized yet"
Oct 14 21:58:32 leader.example.com nomad[276724]:     2022-10-14T21:58:32.732Z [ERROR] worker: error invoking scheduler: worker_id=4ba20503-dafb-ee08-c327-7e0d1779e4ae error="failed to process evaluation: rpc error: keyring has not been initialized yet"

tgross self-assigned this Oct 15, 2022
tgross added the theme/variables label Oct 15, 2022

tgross commented Oct 15, 2022

Hi @optiz0r! Thanks for bringing this discussion over from HangOps!

When regions are federated they're joined in the same serf membership, but there's no raft replication between them. Unfortunately it looks like we're using the serf membership and not the raft cluster for the check so we're unnecessarily blocking keyring initialization even though the keys aren't actually replicated between regions. 🤦‍♂️ I'll take a look early next week as to the best way to get this fixed.
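
To make the distinction above concrete, here is a minimal sketch of a minimum-version check that can be scoped either to every gossip member or to a single region. It is not Nomad's actual code: the `Member` type, the `meetsMinVersion` helper, the server names, and the `1.4.0` threshold are all assumptions chosen for illustration, using the versions from this report.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Member is a hypothetical stand-in for a gossip (serf) member record.
type Member struct {
	Name    string
	Region  string
	Version string // "major.minor.patch"
}

// parseVersion converts "1.4.1" into [3]int{1, 4, 1}.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("invalid version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("invalid version %q: %w", s, err)
		}
		v[i] = n
	}
	return v, nil
}

// less reports whether version a is strictly older than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// meetsMinVersion reports whether every member in scope runs at least
// minimum. An empty region means "all gossip members"; a non-empty region
// restricts the check to servers in that region only.
func meetsMinVersion(members []Member, region, minimum string) (bool, error) {
	minV, err := parseVersion(minimum)
	if err != nil {
		return false, err
	}
	for _, m := range members {
		if region != "" && m.Region != region {
			continue // servers in other regions do not share this raft cluster
		}
		v, err := parseVersion(m.Version)
		if err != nil {
			return false, err
		}
		if less(v, minV) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Hypothetical membership mirroring the report: three upgraded servers
	// in the main region and one older server in the alternate region.
	members := []Member{
		{"main-1", "main", "1.4.1"},
		{"main-2", "main", "1.4.1"},
		{"main-3", "main", "1.4.1"},
		{"alt-1", "alt", "1.3.3"},
	}
	gossipWide, _ := meetsMinVersion(members, "", "1.4.0")
	regionOnly, _ := meetsMinVersion(members, "main", "1.4.0")
	fmt.Println("gossip-wide check passes:", gossipWide)
	fmt.Println("region-scoped check passes:", regionOnly)
}
```

With these inputs the gossip-wide check fails because of the one older server in the other region, while the region-scoped check passes, which matches the behavior described in this issue.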


tgross commented Oct 17, 2022

First pass at a fix is here: #14901

tgross added this to the 1.4.x milestone Oct 17, 2022

tgross commented Oct 17, 2022

#14901 has been merged, and we've identified a few other places where the server minimum version check is potentially wrong. This is a long-used pattern and the new keyring feature draws it out much more than some other features we've built in the last few major versions. Once we've got those patches in, we'll have a discussion about how soon we can ship this. Thanks for opening the issue @optiz0r!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2023