
Nomad Raft appears to overload itself with logs and stops creating allocations #14915

Closed
douglaje opened this issue Oct 17, 2022 · 6 comments

@douglaje

Nomad version

Nomad v1.4.0

Operating system and Environment details

Debian 11, Linux 5.10.0-18-cloud-arm64 (running on AWS m6g.medium instance)

5 nomad servers, 20+ clients (this is our staging cluster)

Issue

Note: This eventually led to complete loss of our nomad server quorum, and we had to restore from backup.

The first symptom appeared early Saturday morning; call that T-00:00. We had what appears to be a Raft leader switch:
[graph: Raft leader transition around T-00:00]

After this leader switch, all our current jobs kept running but new allocs (primarily periodic jobs) almost completely stopped.
[graph: new allocation placements dropping off after the leader switch]

At the same time, the raft log apply index delta (applyindex change over time) started growing linearly. So did all the other measures of the number of raft logs being applied per second:
[graph: Raft applied-index delta and raft logs applied per second growing linearly]
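(For anyone trying to spot this on their own cluster: we were graphing the change in the Raft applied index over time from our telemetry. Roughly the same numbers can be checked ad hoc on a server with nomad agent-info; the field names shown below are from memory and may differ by version.)

    # Run against a Nomad server and look at the raft section of the output
    nomad agent-info
    # raft:
    #   applied_index  = ...
    #   commit_index   = ...
    #   last_log_index = ...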

At T-10:00 (10 hours later), the nomad servers became so unstable that they began rapidly switching leaders and failing to respond on the /v1/agent/health endpoint, and the cluster lost raft consensus.
[graph: rapid leader changes and /v1/agent/health failures starting around T-10:00]

Reproduction steps

We haven't been able to reproduce this yet, but we believe it happened to us ~2 months ago. At the time, we believed the cause was that our nomad servers were too small, and so didn't look any deeper into the incident.

I'm raising this issue because I'm at a loss for what to do next to debug it. As far as I can tell it started with a run-of-the-mill leader re-election but it somehow crippled our cluster and then gradually overloaded it.

I'm trying to get logs for at least the leader re-election, but we may have lost them due to a logging misconfiguration plus the instances being recycled once the nomad /v1/agent/health endpoint started timing out.

@schmichael
Member

This is definitely concerning! I would love to see server logs from the leader election if possible. In the future running nomad operator debug would be extremely useful, especially the goroutine dump. Note that if a leader has not been elected you need to run the command with the -stale flag.
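Something along these lines on each server should do it (the only flag you really need here is -stale when there is no leader; -duration is shown only as an example and defaults may vary by version):

    # Captures logs, metrics, and goroutine dumps into a debug archive
    nomad operator debug -stale -duration=2m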

These files could be shared privately via our nomad-oss-debug@hashicorp.com email. Neither logs nor debug bundles will contain secrets, but they will contain lots of other information like job names and usage patterns that you may not want to post publicly.

/v1/agent/health returning unhealthy or timing out indicates a failure to elect a leader.
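A quick way to check that on each server (assuming the default HTTP port 4646):

    # Per the above, a timeout or unhealthy response here generally means
    # the server cannot reach an elected leader.
    curl -s http://127.0.0.1:4646/v1/agent/health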

Is this cluster federated with a pre-1.4 cluster? If so #14901 could cause this sort of failure, but it should have happened on upgrade to 1.4 -- not after... unless you didn't federate until after you upgraded to 1.4?

Server logs would show that error.

@douglaje
Author

Hi @schmichael, thanks for the quick response!

This was a plain old cluster, no federation or anything advanced beyond ACLs and a lot of jobs. It was upgraded to 1.4.0 the previous week to deal with the autoscaling-while-deploying bug.

Unfortunately, it looks like our server logs are gone (we haven't needed them before, so we didn't notice they weren't being shipped). And while it seemed like a good idea to have our ASG recycle instances that failed the /v1/agent/health check, we're realizing now that it makes recovery and postmortems nearly impossible, so we've removed that behavior.

Now that we know what to look for we've set up alerting to detect this issue, so next time it happens the big things we want to collect are:

  • nomad operator debug, which (looking at the documentation) collects basically everything 😂
  • Additional server logs surrounding the leader election

Let me know if there's anything else, and I'll update this ticket when we're able to capture that information the next time this happens.

@schmichael
Member

Hopefully that does it. Another thing you can try, to avoid having to restore from snapshot, is to shrink down to 1 server and restart it with bootstrap_expect=1 plus only itself in peers.json. That might be more of a hassle than using a backup though. I've used that "drop down to 1 server" trick to get out of past Raft issues, and you can add in 2 more servers as soon as the first is back up. (Just also remember to set bootstrap_expect=3 in the first server's config once you've restored the cluster!)
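For reference, that recovery path looks roughly like this; the data_dir path, node ID, and address below are placeholders, and the Outage Recovery guide has the authoritative steps:

    # 1. Stop all servers, then on the one server you're keeping, write a peers.json
    #    containing only that server (id = its node ID, address = its RPC address):
    cat > /opt/nomad/data/server/raft/peers.json <<'EOF'
    [
      {
        "id": "11111111-2222-3333-4444-555555555555",
        "address": "10.0.0.10:4647",
        "non_voter": false
      }
    ]
    EOF
    # 2. Set bootstrap_expect = 1 in that server's config and start it.
    # 3. Once it is leader again, add the other two servers back and return
    #    bootstrap_expect to 3.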

@douglaje
Author

douglaje commented Nov 3, 2022

@schmichael So we found the root cause. Turns out it was #14981

When it happened again we saw this in the logs:

2022-10-27T00:17:12.512Z [ERROR] worker: error invoking scheduler: worker_id=59d0dddf-7d06-dc9f-e042-e99665088f9f error="failed to process evaluation: no such key \"2b431800-d36d-2284-7c48-4a6a485f14bb\" in keyring"
2022-10-27T00:17:12.513Z [DEBUG] worker: nack evaluation: worker_id=59d0dddf-7d06-dc9f-e042-e99665088f9f eval_id=ee15b028-0e75-4a1c-3515-dc82a06337a8 type=batch namespace=foo job_id=pfoo/periodic-1666829100 node_id="" triggered_by=failed-follow-up
2022-10-27T00:17:12.975Z [ERROR] nomad: failed to submit plan: error="no such key \"2b431800-d36d-2284-7c48-4a6a485f14bb\" in keyring"
2022-10-27T00:17:13.044Z [ERROR] nomad: failed to submit plan: error="no such key \"2b431800-d36d-2284-7c48-4a6a485f14bb\" in keyring"
2022-10-27T00:17:13.513Z [DEBUG] worker: dequeued evaluation: worker_id=59d0dddf-7d06-dc9f-e042-e99665088f9f eval_id=ee15b028-0e75-4a1c-3515-dc82a06337a8 type=batch namespace=foo job_id=pfoo/periodic-1666829100 node_id="" triggered_by=failed-follow-up
2022-10-27T00:17:13.513Z [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=ee15b028-0e75-4a1c-3515-dc82a06337a8 job_id=pfoo/periodic-1666829100 namespace=foo worker_id=59d0dddf-7d06-dc9f-e042-e99665088f9f
  results=
  | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Desired Changes for "pfoo": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
  
2022-10-27T00:17:13.515Z [ERROR] nomad: failed to submit plan: error="no such key \"2b431800-d36d-2284-7c48-4a6a485f14bb\" in keyring"
2022-10-27T00:17:13.516Z [ERROR] worker: failed to submit plan for evaluation: worker_id=59d0dddf-7d06-dc9f-e042-e99665088f9f eval_id=ee15b028-0e75-4a1c-3515-dc82a06337a8 error="no such key \"2b431800-d36d-2284-7c48-

Some quick googling led us to the keyring issues that popped up in the 1.4.0 release. Running nomad operator root keyring rotate immediately fixed the problem with the cluster not making new allocations, and upgrading to 1.4.2 appears to have fully solved it (or at least it hasn't happened again since).
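For anyone else who hits this, the immediate mitigation was just the rotate; the list command below is only there to verify the new key showed up and may vary by version:

    # Rotate the root encryption keyring; this is what unblocked new allocations for us
    nomad operator root keyring rotate
    # Verify the newly created key is present
    nomad operator root keyring list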

Thanks for the help on this, and I'm glad it was quickly fixed!

@douglaje douglaje closed this as completed Nov 3, 2022
@schmichael
Member

Argh, sorry that hit you @douglaje but thanks for coming back to close the issue.

@github-actions

github-actions bot commented Mar 4, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 4, 2023