Nomad Raft appears to overload itself with logs and stops creating allocations #14915
Comments
This is definitely concerning! I would love to see server logs from the leader election if possible. In the future, running `nomad operator debug` to capture a debug bundle would also help. These files could be shared privately via our nomad-oss-debug@hashicorp.com email. Neither logs nor debug bundles will contain secrets, but they will contain lots of other information like job names and usage patterns that you may not want to post publicly.
Is this cluster federated with a pre-1.4 cluster? If so, #14901 could cause this sort of failure, but it should have happened on upgrade to 1.4 -- not after... unless you didn't federate until after you upgraded to 1.4? Server logs would show that error.
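For anyone else who loses their server logs before they can be shipped, a minimal sketch of a point-in-time capture that could accompany a report like this, assuming the standard Nomad HTTP API on port 4646 and a read-capable ACL token; the address and token are placeholders, and the endpoints used are the documented read-only status/operator routes rather than a full debug bundle:

```python
# Hypothetical capture script (not `nomad operator debug`): polls a handful of
# read-only Nomad API endpoints and dumps the result as JSON for attaching to
# an issue before the instances get recycled.
import json
import time
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # placeholder: any reachable server
NOMAD_TOKEN = ""                      # placeholder: a read-capable ACL token

ENDPOINTS = [
    "/v1/status/leader",
    "/v1/status/peers",
    "/v1/operator/raft/configuration",
    "/v1/agent/members",
    "/v1/agent/self",
]

def fetch(path):
    """GET a Nomad API path and return the decoded JSON body."""
    req = urllib.request.Request(NOMAD_ADDR + path)
    if NOMAD_TOKEN:
        req.add_header("X-Nomad-Token", NOMAD_TOKEN)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    snapshot = {"captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}
    for path in ENDPOINTS:
        try:
            snapshot[path] = fetch(path)
        except Exception as exc:  # a flapping server may refuse or time out
            snapshot[path] = {"error": str(exc)}
    print(json.dumps(snapshot, indent=2))
```

Running something like this from a cron job or an alert hook on each server gives a record of raft membership and leadership even if the instance is recycled a few minutes later.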
Hi @schmichael, thanks for the quick response! This was a plain old cluster, no federation or anything advanced beyond ACLs and a lot of jobs. It was upgraded to 1.4.0 the week before to deal with the autoscaling-while-deploying bug. Unfortunately it looks like our server logs are gone (we haven't needed them before, so we didn't notice they weren't being shipped), and while it seemed like a good idea to have our ASG recycle instances that failed the health check, in hindsight it meant we lost those servers before we could collect anything from them. Now that we know what to look for we've set up alerting to detect this issue, so next time it happens we should be able to capture the server logs and debug information you asked for.
Let me know if there's anything else, and I'll update this ticket when we're able to capture that information the next time this happens.
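A minimal sketch of the shape such alerting could take, assuming the servers' HTTP API is reachable from the monitoring host and that anonymous (or token-authenticated) reads of /v1/agent/health and /v1/status/leader are allowed; the server addresses are placeholders and the alert action is just a print:

```python
# Hypothetical alerting loop: flag servers whose /v1/agent/health check fails
# and flag raft leader changes. Replace print() with whatever pages you.
import time
import urllib.request

SERVERS = ["http://10.0.0.1:4646", "http://10.0.0.2:4646"]  # placeholders

def healthy(base_url):
    """True if /v1/agent/health answers 200 within two seconds."""
    try:
        with urllib.request.urlopen(base_url + "/v1/agent/health", timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False  # timeouts and 5xx responses both count as unhealthy

def current_leader(base_url):
    """Return the raft leader address reported by a server, or None."""
    try:
        with urllib.request.urlopen(base_url + "/v1/status/leader", timeout=2) as resp:
            return resp.read().decode().strip().strip('"')
    except Exception:
        return None

if __name__ == "__main__":
    last_leader = None
    while True:
        for url in SERVERS:
            if not healthy(url):
                print(f"ALERT: {url} failed /v1/agent/health")
        leader = next(filter(None, (current_leader(u) for u in SERVERS)), None)
        if leader and last_leader and leader != last_leader:
            print(f"ALERT: raft leader changed from {last_leader} to {leader}")
        last_leader = leader or last_leader
        time.sleep(30)
```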
Hopefully that does it. Another thing you can try to avoid having to restore from snapshot is to shrink down to 1 server and restart it with
@schmichael So we found the root cause. Turns out it was #14981. When it happened again we saw the keyring errors in the logs, and some quick googling led us to the keyring issues that popped up in the 1.4.0 release.
Thanks for the help on this, and I'm glad it was quickly fixed!
Argh, sorry that hit you @douglaje, but thanks for coming back to close the issue.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.4.0
Operating system and Environment details
Debian 11, Linux 5.10.0-18-cloud-arm64 (running on AWS m6g.medium instance)
5 nomad servers, 20+ clients (this is our staging cluster)
Issue
Note: This eventually led to complete loss of our nomad server quorum, and we had to restore from backup.
The first symptom happened early Saturday morning; call that T-00:00. We had what appears to be a raft leader switch:

After this leader switch, all our current jobs kept running but new allocs (primarily periodic jobs) almost completely stopped.

At the same time, the raft log apply index delta (applyindex change over time) started growing linearly. So did all the other measures of the number of raft logs being applied per second:

At T-10:00 (10 hours later), the nomad servers became so unstable that they began rapidly switching leaders and failing to respond on the /v1/agent/health endpoint, and the cluster lost raft consensus.

Reproduction steps
We haven't been able to reproduce this yet, but we believe it happened to us ~2 months ago. At the time, we believed the cause was that our nomad servers were too small, and so didn't look any deeper into the incident.
I'm raising this issue because I'm at a loss for what to do next to debug it. As far as I can tell it started with a run-of-the-mill leader re-election but it somehow crippled our cluster and then gradually overloaded it.
I'm trying to get logs for at least the leader re-election, but we may have lost them due to a logging misconfiguration plus the instances being recycled once the nomad /v1/agent/health endpoint started timing out.