Client not starting a placed allocation #3227
@nugend It would be hard to diagnose what happened with the current information. Would need to see the logs, evaluations and allocations to really get a clear picture but I can help describe Nomad behaviors in this scenario.
There must have been something that was causing the jobs not to schedule. The scheduler will attempt a placement if there is room and all constraints are met. If these conditions are not met, Nomad creates a follow-up evaluation that blocks until resources are freed or a new node is added that may satisfy the job's constraints. This information can be seen by running
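For reference, a minimal sketch of inspecting a blocked placement from the CLI, assuming a hypothetical job named `example` (the evaluation ID would come from the job's status output):

```
# List the job's evaluations; blocked evaluations show up with status "blocked"
nomad status -evals example

# Inspect one evaluation to see its placement failures (exhausted dimensions,
# unmet constraints, etc.); <eval-id> is a placeholder
nomad eval-status <eval-id>
```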
Nomad uses the priority to give jobs first access to the schedulers, but it will not block other jobs just because a service job is trying to schedule. In future versions of Nomad, the priority will also be used for preemption. Hope that helps! If this happens again, please grab the logs, evals, and allocations for the affected jobs! 👍
@dadgar Pretty sure I saw this again. Logs contained absolutely nothing of interest except that the jobs which were in a crashloop were getting restarted. We had the same configuration problem in two environments this time and the "issue" seemed to only show up in one of the environments, so I'm guessing it's a bug or something. If I can figure out how to reproduce the issue consistently, I'll ask you to reopen.
This may be a bad title for this issue. I've seen this again in two different environments this week. There is really nothing interesting or revealing in the logs. I'm running a 3-node setup where all clients are also masters. The master node is executing allocations; the client nodes are not, and the allocations assigned to those nodes are stuck pending. Still, I've only noticed this when there's an infinite crash loop occurring on an unrelated process. I was trying to determine if this had something to do with leader election, but I don't see any leader elections in the last 24 hours of log files, and unless I'm reading the documentation wrong, I don't think there's a way to see when the last leader election was. I am at a total loss as to what's causing this issue. Here's an example of the nomad job status for one job:
As of running this status command, the local time is 17:26 EST. There doesn't appear to be an issue with actually placing the job, just with getting it out of the pending state.
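As an aside on the leader-election question: the CLI doesn't report the time of the last election directly, but a hedged way to at least see the current leader and Raft peer state (assuming the operator subcommands are available in this Nomad version) is:

```
# Show each server's Raft peer entry and which one is currently the leader
nomad operator raft list-peers

# Per-agent stats; on servers this includes Raft indexes and a leader flag
nomad agent-info
```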
What is the output of
Had to restart the cluster. The next time I see this show up, I'll give that a shot. Hopefully it'll be a while.
Think I'm seeing another similar situation:
I'm pretty sure the Client.Advertise.HTTP property is set since I can access the HTTP API on this and all other nodes, so I'm not sure what that error message is telling me. The agent-info command doesn't specify that value though. Additionally, in case it's pertinent, the command produces identical results from all nodes in the cluster. Here's the node status in case that helps:
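On the Client.Advertise.HTTP question above, a hedged way to see what an agent believes it is advertising is to ask the agent itself over its HTTP API; the field names below are assumptions based on the agent self endpoint, and jq is assumed to be installed:

```
# Query the local agent's own view of its configuration (default HTTP port 4646)
curl -s http://127.0.0.1:4646/v1/agent/self | jq '.config.AdvertiseAddrs'
```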
Can you just confirm to me that this shouldn't be happening? That is, that this isn't due to resource exhaustion?
@nugend Can you post that output? Can you also grab and post the client logs, including from a little before the allocation was created, so we can see the behavior from start to finish in the logs?
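A hedged sketch of gathering what's being asked for here; the allocation ID is a placeholder, and journald-managed client logs are an assumption about how the agent is run:

```
# Full allocation status, including task states and recent task events
nomad alloc-status -verbose <alloc-id>

# Client agent logs from shortly before the allocation was created (systemd assumed)
journalctl -u nomad --since "2 hours ago" > nomad-client.log
```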
@nugend What version of Nomad?
Nomad v0.7.1 (0b295d3)
Can you share the client logs?
That's a bit harder. There's a lot of info that I need to remove, and I don't know if we've kept logs going back far enough to be useful at the moment. I'm working on my end to get the issues preventing me from discussing this more freely cleared up. The "good" news is that this seems to happen regularly enough that we'll almost certainly be able to investigate it again at a later point.
@nugend Do you have reproduction steps that I could use?
Nope. Haven't been able to boil this down to a simple case yet. At the moment, my working theory is "start a metric crapload of one-off jobs" over a few weeks.
We've also seen that our service job allocations are becoming stuck in the pending state.
When running a
The results of
And here are the client logs from when the job was submitted:
Let me know if there is any additional logging I can enable to shed light on why the allocations become stuck.
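One hedged option for getting more detail is raising the agent's log verbosity; the config path below is an assumption, and the same effect can be had by setting log_level = "DEBUG" in the agent configuration file:

```
# Restart the client agent with debug logging enabled (config path is hypothetical)
nomad agent -config /etc/nomad.d -log-level=DEBUG
```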
I think I found the source of the issue for our cluster. One of our servers' log index was out of sync with the rest of the nodes (3-server quorum). We found it by comparing their last_log_index:
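A hedged sketch of that comparison, assuming three hypothetical server hostnames and SSH access to each; the Raft indexes appear in the agent-info output:

```
# Compare Raft progress across the servers; a server whose last_log_index
# stops advancing relative to its peers is the out-of-sync one
for s in nomad-server-1 nomad-server-2 nomad-server-3; do
  echo "== $s =="
  ssh "$s" 'nomad agent-info | grep -E "last_log_index|commit_index|applied_index"'
done
```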
The fix was to stop nomad, delete /var/lib/nomad/server, and start the service again. Once the log entries were in sync, all the issues went away. Hope that helps!
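For reference, a hedged sketch of that remediation, run only on the lagging server while the other two hold quorum; the systemd unit name is an assumption, and the data-dir path comes from the comment above:

```
# Wipe only the local server state so the node rejoins and re-replicates from the leader
systemctl stop nomad
rm -rf /var/lib/nomad/server
systemctl start nomad

# Confirm the server rejoined and its indexes converge with the other servers
nomad server-members
nomad agent-info | grep last_log_index
```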
@vincenthuynh I'll look out for it the next time we see this issue.
@vincenthuynh Nope. We're seeing the issue today and the indexes are all the same in our environment. We'll try your fix anyway, though.
Are you seeing the stuck allocations on a particular client node? We treat our client nodes as 'cattle'/stateless, so we just created a new client node and kicked out the bad one, but it seemed to still come back. Perhaps you can try that?
Yeah. The one thing I'm noticing in the logs is that, after the issue occurs, there are basically no more of those periodic log messages. The healthy nodes are emitting them regularly, with varying counts, like the following:
The unhealthy node shows this as the last instance of that log message, right about when the issue manifested:
Of course, that doesn't exactly tell me whether it's a client or server issue. (Like, is the client blocking for an update and the server never sends one? Or is the server sending updates, but the client goroutine has crashed?)
@vincenthuynh We tried your suggested remediation and it sort of worked. The pending allocations began running, but the agent also tried to start new instances of long-running allocations that were still executing.
Sounds like the scheduler lost track of the long-running allocation and/or got into an inconsistent state... Hopefully someone familiar with the scheduler can help shed some light on this?
@nugend Next time this happens, would you be willing to kill the particular Nomad agent that isn't starting the allocations with a SIGQUIT (signal 3) so we can get a stack dump? Are the allocations that get stuck always using the Docker driver? What else do they have in common? Are they all using templates, Vault, service definitions, etc.? Could you potentially share the job files of a few that have gotten stuck?
For us, they're always raw_exec because that's the only driver we use. There is no consistency in which jobs are affected, except that it happens across a given node that's in a "bad state". That is, usually it's only one node; on occasion it's been two nodes. We haven't seen all three nodes in our setup have this problem yet. I will be sure to capture a stack dump next time. Sorry, I would've done so on the last occurrence had I known.
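For reference, a hedged sketch of capturing that stack dump; SIGQUIT makes the Go runtime print every goroutine's stack to stderr and then exit, so the agent will need to be restarted afterwards, and the journald commands assume a systemd-managed agent:

```
# Send SIGQUIT (signal 3) to the stuck client agent; note this terminates the process
kill -QUIT "$(pgrep -x nomad)"

# The goroutine dump lands on stderr; with systemd it ends up in the journal
journalctl -u nomad --no-pager | tail -n 1000 > nomad-stackdump.txt
```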
Seeing this as well, right now actually. I'll see if I can get a dump over here. @dadgar: do you want it uploaded somewhere?
Not sure if this is related or if I should start a new issue.
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.6.0
I had an issue recently where a number of batch jobs were pending, but they didn't seem to be blocked on any resource. While tailing the logs, Nomad seemed to be in a loop trying to allocate a service job whose sole task was starting and exiting extremely quickly (it had a bad configuration and was using the default restart stanza). I did try stopping it to see if that would cause the pending batch jobs to get allocations, but due to what appears to be an unrelated bug, Nomad continued to try to allocate the service job despite its desired state being dead (#3217 maybe?).
My question is: is it the intended behavior that Nomad continues trying to allocate the service job and ignores the batch jobs for allocation placement? I haven't adjusted any priority values in my current job specifications, so I would presume they are all at the default of 50. Normally, I would actually prefer that the service jobs have a higher priority, but it did seem a little odd that Nomad wasn't trying to schedule anything else.
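For what it's worth, a hedged sketch of forcing a dead service job fully out of scheduler state and then re-checking the blocked batch jobs, assuming a Nomad version where the -purge flag is available and using hypothetical job names:

```
# Stop the crash-looping service job and purge it from Nomad's state entirely
nomad stop -purge bad-service

# Check whether the pending batch job's evaluations are now being placed
nomad status -evals stuck-batch-job
```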