Nomad doesn't observe OOM killed or SIGKILL processes #5408
Comments
Thanks for providing a great reproduction case. We’re looking into this.
Hey @prologic, Thanks for the test case, this bug is pretty interesting. While running on macOS, on killing the binary I see the behaviour we expect to happen (but there is still a bug (1) with executor shutdown): However, when running on Ubuntu 18.04 I found two distinct failure modes.
I'll take a further look into these this week and try to land a fix before the Nomad 0.9 RC1.
Hey @dantoml 2) is what we have observed precisely. Thanks for confirming! We are unable to really utilise macOS in any useful way as we're primarily using the
@tzahimizrahi Very likely the same; but to be precise #5363 I believe is the "Symptom".
This is a critical issue for a scheduler. Can it be treated with higher priority? Thanks.
@prologic I'm curious as to see more of your nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now)
What precisely do you want a repro of? I thought my repro case above was pretty clear. The config/OS I used above was I can see there is a PR up to address part of this problem (but not fully) #5419
@prologic OS config was what I was looking for, thanks :) - I've been failing to repro completely undetected loss of the user process again, which is quite unfortunate, and was hoping there might be a pointer there, but CentOS behaved as expected in my test. I'll spin up a test harness and see if I can automate it.
@prologic Thanks for your patience in this one and it does sound like a very serious bug that we are prioritizing. However, I feel like there are some confounding issues that may mask what's going on. First, how often do you notice nomad not detecting a process being killed out-of-band and how reproducible is it in your environment? Would you be able to demonstrate the case in a Vagrant setup? I have attempted to reproduce the issue in https://github.com/notnoop/nomad-repro-5408 . In my testing, nomad was always able to detect that the job was killed, when I send SIGKILL to either The above behavior of job status is a confusing UX problem, where job status doesn't obviously convey the concrete status of its allocations, and we should address it - but want to ensure I'm not missing other more substantial problems. Did you find cases where a task process (e.g. hello-signal in this case) is killed, yet without nomad alloc status emitting "Terminated" event or a log line like the following (in Nomad 0.8)
Would love to get the client logs for this case; and if you can contribute a script or reproduce the issue in that repo with sample nomad client logs, that will be great!
@notnoop this is not a proper response; I'll try to find the time next week to go over the repro again and share more of what I find.
Closing this issue as it's stale now and I believe that the confusing UX is what's at play here. Please reopen with any info you have to the contrary and we will follow up. That being said, we agree that #5363 and #5395 need to be addressed, and sorry that the UX is getting in the way of distilling/reproducing it.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
This is somewhat related to #5363 and #5395, but I want to create a new issue to steer the discussion more precisely to what I've been able to reproduce.
I've been able to repro this in:
Example reproducer Nomad spec (but not really causing this):
The test binary I'm using is written in Go and its source is:
Steps to repro:
1. `./nomad agent -dev`
2. `./nomad run hello.nomad`
3. Kill the `nomad executor` or `hello-signals` process with `kill -KILL $(pidof hello-signals)` or `pkill -KILL -f 'nomad executor'`.
4. Observe that the Nomad UI, `./nomad status`, and `./nomad alloc status` all still report the Job/Tasks as "Running".
5. Test in both the `raw_exec` and `exec` drivers.
I had a quick cursory glance at the codebase looking for evidence of `waitpid()` being called, and I can see `os/exec.Cmd.Wait()` being called, but I'm not too familiar with the code structure to dive too deep right now.