
Nomad doesn't observe OOM killed or SIGKILL processes #5408

Closed
prologic opened this issue Mar 12, 2019 · 13 comments

Comments

@prologic

This is somewhat related to #5363 and #5395, but I want to create a new issue to steer the discussion more precisely toward what I've been able to reproduce.

I've been able to repro this in:

  • Nomad v0.8.7
  • Nomad v0.9.0-beta2
  • Nomad v0.9.0-beta3

Example reproducer Nomad job spec (the spec itself isn't really the cause):

job "hello" {
  type = "service"
  datacenters = ["dc1"]

  task "hello" {
    driver = "raw_exec"

    config {
      command = "hello-signals"
    }
  }
}

The test binary I'm using is written in Go and its source is:

package main

import (
	"flag"
	"fmt"
	"os"
	"os/signal"
)

var ignore bool

func init() {
	flag.BoolVar(&ignore, "ignore", false, "ignore signals")
}

func main() {
	flag.Parse()

	done := make(chan struct{}, 1)

	c := make(chan os.Signal, 1)
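	// signal.Notify with no signals listed subscribes to all incoming signals.
	// Note that SIGKILL and SIGSTOP can never be caught or ignored, so a
	// `kill -KILL` bypasses this handler entirely; only the parent process
	// can observe the termination, via wait(2)/waitpid(2).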
	signal.Notify(c)

	go func() {
		for {
			s := <-c
			fmt.Printf("caught signal: %d (%s)\n", s, s.String())
			if !ignore {
				done <- struct{}{}
			} else {
				fmt.Println("-ignore flag passed, ignoring signals")
			}
		}
	}()

	fmt.Println("Hello World!")

	fmt.Println("waiting for signal...")
	<-done
}

Steps to repro:

  • Run nomad in "dev" mode: ./nomad agent -dev
  • Deploy the job: ./nomad run hello.nomad
  • Kill the nomad executor or hello-signals process with kill -KILL $(pidof hello-signals) or pkill -KILL -f 'nomad executor'.

Observe that the Nomad UI, ./nomad status, and ./nomad alloc status all still report the Job/Task as "Running".

Tested with both the raw_exec and exec drivers. I had a quick, cursory glance at the codebase looking for evidence of waitpid() being called, and I can see os/exec.Cmd.Wait() being called; but I'm not too familiar with the code structure to dive too deep right now.
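For reference, a minimal sketch of what I mean by the supervisor noticing the kill via os/exec.Cmd.Wait() (which boils down to waitpid(2) on Linux). This is illustrative only, not Nomad's actual executor code, and the sleep child is just a stand-in for hello-signals:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	// Hypothetical long-running child standing in for the real task process.
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Wait blocks in waitpid(2) until the child exits for any reason,
	// including being killed out-of-band with `kill -KILL <pid>`.
	err := cmd.Wait()

	ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
	if ws.Signaled() {
		// For SIGKILL this reports signal 9; shells encode it as exit code 137 (128+9).
		fmt.Printf("child killed by signal %d: %v\n", ws.Signal(), err)
	} else {
		fmt.Printf("child exited with code %d\n", ws.ExitStatus())
	}
}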

@nickethier
Member

Thanks for providing a great reproduction case. We’re looking into this.

@endocrimes
Contributor

endocrimes commented Mar 13, 2019

Hey @prologic,

Thanks for the test case, this bug is pretty interesting.

While running on macOS, killing the binary produces the behaviour we expect (though there is still a bug, (1) below, with executor shutdown):
[screenshot omitted]

However, when running on Ubuntu 18.04 I found two distinct failure modes.

  1. On macOS and Linux: when a nomad executor is SIGKILL'd, the nomad client detects that the executor connection has died, then restarts the task as expected. While this is almost the correct behaviour, it leaks the process that the executor had originally started, and that process will never be terminated (see the sketch after this list).

  2. On Linux (but not macOS), when a task process is SIGKILL'd, the executor supervisor does not always seem to detect this (I saw this behaviour ~twice in a hundred or so attempts), and the client continues to believe that the process is running successfully.
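To illustrate (1): a common way to avoid that kind of leak is to start the child in its own process group so the whole group can be signalled even if its immediate parent dies. This is only a sketch of the general technique (assuming Linux/macOS), not a claim about what the current executor does:

package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "300")
	// Place the child in its own process group; its pgid becomes its pid.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	time.Sleep(time.Second)

	// A negative pid targets the whole process group, so any further
	// children the task spawned are killed along with it.
	syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	cmd.Wait()
}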

I'll take a further look into these this week and try to land a fix before the Nomad 0.9 RC1.

@tzahimizrahi

@dantoml could you confirm this is the same issue as #5363?

@prologic
Author

Hey @dantoml, (2) is what we have observed precisely. Thanks for confirming! We are unable to really utilise macOS in any useful way as we're primarily using the exec driver. But in general I also notice the leaked executor and logmon processes.

@prologic
Author

@tzahimizrahi Very likely the same; but to be precise, I believe #5363 is the symptom.

@johnzhanghua

This is a critical issue for a scheduler. Can it be treated with higher priority? Thanks.

@endocrimes
Contributor

@prologic I'm curious to see more of your Nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now.)

@prologic
Author

> @prologic I'm curious to see more of your Nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now.)

What precisely do you want a repro of? I thought my repro case above was pretty clear. The config/OS I used above was $ nomad agent -dev on CentOS 7.5-1805.

I can see there is a PR up to address part of this problem (but not fully): #5419

@endocrimes
Contributor

@prologic OS config was what I was looking for, thanks :) - I've been failing to reproduce the completely undetected loss of the user process again, which is quite unfortunate, and was hoping there might be a pointer there, but CentOS behaved as expected in my test.

I'll spin up a test harness and see if I can automate it.

@notnoop
Contributor

notnoop commented Mar 21, 2019

@prologic Thanks for your patience on this one, and it does sound like a very serious bug that we are prioritizing. However, I feel like there are some confounding issues that may mask what's going on.

First, how often do you notice nomad not detecting a process being killed out-of-band and how reproducible is it in your environment? Would you be able to demonstrate the case in a Vagrant setup?

I have attempted to reproduce the issue in https://github.com/notnoop/nomad-repro-5408 . In my testing, Nomad was always able to detect that the job was killed when I sent SIGKILL to either nomad executor or hello-signals, though the job status may be marked as "running" when the alloc is due to be restarted. When setting a different restart/reschedule policy, the job status may transition to a pending/failed state. When viewing the UI, the allocation client status appears under the job details or alloc page.

The above behavior of job status is a confusing UX problem, where job status doesn't obviously convey the concrete status of its allocations, and we should address it - but I want to ensure I'm not missing other, more substantial problems.

Did you find cases where a task process (e.g. hello-signals in this case) is killed, yet nomad alloc status does not emit a "Terminated" event or a log line like the following (in Nomad 0.8)?

    2019/03/20 21:37:26.947950 [INFO] client: task "hello" for alloc "1b0ba82f-6bc3-f056-043a-93c37b4432f5" failed: Wait returned exit code 137, signal 9, and error <nil>
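(For context: 137 here is the conventional shell encoding of death-by-signal, 128 plus the signal number, i.e. 128 + 9 for SIGKILL. A trivial illustration of the mapping, not tied to Nomad's code:)

package main

import (
	"fmt"
	"syscall"
)

func main() {
	const exitCode = 137
	// Exit codes above 128 conventionally encode 128 + signal number.
	sig := syscall.Signal(exitCode - 128)
	fmt.Printf("exit code %d => signal %d (%v)\n", exitCode, sig, sig)
}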

Would love to get the client logs for this case; and if you can contribute a script or reproduce the issue in that repo with sample nomad client logs, that would be great!

@prologic
Author

@notnoop this is not a proper response; I'll try to find the time next week to go over the repro again and share more of what I find.

@notnoop
Contributor

notnoop commented May 2, 2019

Closing this issue as it's stale now and I believe that the confusing UX is what's at play here. Please reopen with any info you have to the contrary and we will follow up.

That being said, we agree that #5363 and #5395 need to be addressed, and sorry that the UX is getting in the way of distilling/reproducing it.

@notnoop notnoop closed this as completed May 2, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022