
Nomad doesn't observe OOM killed or SIGKILL processes #5408

Closed
prologic opened this issue Mar 12, 2019 · 13 comments

Comments

@prologic

This is somewhat related to #5363 and #5395, but I want to create a new issue to steer the discussion more precisely toward what I've been able to reproduce.

I've been able to repro this in:

  • Nomad v0.8.7
  • Nomad v0.9.0-beta2
  • Nomad v0.9.0-beta3

Example reproducer Nomad job spec (the spec itself isn't really the cause):

job "hello" {
  type = "service"
  datacenters = ["dc1"]

  task "hello" {
    driver = "raw_exec"

    config {
      command = "hello-signals"
    }
  }
}

The test binary I'm using is written in Go and its source is:

package main

import (
	"flag"
	"fmt"
	"os"
	"os/signal"
)

var ignore bool

func init() {
	flag.BoolVar(&ignore, "ignore", false, "ignore signals")
}

func main() {
	flag.Parse()

	done := make(chan struct{}, 1)

	c := make(chan os.Signal, 1)
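	// signal.Notify with no signals listed subscribes to all incoming signals.
	// Note that SIGKILL and SIGSTOP can never be caught or ignored, so a
	// `kill -KILL` bypasses this handler entirely; only the parent process
	// can observe the termination, via wait(2)/waitpid(2).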
	signal.Notify(c)

	go func() {
		for {
			s := <-c
			fmt.Printf("caught signal: %d (%s)\n", s, s.String())
			if !ignore {
				done <- struct{}{}
			} else {
				fmt.Println("-ignore flag passed, ignoring signals")
			}
		}
	}()

	fmt.Println("Hello World!")

	fmt.Println("waiting for signal...")
	<-done
}

Steps to repro:

  • Run nomad in "dev" mode: ./nomad agent -dev
  • Deploy the job: ./nomad run hello.nomad
  • Kill the nomad executor or hello-signals process with kill -KILL $(pidof hello-signals) or pkill -KILL -f 'nomad executor'.

Observe that the Nomad UI, ./nomad status, and ./nomad alloc status all still report the Job/Task as "Running".

Tested with both the raw_exec and exec drivers. I had a quick, cursory glance at the codebase looking for evidence of waitpid() being called, and I can see os/exec.Cmd.Wait() being called; but I'm not too familiar with the code structure to dive too deep right now.
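For reference, a minimal sketch of what I mean by the supervisor noticing the kill via os/exec.Cmd.Wait() (which boils down to waitpid(2) on Linux). This is illustrative only, not Nomad's actual executor code, and the sleep child is just a stand-in for hello-signals:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	// Hypothetical long-running child standing in for the real task process.
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Wait blocks in waitpid(2) until the child exits for any reason,
	// including being killed out-of-band with `kill -KILL <pid>`.
	err := cmd.Wait()

	ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
	if ws.Signaled() {
		// For SIGKILL this reports signal 9; shells encode it as exit code 137 (128+9).
		fmt.Printf("child killed by signal %d: %v\n", ws.Signal(), err)
	} else {
		fmt.Printf("child exited with code %d\n", ws.ExitStatus())
	}
}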

@nickethier
Member

Thanks for providing a great reproduction case. We’re looking into this.

@endocrimes
Contributor

endocrimes commented Mar 13, 2019

Hey @prologic,

Thanks for the test case, this bug is pretty interesting.

While running on macOS, killing the binary produces the behaviour we expect (though there is still a bug, (1) below, with executor shutdown):
[screenshot omitted]

However, when running on Ubuntu 18.04 I found two distinct failure modes.

  1. On macOS and Linux: when a nomad executor is SIGKILL'd, the nomad client detects that the executor connection has died, then restarts the task as expected. While this is almost the correct behaviour, it leaks the process that the executor had originally started, and that process will never be terminated (see the sketch after this list).

  2. On Linux (but not macOS), when a task process is SIGKILL'd, the executor supervisor does not always seem to detect this (I saw this behaviour ~twice in a hundred or so attempts), and the client continues to believe that the process is running successfully.
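To illustrate (1): a common way to avoid that kind of leak is to start the child in its own process group so the whole group can be signalled even if its immediate parent dies. This is only a sketch of the general technique (assuming Linux/macOS), not a claim about what the current executor does:

package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "300")
	// Place the child in its own process group; its pgid becomes its pid.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	time.Sleep(time.Second)

	// A negative pid targets the whole process group, so any further
	// children the task spawned are killed along with it.
	syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	cmd.Wait()
}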

I'll take a further look into these this week and try to land a fix before the Nomad 0.9 RC1.

@tzahimizrahi

@dantoml could you confirm this is the same issue as #5363?

@prologic
Author

Hey @dantoml, (2) is what we have observed precisely. Thanks for confirming! We are unable to really utilise macOS in any useful way as we're primarily using the exec driver. But in general I also notice the leaked executor and logmon processes.

@prologic
Author

@tzahimizrahi Very likely the same; but to be precise, I believe #5363 is the symptom.

@johnzhanghua

This is a critical issue for a scheduler. Can it be treated with higher priority? Thanks.

@endocrimes
Contributor

@prologic I'm curious to see more of your Nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now.)

@prologic
Author

> @prologic I'm curious to see more of your Nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now.)

What precisely do you want a repro of? I thought my repro case above was pretty clear. The config/OS I used above was $ nomad agent -dev on CentOS 7.5-1805.

I can see there is a PR up to address part of this problem (but not fully): #5419

@endocrimes
Contributor

@prologic OS config was what I was looking for, thanks :) - I've been failing to reproduce the completely undetected loss of the user process again, which is quite unfortunate, and was hoping there might be a pointer there, but CentOS behaved as expected in my test.

I'll spin up a test harness and see if I can automate it.

@notnoop
Contributor

notnoop commented Mar 21, 2019

@prologic Thanks for your patience on this one, and it does sound like a very serious bug that we are prioritizing. However, I feel like there are some confounding issues that may mask what's going on.

First, how often do you notice nomad not detecting a process being killed out-of-band and how reproducible is it in your environment? Would you be able to demonstrate the case in a Vagrant setup?

I have attempted to reproduce the issue in https://github.com/notnoop/nomad-repro-5408 . In my testing, Nomad was always able to detect that the job was killed when I sent SIGKILL to either nomad executor or hello-signals, though the job status may be marked as "running" when the alloc is due to be restarted. When setting a different restart/reschedule policy, the job status may transition to a pending/failed state. When viewing the UI, the allocation client status appears under the job details or alloc page.

The above behavior of job status is a confusing UX problem, where job status doesn't obviously convey the concrete status of its allocations, and we should address it - but I want to ensure I'm not missing other, more substantial problems.

Did you find cases where a task process (e.g. hello-signals in this case) is killed, yet nomad alloc status does not emit a "Terminated" event or a log line like the following (in Nomad 0.8)?

    2019/03/20 21:37:26.947950 [INFO] client: task "hello" for alloc "1b0ba82f-6bc3-f056-043a-93c37b4432f5" failed: Wait returned exit code 137, signal 9, and error <nil>
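(For context: 137 here is the conventional shell encoding of death-by-signal, 128 plus the signal number, i.e. 128 + 9 for SIGKILL. A trivial illustration of the mapping, not tied to Nomad's code:)

package main

import (
	"fmt"
	"syscall"
)

func main() {
	const exitCode = 137
	// Exit codes above 128 conventionally encode 128 + signal number.
	sig := syscall.Signal(exitCode - 128)
	fmt.Printf("exit code %d => signal %d (%v)\n", exitCode, sig, sig)
}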

Would love to get the client logs for this case; and if you can contribute a script or reproduce the issue in that repo with sample nomad client logs, that would be great!

@prologic
Author

@notnoop this is not a proper response; I'll try to find the time next week to go over the repro again and share more of what I find.

@notnoop
Contributor

notnoop commented May 2, 2019

Closing this issue as it's stale now and I believe that the confusing UX is what's at play here. Please reopen with any info you have to the contrary and we will follow up.

That being said, we agree that #5363 and #5395 need to be addressed, and sorry that the UX is getting in the way of distilling/reproducing it.

@notnoop notnoop closed this as completed May 2, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022