
Health check routine leaks using new nomad provider #15477

Closed
thetooth opened this issue Dec 6, 2022 · 1 comment · Fixed by #15855
Assignees
shoenig
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/service-discovery/nomad type/bug
Milestone
1.4.x

Comments

thetooth commented Dec 6, 2022

Nomad version

Nomad v1.4.3 (f464aca)

Issue

I have a service that logs HTTP requests and noticed that the endpoint given for health checking is being hit a few hundred times per second. There is a pretty aggressive restart policy on this job, and we had a netsplit issue last night which led to the service restarting around 600 times, so the logs are quite busy, to say the least.

Reproduction steps

Run the job below and either stop the job and resubmit it, or have the process crash. The number of requests hitting the service increases until Nomad is restarted.

Job file (if appropriate)

job "signaling" {
  datacenters = ["cloud"]
  type        = "service"

  reschedule {
    unlimited      = true
    delay          = "15s"
    delay_function = "constant"
    attempts       = 0
  }

  group "signaling" {
    restart {
      attempts = 2
      delay    = "1s"
      interval = "15s"
      mode     = "fail"
    }

    volume "opt" {
      type      = "host"
      source    = "opt"
      read_only = true
    }

    network {
      mode = "host"
      port "api" {
        static = 8000
      }
    }

    task "signaling" {
      driver = "exec"

      config {
        command = "entry.bash"
      }

      template {
        data = <<EOH
#!/bin/bash

/local/opt/signaling -bind=:8000
EOH

        destination = "local/entry.bash"
        perms       = "755"
      }

      template {
        data        = <<EOH
{{ with nomadVar "proapps" -}}
PROAPPS_ID={{ .id }}
PROAPPS_SECRET={{ .secret }}
{{- end }}
EOH
        destination = "local/file.env"
        env         = true
      }

      volume_mount {
        volume      = "opt"
        destination = "/local/opt"
      }

      resources {
        cpu    = 100 # Mhz
        memory = 512 # Mb
      }

      service {
        name     = "api"
        port     = "api"
        provider = "nomad"
        check {
          type     = "http"
          path     = "/debug/pprof/"
          interval = "5s"
          timeout  = "1s"
          method   = "GET"
        }
      }
    }
  }
}
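
For reference, a stand-in for the health-checked service only needs to serve the check path and count requests. The sketch below is an illustration (it is not the actual signaling binary from the job above): it binds to :8000 and logs how many check requests arrive each second, which makes the growth after every restart or reschedule easy to see.

// Illustrative stand-in for the health-checked service (hypothetical, not the
// reporter's signaling binary): it serves the check path and logs the number
// of requests it receives each second.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var hits atomic.Int64

	http.HandleFunc("/debug/pprof/", func(w http.ResponseWriter, r *http.Request) {
		hits.Add(1)
		w.WriteHeader(http.StatusOK)
	})

	// Report the observed check rate once per second.
	go func() {
		for range time.Tick(time.Second) {
			log.Printf("check requests in the last second: %d", hits.Swap(0))
		}
	}()

	log.Fatal(http.ListenAndServe(":8000", nil))
}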
@shoenig shoenig self-assigned this Dec 6, 2022
@shoenig shoenig added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Jan 23, 2023
@shoenig shoenig added this to the 1.4.x milestone Jan 23, 2023

shoenig commented Jan 23, 2023

Thanks for the report @thetooth, and apologies for the slow response. I was able to reproduce this with the simpler job and bash script below. AFAICT the duplication in requests to the health check happens on reschedule, which TBH is surprising, but at least now I know where to look.

job "demo" {
  datacenters = ["dc1"]

  group "group1" {
    network {
      mode = "host"
      port "http" {
        static = 8888
      }
    }
    
    reschedule {
      unlimited = true
      delay = "15s"
      delay_function = "constant"
      attempts = 0
    }
    
    restart {
      attempts = 2
      delay = "1s"
      interval = "15s"
      mode = "fail"
    }

    task "task1" {
      driver = "raw_exec"
      user = "shoenig"

      config {
        command = "python3"
        args = ["-m", "http.server", "8888", "--directory", "/tmp"]
      }

      service {
        provider = "nomad"
        port     = "http"
        check {
          path     = "/"
          type     = "http"
          interval = "3s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
#!/usr/bin/env bash

# Repeatedly kill the http.server process so Nomad keeps restarting and
# eventually rescheduling the task.
while true
do
  sleep 5
  # "[h]ttp.server" keeps grep from matching itself; awk pulls the PID column.
  pid=$(ps -ef | grep '[h]ttp.server' | head -n1 | awk '{print $2}')
  echo "kill pid $pid"
  kill -9 "$pid"
done
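
In Go terms, the leak looks roughly like the pattern below (an illustrative sketch, not Nomad's actual source, and all names are hypothetical): each service check runs in a ticker-driven goroutine that only exits when a stop signal is delivered. If the shutdown path for an allocation is skipped, its check loop keeps polling alongside the checks of the replacement allocation, so the request rate grows with every reschedule.

// Illustrative only: a ticker-driven check loop of the kind that can leak if
// its stop channel is never closed. Names here are hypothetical, not Nomad's.
package main

import (
	"log"
	"net/http"
	"time"
)

// runCheck polls url every interval until stop is closed.
func runCheck(url string, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return // proper shutdown: the goroutine exits
		case <-t.C:
			resp, err := http.Get(url)
			if err != nil {
				log.Printf("check failed: %v", err)
				continue
			}
			resp.Body.Close()
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runCheck("http://127.0.0.1:8888/", 3*time.Second, stop)
	// If nothing ever closes stop (the bug's effect), this goroutine keeps
	// polling even after the allocation it belonged to is gone.
	time.Sleep(10 * time.Second)
	close(stop)
}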

shoenig added a commit that referenced this issue Jan 23, 2023
This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477
shoenig added a commit that referenced this issue Jan 27, 2023
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
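
The shape of that fix, sketched with hypothetical types rather than Nomad's real allocrunner API: on the final update for a terminal allocation, the pre-kill (cleanup) hooks run even when no live tasks remain, instead of waiting for the allocation to be garbage collected.

// Hedged sketch of the fix described above; hypothetical names, not Nomad's
// actual allocrunner code.
package main

import "log"

// preKillHook is a stand-in for a cleanup hook, e.g. one that stops the
// nomad-provider check watcher for the allocation.
type preKillHook func()

type allocRunner struct {
	id       string
	hooks    []preKillHook
	hooksRun bool
}

// onTerminalUpdate handles the final, terminal update for the allocation.
func (ar *allocRunner) onTerminalUpdate(liveTasks int) {
	if liveTasks > 0 {
		return // killing the live tasks runs the hooks on the normal path
	}
	// Previously this case returned without running hooks, leaking the
	// check goroutines until the allocation was garbage collected.
	if !ar.hooksRun {
		for _, h := range ar.hooks {
			h()
		}
		ar.hooksRun = true
		log.Printf("alloc %s: cleanup hooks ran on terminal update", ar.id)
	}
}

func main() {
	ar := &allocRunner{
		id:    "example-alloc",
		hooks: []preKillHook{func() { log.Println("stopping check watcher") }},
	}
	ar.onTerminalUpdate(0)
}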