Tasks that are signaled to restart fail due to logmon errors #5574

Closed
benvanstaveren opened this issue Apr 17, 2019 · 18 comments

Comments

@benvanstaveren

Nomad version

0.9.0

Operating system and Environment details

Linux, amd64

Issue

Any task that is signaled to restart (due to template changes, etc.) fails to restart properly due to logmon RPC errors.

Reproduction steps

Run a service or system task with a template set to change_mode = "restart", then change the template (or a secret the template renders) and watch it blow up.
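
For context, here is a minimal sketch of the kind of template stanza involved; the task name, destination, and secret path are placeholders rather than values from the affected job:

task "example" {
    driver = "docker"

    # Hypothetical template: when the rendered secret changes, the file is
    # re-rendered and, with change_mode = "restart", the task is signaled
    # to restart - which is where the logmon error appears.
    template {
        destination = "local/app.conf"
        change_mode = "restart"
        data = <<EOT
{{ with secret "secret/example" }}password = "{{ .Data.password }}"{{ end }}
EOT
    }
}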

Nomad Client logs (if appropriate)

Events from a failed allocation:

51 minutes | 1 day | Restart Signaled |   |   | 0
51 minutes | 0 seconds | Terminated |   |   | 0
51 minutes | 0 seconds | Restarting |   |   | 0
51 minutes | 0 seconds | Task hook failed | logmon: rpc error: code = Unavailable desc = transport is closing |   | 0
51 minutes | 0 seconds | Not Restarting | Error was unrecoverable

@endocrimes
Contributor

hey @benvanstaveren,

Thanks for reporting this - We haven't been able to reproduce this with the docker or raw_exec drivers so could do with some more information to diagnose+fix the problem.

Please could you share your nomad job file, and if at all possible, verbose logs from a nomad-client running the allocation during the time of the restart?

If you don't feel comfortable sharing these on GitHub, you can also email nomad-oss-debug@hashicorp.com, which is a mailing list readable by all HashiCorp employees, but not by the public.

Thank you!

@benvanstaveren
Author

benvanstaveren commented Apr 17, 2019

hey @benvanstaveren,

Thanks for reporting this - We haven't been able to reproduce this with the docker or raw_exec drivers so could do with some more information to diagnose+fix the problem.

Please could you share your nomad job file, and if at all possible, verbose logs from a nomad-client running the allocation during the time of the restart?

If you don't feel comfortable sharing these on GitHub, you can also email nomad-oss-debug@hashicorp.com, which is a mailing list readable by all HashiCorp employees, but not by the public.

Thank you!

Sure, no problem :) It actually happened to a system job that runs Telegraf to collect (of all things) Nomad metrics, so here is the job file (with a few items redacted).

Job file:

job "nomad-metrics" {
    datacenters = [ "FSN", "HEL", "GRA", "PAR" ]
    type = "system"

    update {
        stagger = "1s"
        max_parallel = 1
    }

    group "telegraf" {
        restart {
            attempts = 10
            interval = "5m"
            delay = "25s"
            mode = "delay"
        }

        task "telegraf" {
            vault {
                policies = [ "nomad-certificate-client", "influx-nomadmetrics" ]
            }
            driver = "docker"
            config {
                network_mode = "host"
                userns_mode = "host"
                image = "telegraf:alpine"
                force_pull = true
                args = [
                    "-config",
                    "/local/telegraf.conf"
                ]
            }

            template {
                destination = "local/nomad.crt"
                data = <<EOHCERT
{{ with secret "pki/nomad/issue/client-certificate" "common_name=xxxx.global.nomad" "ttl=604800" }}
{{ .Data.certificate }}{{ end }}
EOHCERT
            }
            template {
                destination = "local/nomad.key"
                data = <<EOHKEY
{{ with secret "pki/nomad/issue/client-certificate" "common_name=xxxx.global.nomad" "ttl=604800" }}
{{ .Data.private_key }}{{ end }}
EOHKEY
            }
            artifact {
                source      = "http://vault-instance-here/v1/pki/nomad/ca/pem"
                destination = "local/nomad-ca.pem"
                mode        = "file"
            }

            template {
                destination = "local/telegraf.conf"
                change_mode = "restart"
                data = <<EOTC
# Adding Client class
# This should be here until https://github.com/hashicorp/nomad/pull/3882 is merged
{{ $node_class := env "node.class" }}
[global_tags]
{{ if $node_class }}
  nomad_client_class = "{{ env "node.class" }}"
{{else}}
  nomad_client_class = "none"
{{ end }}

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "3s"
  precision = ""
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://influxdb.service.consul:8086"]
  database = "nomad_metrics"
{{ with secret "db/influxdb/creds/nomadmetrics" }}
  username = "{{ .Data.username }}"
  password = "{{ .Data.password }}"
{{ end }}
  retention_policy = "autogen"
  timeout = "5s"
  user_agent = "telegraf-{{env "NOMAD_TASK_NAME" }}"
  skip_database_creation = true
[[inputs.prometheus]]
  urls = ["https://{{ env "meta.public_ip" }}:4646/v1/metrics?format=prometheus"]
  tls_ca = "/local/nomad-ca.pem"
  tls_cert = "/local/nomad.crt"
  tls_key = "/local/nomad.key"
  insecure_skip_verify = true
EOTC
            }
        }
    }
}

Here are the log entries relevant to the above job being restarted and failing. Unfortunately I can't enable verbose logging at the moment because this is our production environment, so for the time being these are the default logs that Nomad tosses into syslog.

Apr 17 13:10:20 xxx-001 nomad[29089]: client.driver_mgr.docker: started container: driver=docker container_id=ea8510277907a65e41dfe103e43cf9f00697b2f62ee62e17464b2caf4a6eef20
Apr 17 13:10:20 xxx-001 consul[19698]:     2019/04/17 11:10:20 [INFO] agent: Synced service "_nomad-task-42agarg7i5xa2uojvbqsjhzru2wnfaw7"
Apr 17 15:40:40 xxx-001 nomad[29089]:     2019/04/17 13:40:40.377644 [INFO] (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.crt"
Apr 17 15:40:40 xxx-001 nomad[29089]: (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.crt"
Apr 17 15:40:40 xxx-001 nomad[29089]:     2019/04/17 13:40:40.385787 [INFO] (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.key"
Apr 17 15:40:40 xxx-001 nomad[29089]: (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.key"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.830Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason= delay=0s
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason= delay=0s
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.833Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.834Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="prestart hook "logmon" failed: rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.834Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason="Error was unrecoverable"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: prestart failed: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="prestart hook "logmon" failed: rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: not restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason="Error was unrecoverable"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.836Z [INFO ] client.gc: marking allocation for GC: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510
Apr 17 15:40:41 xxx-001 nomad[29089]: client.gc: marking allocation for GC: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510

Also worth noting: this only seems to happen on jobs with tasks that have network_mode = "host" and userns_mode = "host" set.

@benvanstaveren
Author

I can probably switch the log level to debug on one of our production boxes and trigger the error, but it has to wait until nighttime in Europe or I'll have the entire support desk in my office with pitchforks and torches.

@endocrimes
Contributor

@benvanstaveren That would be very useful - I'm also based in the EU so I can pick this up first thing tomorrow if so.

@benvanstaveren
Author

@benvanstaveren That would be very useful - I'm also based in the EU so I can pick this up first thing tomorrow if so.

I have emailed the logs to the address mentioned above, along with the job file, since it's a rather large file and has too much info in it for me to redact. However, an item of interest: the exact same task in another job (same image, same configuration) that does not run with userns_mode = "host" and network_mode = "host" in the docker driver config section worked fine and was restarted properly.

@benvanstaveren
Author

Alright, the email has been sent - hope it gives you some clues. It does seem to be tied to having all, or some combination, of the userns_mode, network_mode, and privileged settings in the docker driver config for a task.
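
For reference, a sketch of the docker driver settings in question; the privileged flag is included only because it is mentioned above as part of the suspected combination and is not taken from the job file:

config {
    image        = "telegraf:alpine"
    network_mode = "host"   # host networking, as in the failing job
    userns_mode  = "host"   # host user namespace, as in the failing job
    privileged   = true     # suspected only; not present in the job file shown above
}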

@benvanstaveren
Author

Okay, sorry for the amount of noise, but here's what I've managed to piece together:

network_mode/userns_mode/privileged apparently isn't the culprit - or it may be, but the job that restarted successfully had labels set in the docker config:

            labels {
                "service_id"            = "${NOMAD_JOB_NAME}/${NOMAD_GROUP_NAME}/${NOMAD_TASK_NAME}"
                "service_job"           = "${NOMAD_JOB_NAME}"
                "log_format"            = "plain"
            }

I have added this same set of labels to the job that failed (the one I sent the details for by email) and tried to reproduce the problem, and nothing happened; the tasks restarted fine, just like the other one.
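
For clarity, a sketch of how that labels block sits inside the docker driver config of the previously failing task; this is assembled from the stanzas quoted above and describes an observed workaround, not a confirmed fix:

config {
    network_mode = "host"
    userns_mode  = "host"
    image        = "telegraf:alpine"
    force_pull   = true

    # Adding this labels block appeared to make signaled restarts succeed again.
    labels {
        "service_id"  = "${NOMAD_JOB_NAME}/${NOMAD_GROUP_NAME}/${NOMAD_TASK_NAME}"
        "service_job" = "${NOMAD_JOB_NAME}"
        "log_format"  = "plain"
    }
}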

@preetapan
Contributor

@benvanstaveren while we keep investigating this, we wanted to let you know that we made our logmon process a bit more reliable with some fixes in 0.9.1-rc, available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. Would you be able to try it out and let us know if you are still seeing problems with restarts?

@benvanstaveren
Author

I can try, but it won't be this week; I'll have to do this over a weekend so I don't interrupt production :)

@endocrimes
Contributor

@benvanstaveren Thank you!

@mpodkolzin

It seems that I'm experiencing the same problem on Windows with both 0.9.0 and 0.9.1-rc1.

@benvanstaveren
Author

Sorry folks, I couldn't pull it off over the weekend; I'll see if I can make an attempt during a weeknight with the rc1. Do I need to update the entire cluster to rc1, or just the machines running the task that causes the issues?

@endocrimes
Contributor

@benvanstaveren just the clients should be fine 👍 - thanks and no problem

@endocrimes
Contributor

@mpodkolzin Do you happen to have logs from the clients and also task events involved? - Thanks!

@benvanstaveren
Author

I noticed the rc1 has been released as 0.9.1, so I'll put in an upgrade and see if I can trigger the error, probably sometime this week; I'll inform you all of the result here :)

@notnoop
Contributor

notnoop commented Apr 30, 2019

@benvanstaveren btw - 0.9.1 included a few fixes in addition to the ones in rc1, namely #5616. Looking at your logs from earlier, I believe this is the case you are hitting. Please try it and let us know.

@stale

stale bot commented May 30, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed May 30, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2022