Tasks that are signaled to restart fail due to logmon errors #5574

Closed
benvanstaveren opened this issue Apr 17, 2019 · 18 comments

Comments

@benvanstaveren

Nomad version

0.9.0

Operating system and Environment details

Linux, amd64

Issue

Any task that is signaled to restart (due to template changes, etc.) fails to restart properly due to logmon RPC errors.

Reproduction steps

Run a service or system task with a template set to change_mode = "restart", then change the template (or a secret the template renders) and watch it blow up.
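
For context, here is a minimal sketch of the kind of template stanza involved; the task name, destination, and secret path are placeholders rather than values from the affected job:

task "example" {
    driver = "docker"

    # Hypothetical template: when the rendered secret changes, the file is
    # re-rendered and, with change_mode = "restart", the task is signaled
    # to restart - which is where the logmon error appears.
    template {
        destination = "local/app.conf"
        change_mode = "restart"
        data = <<EOT
{{ with secret "secret/example" }}password = "{{ .Data.password }}"{{ end }}
EOT
    }
}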

Nomad Client logs (if appropriate)

Events from a failed allocation:

51 minutes | 1 day | Restart Signaled |   |   | 0
51 minutes | 0 seconds | Terminated |   |   | 0
51 minutes | 0 seconds | Restarting |   |   | 0
51 minutes | 0 seconds | Task hook failed | logmon: rpc error: code = Unavailable desc = transport is closing |   | 0
51 minutes | 0 seconds | Not Restarting | Error was unrecoverable

@endocrimes
Contributor

hey @benvanstaveren,

Thanks for reporting this - We haven't been able to reproduce this with the docker or raw_exec drivers so could do with some more information to diagnose+fix the problem.

Please could you share your nomad job file, and if at all possible, verbose logs from a nomad-client running the allocation during the time of the restart?

If you don't feel comfortable sharing these on GitHub, you can also email nomad-oss-debug@hashicorp.com, which is a mailing list readable by all HashiCorp employees, but not by the public.

Thank you!

@benvanstaveren
Author

benvanstaveren commented Apr 17, 2019

hey @benvanstaveren,

Thanks for reporting this - We haven't been able to reproduce this with the docker or raw_exec drivers so could do with some more information to diagnose+fix the problem.

Please could you share your nomad job file, and if at all possible, verbose logs from a nomad-client running the allocation during the time of the restart?

If you don't feel comfortable sharing these on GitHub, you can also email nomad-oss-debug@hashicorp.com, which is a mailing list readable by all HashiCorp employees, but not by the public.

Thank you!

Sure, no problem :) It actually happened to a system job that runs Telegraf to collect (of all things) Nomad metrics, so here is the job file (with a few items redacted).

Job file:

job "nomad-metrics" {
    datacenters = [ "FSN", "HEL", "GRA", "PAR" ]
    type = "system"

    update {
        stagger = "1s"
        max_parallel = 1
    }

    group "telegraf" {
        restart {
            attempts = 10
            interval = "5m"
            delay = "25s"
            mode = "delay"
        }

        task "telegraf" {
            vault {
                policies = [ "nomad-certificate-client", "influx-nomadmetrics" ]
            }
            driver = "docker"
            config {
                network_mode = "host"
                userns_mode = "host"
                image = "telegraf:alpine"
                force_pull = true
                args = [
                    "-config",
                    "/local/telegraf.conf"
                ]
            }

            template {
                destination = "local/nomad.crt"
                data = <<EOHCERT
{{ with secret "pki/nomad/issue/client-certificate" "common_name=xxxx.global.nomad" "ttl=604800" }}
{{ .Data.certificate }}{{ end }}
EOHCERT
            }
            template {
                destination = "local/nomad.key"
                data = <<EOHKEY
{{ with secret "pki/nomad/issue/client-certificate" "common_name=xxxx.global.nomad" "ttl=604800" }}
{{ .Data.private_key }}{{ end }}
EOHKEY
            }
            artifact {
                source      = "http://vault-instance-here/v1/pki/nomad/ca/pem"
                destination = "local/nomad-ca.pem"
                mode        = "file"
            }

            template {
                destination = "local/telegraf.conf"
                change_mode = "restart"
                data = <<EOTC
# Adding Client class
# This should be here until https://github.com/hashicorp/nomad/pull/3882 is merged
{{ $node_class := env "node.class" }}
[global_tags]
{{ if $node_class }}
  nomad_client_class = "{{ env "node.class" }}"
{{else}}
  nomad_client_class = "none"
{{ end }}

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "3s"
  precision = ""
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://influxdb.service.consul:8086"]
  database = "nomad_metrics"
{{ with secret "db/influxdb/creds/nomadmetrics" }}
  username = "{{ .Data.username }}"
  password = "{{ .Data.password }}"
{{ end }}
  retention_policy = "autogen"
  timeout = "5s"
  user_agent = "telegraf-{{env "NOMAD_TASK_NAME" }}"
  skip_database_creation = true
[[inputs.prometheus]]
  urls = ["https://{{ env "meta.public_ip" }}:4646/v1/metrics?format=prometheus"]
  tls_ca = "/local/nomad-ca.pem"
  tls_cert = "/local/nomad.crt"
  tls_key = "/local/nomad.key"
  insecure_skip_verify = true
EOTC
            }
        }
    }
}

Here are the log entries relevant to the above job being restarted and failing. Unfortunately I can't enable verbose logging at the moment because this is our production environment, so for the time being these are the default logs that Nomad tosses into syslog.

Apr 17 13:10:20 xxx-001 nomad[29089]: client.driver_mgr.docker: started container: driver=docker container_id=ea8510277907a65e41dfe103e43cf9f00697b2f62ee62e17464b2caf4a6eef20
Apr 17 13:10:20 xxx-001 consul[19698]:     2019/04/17 11:10:20 [INFO] agent: Synced service "_nomad-task-42agarg7i5xa2uojvbqsjhzru2wnfaw7"
Apr 17 15:40:40 xxx-001 nomad[29089]:     2019/04/17 13:40:40.377644 [INFO] (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.crt"
Apr 17 15:40:40 xxx-001 nomad[29089]: (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.crt"
Apr 17 15:40:40 xxx-001 nomad[29089]:     2019/04/17 13:40:40.385787 [INFO] (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.key"
Apr 17 15:40:40 xxx-001 nomad[29089]: (runner) rendered "(dynamic)" => "/nomad/data/alloc/20f6c288-ab07-6972-cd71-7ef6c9ebf510/telegraf/local/nomad.key"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.830Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason= delay=0s
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason= delay=0s
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.833Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner.task_hook: failed to start logmon: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.834Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="prestart hook "logmon" failed: rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.834Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason="Error was unrecoverable"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: prestart failed: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf error="prestart hook "logmon" failed: rpc error: code = Unavailable desc = transport is closing"
Apr 17 15:40:41 xxx-001 nomad[29089]: client.alloc_runner.task_runner: not restarting task: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510 task=telegraf reason="Error was unrecoverable"
Apr 17 15:40:41 xxx-001 nomad[29089]:     2019-04-17T13:40:41.836Z [INFO ] client.gc: marking allocation for GC: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510
Apr 17 15:40:41 xxx-001 nomad[29089]: client.gc: marking allocation for GC: alloc_id=20f6c288-ab07-6972-cd71-7ef6c9ebf510

Also worth noting: this only seems to happen on jobs with tasks that have network_mode = "host" and userns_mode = "host" set.

@benvanstaveren
Author

I can probably switch the log level to debug on one of our production boxes and trigger the error, but it has to wait until nighttime in Europe or I'll have the entire support desk in my office with pitchforks and torches.

@endocrimes
Contributor

@benvanstaveren That would be very useful - I'm also based in the EU so I can pick this up first thing tomorrow if so.

@benvanstaveren
Author

@benvanstaveren That would be very useful - I'm also based in the EU so I can pick this up first thing tomorrow if so.

I have emailed the logs to the address mentioned above, along with the job file, since it's a rather large file and has too much info in it for me to redact. However, an item of interest: the exact same task in another job (same image, same configuration) that does not run with userns_mode = "host" and network_mode = "host" in the docker driver config section worked fine and was restarted properly.

@benvanstaveren
Author

Alright, the email has been sent - hope it gives you some clues. It does seem to be tied to having all, or some combination, of the userns_mode, network_mode, and privileged settings in the docker driver config for a task.
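
For reference, a sketch of the docker driver settings in question; the privileged flag is included only because it is mentioned above as part of the suspected combination and is not taken from the job file:

config {
    image        = "telegraf:alpine"
    network_mode = "host"   # host networking, as in the failing job
    userns_mode  = "host"   # host user namespace, as in the failing job
    privileged   = true     # suspected only; not present in the job file shown above
}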

@benvanstaveren
Author

Okay, sorry for the amount of noise, but here's what I've managed to piece together:

network_mode/userns_mode/privileged apparently isn't the culprit - or it may be, but the job that restarted successfully had labels set in the docker config:

            labels {
                "service_id"            = "${NOMAD_JOB_NAME}/${NOMAD_GROUP_NAME}/${NOMAD_TASK_NAME}"
                "service_job"           = "${NOMAD_JOB_NAME}"
                "log_format"            = "plain"
            }

I have added this same set of labels to the job that failed (the one I sent the details for by email) and tried to reproduce the problem, and nothing happened; the tasks restarted fine, just like the other one.
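
For clarity, a sketch of how that labels block sits inside the docker driver config of the previously failing task; this is assembled from the stanzas quoted above and describes an observed workaround, not a confirmed fix:

config {
    network_mode = "host"
    userns_mode  = "host"
    image        = "telegraf:alpine"
    force_pull   = true

    # Adding this labels block appeared to make signaled restarts succeed again.
    labels {
        "service_id"  = "${NOMAD_JOB_NAME}/${NOMAD_GROUP_NAME}/${NOMAD_TASK_NAME}"
        "service_job" = "${NOMAD_JOB_NAME}"
        "log_format"  = "plain"
    }
}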

@preetapan
Contributor

@benvanstaveren while we keep investigating this, we wanted to let you know that we made our logmon process a bit more reliable with some fixes in 0.9.1-rc, available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. Would you be able to try it out and let us know if you are still seeing problems with restarts?

@benvanstaveren
Author

I can try, but it won't be this week; I'll have to do this over a weekend so I don't interrupt production :)

@endocrimes
Contributor

@benvanstaveren Thank you!

@mpodkolzin

It seems that I'm experiencing the same problem on Windows with both 0.9.0 and 0.9.1-rc1.

@benvanstaveren
Author

Sorry folks, I couldn't pull it off over the weekend; I'll see if I can make an attempt during a weeknight with the rc1. Do I need to update the entire cluster to rc1, or just the machines running the task that causes the issues?

@endocrimes
Contributor

@benvanstaveren just the clients should be fine 👍 - thanks and no problem

@endocrimes
Contributor

@mpodkolzin Do you happen to have logs from the clients and also task events involved? - Thanks!

@benvanstaveren
Author

I noticed the rc1 has been released as 0.9.1, so I'll put in an upgrade and see if I can trigger the error, probably sometime this week; I'll inform you all of the result here :)

@notnoop
Contributor

notnoop commented Apr 30, 2019

@benvanstaveren btw - 0.9.1 included a few fixes in addition to the ones in rc1, namely #5616. Looking at your logs from earlier, I believe this is the case you are hitting. Please try it and let us know.

@stale

stale bot commented May 30, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed May 30, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 22, 2022