update health_check and auto_revert don't seem to work #3016

Closed
tino opened this issue Aug 12, 2017 · 5 comments

tino commented Aug 12, 2017

Nomad version

Nomad v0.6.0-dev (1f3966e+CHANGES)
(from #2969)

Operating system and Environment details

Docker alpine

Issue

With this config:

  update {
    stagger = "10s"
    max_parallel = 1
    # only move forward if nginx starts, so we don't throw everything down with
    # a syntax error.
    health_check = "checks"
    healthy_deadline = "30s"
    auto_revert = true
  }

I expect a failing configuration not to be rolled out across multiple machines, but to be reverted after a single failed attempt.

Reproduction steps

# file ngtest.nomad
job "ngtest2" {
  datacenters = ["NL1"]
  type = "system"

  update {
    stagger = "10s"
    max_parallel = 1
    # only move forward if nginx starts, so we don't throw everything down with
    # a syntax error.
    health_check = "checks"
    healthy_deadline = "30s"
    auto_revert = true
  }


  group "nginx" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.13.3-alpine"
        command = "/usr/sbin/nginx"
        args = ["-c", "/local/nginx.conf", "-g", "daemon off;"]

        port_map {
          http = 80
        }

      }

      service {
        port = "http"
        tags = ["nginx"]
        check = {
          type = "http"
          name = "nginx-status"
          port = "http"
          path = "/nginx_status"
          timeout = "1s"
          interval = "5s"
        }
      }

      template {
        destination = "local/nginx.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<EOH
          events {
              worker_connections  2048;
              use                 epoll;
              multi_accept        on;
          }
          http {

              include       /etc/nginx/mime.types;

              server {
                  listen      80  default_server;
                  server_name _;

                  location / {
                      return 200;
                  }

                  location /nginx_status {
                      stub_status on;
                      access_log   off;
                      allow 10.0.0.0/16;
                      deny all;
                  }
              }
          }
        EOH
      }

      resources {
        memory = 100
        cpu = 250
        network {
          port "http" {
            static = 8080
          }
        }
      }
    }
  }
}
  1. First run: nomad run ngtest.nomad
  2. Drop a trailing ; in the nginx.conf template to make it invalid (see the example below)
  3. Run nomad run ngtest.nomad
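
Any missing semicolon in the template data will do for step 2; for example (this particular line is an arbitrary choice), changing

    worker_connections  2048;

to

    worker_connections  2048

makes nginx exit with a configuration error on startup, as seen in the allocation events below.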

=> both end up failing.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

After first run:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:21:16 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       0         2        0       8         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
ef3fa5ba  fd664622  nginx       4        run      running   08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        run      running   08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST

After 2nd run:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:22:15 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       2         0        0       10        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
5aeb7607  bfbeae63  nginx       5        run      pending   08/12/17 21:22:25 CEST
1931f8f7  fd664622  nginx       5        run      pending   08/12/17 21:22:15 CEST
ef3fa5ba  fd664622  nginx       4        stop     complete  08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        stop     complete  08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST
⌘ nomad alloc-status 5aeb
ID                  = 5aeb7607
Eval ID             = 04935ec3
Name                = ngtest2.nginx[0]
Node ID             = bfbeae63
Job ID              = ngtest2
Job Version         = 5
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 08/12/17 21:22:25 CEST

Task "nginx" is "pending"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
250 MHz  100 MiB  300 MiB  0     http: 10.0.0.56:8080

Task Events:
Started At     = 08/12/17 19:22:37 UTC
Finished At    = N/A
Total Restarts = 2
Last Restart   = 08/12/17 19:22:37 UTC

Recent Events:
Time                    Type        Description
08/12/17 21:22:37 CEST  Restarting  Task restarting in 16.94493448s
08/12/17 21:22:37 CEST  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
08/12/17 21:22:37 CEST  Started     Task started by client
08/12/17 21:22:19 CEST  Restarting  Task restarting in 17.547612253s
08/12/17 21:22:19 CEST  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
08/12/17 21:22:19 CEST  Started     Task started by client
08/12/17 21:22:18 CEST  Task Setup  Building Task Directory
08/12/17 21:22:18 CEST  Received    Task received by client

tino commented Aug 12, 2017

Even when I add:

  restart {
    mode = "fail"
  }

nothing is reverted after ending up in "failed" state:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:28:14 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       0         0        4       12        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
3102f0ca  bfbeae63  nginx       8        run      failed    08/12/17 21:28:24 CEST
643c6137  fd664622  nginx       8        run      failed    08/12/17 21:28:14 CEST
a24d9dd7  fd664622  nginx       7        stop     complete  08/12/17 21:28:03 CEST
c3b9b8b3  bfbeae63  nginx       7        stop     complete  08/12/17 21:28:03 CEST
5aeb7607  bfbeae63  nginx       6        run      failed    08/12/17 21:22:25 CEST
1931f8f7  fd664622  nginx       6        run      failed    08/12/17 21:22:15 CEST
ef3fa5ba  fd664622  nginx       4        stop     complete  08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        stop     complete  08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST


dadgar commented Aug 14, 2017

@tino Looks like you are running a system job. Unfortunately this feature is only available on service jobs at the moment. The docs have been updated and the website should be pushed soon.

https://github.com/hashicorp/nomad/blob/master/website/source/docs/job-specification/update.html.md#update-stanza
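
For comparison, a minimal sketch of a job shape where the update stanza's deployment features do take effect, i.e. a service job (the type, count, and min_healthy_time values below are illustrative assumptions, not taken from this issue):

job "ngtest2-svc" {
  datacenters = ["NL1"]
  # Deployments (health_check, auto_revert) only apply to service jobs
  # in this version of Nomad, not to system jobs.
  type = "service"

  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"   # illustrative value
    healthy_deadline = "30s"
    auto_revert      = true
  }

  group "nginx" {
    # A service job runs an explicit number of instances instead of
    # one instance per eligible node like a system job.
    count = 2

    # ... same nginx task, service, template and resources as above ...
  }
}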

dadgar closed this as completed Aug 14, 2017

tino commented Aug 15, 2017

Ah, okay, that explains it!

Is there anything I can do in the meantime to prevent a failing configuration from being deployed everywhere, which is what I was trying to accomplish?

And is this something to expect in a 0.6.x release, or more likely 0.7/0.8?


dadgar commented Aug 15, 2017

@tino You could duplicate the group and add a constraint so that one group runs only on a single node and the other group runs on every node except that one, which essentially gives you a manual canary (see the sketch below). As for bringing the new update stanza to system jobs, it is more likely 0.7/0.8.
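
A rough sketch of that two-group layout (the node name used in the constraints is a placeholder assumption; the real name would come from nomad node-status):

  # Canary group: pinned to one node, so a broken config only lands there.
  group "nginx-canary" {
    constraint {
      attribute = "${node.unique.name}"
      operator  = "="
      value     = "canary-node"    # assumed node name
    }

    # ... same nginx task as in the job above ...
  }

  # Main group: runs on every node except the canary node.
  group "nginx" {
    constraint {
      attribute = "${node.unique.name}"
      operator  = "!="
      value     = "canary-node"
    }

    # ... same nginx task as in the job above ...
  }

Pushing a config change, confirming the canary group stays healthy, and only then letting it roll out everywhere approximates by hand what the deployment machinery does automatically for service jobs.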

github-actions bot commented Dec 10, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022