update health_check and auto_revert don't seem to work #3016

Closed
tino opened this issue Aug 12, 2017 · 5 comments

tino commented Aug 12, 2017

Nomad version

Nomad v0.6.0-dev (1f3966e+CHANGES)
(from #2969)

Operating system and Environment details

Docker alpine

Issue

With this config:

  update {
    stagger = "10s"
    max_parallel = 1
    # only move forward if nginx starts, so we don't throw everything down with
    # a syntax error.
    health_check = "checks"
    healthy_deadline = "30s"
    auto_revert = true
  }

I expect a failing configuration not to be rolled out across multiple machines, but to be reverted after a single failed attempt.

Reproduction steps

# file ngtest.nomad
job "ngtest2" {
  datacenters = ["NL1"]
  type = "system"

  update {
    stagger = "10s"
    max_parallel = 1
    # only move forward if nginx starts, so we don't throw everything down with
    # a syntax error.
    health_check = "checks"
    healthy_deadline = "30s"
    auto_revert = true
  }


  group "nginx" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.13.3-alpine"
        command = "/usr/sbin/nginx"
        args = ["-c", "/local/nginx.conf", "-g", "daemon off;"]

        port_map {
          http = 80
        }

      }

      service {
        port = "http"
        tags = ["nginx"]
        check = {
          type = "http"
          name = "nginx-status"
          port = "http"
          path = "/nginx_status"
          timeout = "1s"
          interval = "5s"
        }
      }

      template {
        destination = "local/nginx.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
        data = <<EOH
          events {
              worker_connections  2048;
              use                 epoll;
              multi_accept        on;
          }
          http {

              include       /etc/nginx/mime.types;

              server {
                  listen      80  default_server;
                  server_name _;

                  location / {
                      return 200;
                  }

                  location /nginx_status {
                      stub_status on;
                      access_log   off;
                      allow 10.0.0.0/16;
                      deny all;
                  }
              }
          }
        EOH
      }

      resources {
        memory = 100
        cpu = 250
        network {
          port "http" {
            static = 8080
          }
        }
      }
    }
  }
}
  1. First run: nomad run ngtest.nomad
  2. Drop a trailing ; in the nginx.conf template to make it invalid (see the example below)
  3. Run nomad run ngtest.nomad
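
Any missing semicolon in the template data will do for step 2; for example (this particular line is an arbitrary choice), changing

    worker_connections  2048;

to

    worker_connections  2048

makes nginx exit with a configuration error on startup, as seen in the allocation events below.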

=> both end up failing.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

After first run:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:21:16 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       0         2        0       8         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
ef3fa5ba  fd664622  nginx       4        run      running   08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        run      running   08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST

After 2nd run:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:22:15 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       2         0        0       10        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
5aeb7607  bfbeae63  nginx       5        run      pending   08/12/17 21:22:25 CEST
1931f8f7  fd664622  nginx       5        run      pending   08/12/17 21:22:15 CEST
ef3fa5ba  fd664622  nginx       4        stop     complete  08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        stop     complete  08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST
⌘ nomad alloc-status 5aeb
ID                  = 5aeb7607
Eval ID             = 04935ec3
Name                = ngtest2.nginx[0]
Node ID             = bfbeae63
Job ID              = ngtest2
Job Version         = 5
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 08/12/17 21:22:25 CEST

Task "nginx" is "pending"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
250 MHz  100 MiB  300 MiB  0     http: 10.0.0.56:8080

Task Events:
Started At     = 08/12/17 19:22:37 UTC
Finished At    = N/A
Total Restarts = 2
Last Restart   = 08/12/17 19:22:37 UTC

Recent Events:
Time                    Type        Description
08/12/17 21:22:37 CEST  Restarting  Task restarting in 16.94493448s
08/12/17 21:22:37 CEST  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
08/12/17 21:22:37 CEST  Started     Task started by client
08/12/17 21:22:19 CEST  Restarting  Task restarting in 17.547612253s
08/12/17 21:22:19 CEST  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
08/12/17 21:22:19 CEST  Started     Task started by client
08/12/17 21:22:18 CEST  Task Setup  Building Task Directory
08/12/17 21:22:18 CEST  Received    Task received by client

tino commented Aug 12, 2017

Even when I add:

  restart {
    mode = "fail"
  }

nothing is reverted after ending up in "failed" state:

⌘ nomad status ngtest2
ID            = ngtest2
Name          = ngtest2
Submit Date   = 08/12/17 21:28:14 CEST
Type          = system
Priority      = 50
Datacenters   = NL1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       0         0        4       12        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
3102f0ca  bfbeae63  nginx       8        run      failed    08/12/17 21:28:24 CEST
643c6137  fd664622  nginx       8        run      failed    08/12/17 21:28:14 CEST
a24d9dd7  fd664622  nginx       7        stop     complete  08/12/17 21:28:03 CEST
c3b9b8b3  bfbeae63  nginx       7        stop     complete  08/12/17 21:28:03 CEST
5aeb7607  bfbeae63  nginx       6        run      failed    08/12/17 21:22:25 CEST
1931f8f7  fd664622  nginx       6        run      failed    08/12/17 21:22:15 CEST
ef3fa5ba  fd664622  nginx       4        stop     complete  08/12/17 21:21:26 CEST
e02e8b61  bfbeae63  nginx       4        stop     complete  08/12/17 21:21:16 CEST
b3570b1a  fd664622  nginx       3        stop     complete  08/12/17 21:10:17 CEST
d3bcbe09  bfbeae63  nginx       3        stop     complete  08/12/17 21:10:07 CEST
1f8e99f0  bfbeae63  nginx       2        stop     complete  08/12/17 21:09:37 CEST
ba3aa3f2  fd664622  nginx       2        stop     complete  08/12/17 21:09:27 CEST
ef2bf61e  fd664622  nginx       1        stop     complete  08/12/17 21:08:43 CEST
4eb579cd  bfbeae63  nginx       1        stop     complete  08/12/17 21:08:33 CEST
7ebdb3e7  bfbeae63  nginx       0        stop     complete  08/12/17 21:06:54 CEST
e8776f11  fd664622  nginx       0        stop     complete  08/12/17 21:06:54 CEST


dadgar commented Aug 14, 2017

@tino Looks like you are running a system job. Unfortunately this feature is only available on service jobs at the moment. The docs have been updated and the website should be pushed soon.

https://github.com/hashicorp/nomad/blob/master/website/source/docs/job-specification/update.html.md#update-stanza
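
For comparison, a minimal sketch of a job shape where the update stanza's deployment features do take effect, i.e. a service job (the type, count, and min_healthy_time values below are illustrative assumptions, not taken from this issue):

job "ngtest2-svc" {
  datacenters = ["NL1"]
  # Deployments (health_check, auto_revert) only apply to service jobs
  # in this version of Nomad, not to system jobs.
  type = "service"

  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"   # illustrative value
    healthy_deadline = "30s"
    auto_revert      = true
  }

  group "nginx" {
    # A service job runs an explicit number of instances instead of
    # one instance per eligible node like a system job.
    count = 2

    # ... same nginx task, service, template and resources as above ...
  }
}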

dadgar closed this as completed Aug 14, 2017

tino commented Aug 15, 2017

Ah, okay, that explains it!

Is there anything I can do in the meantime to prevent a failing configuration from being deployed everywhere, which is what I was trying to accomplish?

And is this something to expect in a 0.6.x release, or more likely 0.7/0.8?


dadgar commented Aug 15, 2017

@tino You could duplicate the group and add a constraint so that one group runs only on a single node and the other group runs on every node except that one, which essentially gives you a manual canary (see the sketch below). As for bringing the new update stanza to system jobs, it is more likely 0.7/0.8.
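
A rough sketch of that two-group layout (the node name used in the constraints is a placeholder assumption; the real name would come from nomad node-status):

  # Canary group: pinned to one node, so a broken config only lands there.
  group "nginx-canary" {
    constraint {
      attribute = "${node.unique.name}"
      operator  = "="
      value     = "canary-node"    # assumed node name
    }

    # ... same nginx task as in the job above ...
  }

  # Main group: runs on every node except the canary node.
  group "nginx" {
    constraint {
      attribute = "${node.unique.name}"
      operator  = "!="
      value     = "canary-node"
    }

    # ... same nginx task as in the job above ...
  }

Pushing a config change, confirming the canary group stays healthy, and only then letting it roll out everywhere approximates by hand what the deployment machinery does automatically for service jobs.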

github-actions bot commented Dec 10, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022