[1.5.0] Job rolling upgrade compatibility failure #16307

Closed
chenjpu opened this issue Mar 3, 2023 · 8 comments · Fixed by #16402

Comments

@chenjpu

chenjpu commented Mar 3, 2023

Nomad version

1.5.0

Issue

1. Upgrade from 1.4.4 to 1.5.0
2. Deploy a new version of the job
3. The rolling update fails (see the attached screenshot)

job

job "srv-third" {
  datacenters = ["dc1"]
  type        = "service"
  namespace   = "ai-dev"

  update {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    auto_revert      = true
    auto_promote     = true
    canary           = 1
  }

  reschedule {
    attempts       = 15
    interval       = "1h"
    delay          = "15s"
    delay_function = "exponential"
    max_delay      = "120s"
    unlimited      = false
  }

  group "service" {

    restart {
      interval = "3m"
      attempts = 3
      delay    = "15s"
      mode     = "delay"
    }

    service {
      name         = "${NOMAD_JOB_NAME}"
      port         = "http"
      address_mode = "host"

      check {
        type           = "http"
        port           = "http"
        path           = "/v1.0/healthz"
        interval       = "12s"
        timeout        = "6s"

        check_restart {
          limit           = 3
          grace           = "10s"
        }
      }
    }

    task "app" {
      driver = "docker"
      config {
        image   = "alpine:3.15"
        command = "local/app"
        ports   = ["app"]
        args = [

        ]
      }
    }


    task "daprd" {
      lifecycle {
        hook = "poststart"
        sidecar = true
      }
      driver = "docker"

      config {
        image   = "alpine:3.15"
        ports   = ["http", "grpc"]
        command = "local/daprd"
        args = [
        ...
        ]
      }
    }
  }
}
@chenjpu
Author

chenjpu commented Mar 3, 2023

If I specify health_check = "task_states" in the update block, the rolling upgrade works normally. :)
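
For anyone hitting the same symptom, a minimal sketch of that workaround, assuming the update block from the job above (only the health_check line is new; check the Nomad update block documentation for your version before relying on it):

update {
  max_parallel     = 1
  min_healthy_time = "10s"
  healthy_deadline = "3m"
  auto_revert      = true
  auto_promote     = true
  canary           = 1
  health_check     = "task_states"  # gate deployment health on task states instead of Consul checks
}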

@gmichalec-pandora

gmichalec-pandora commented Mar 3, 2023

We are seeing a similar issue after upgrading our clients to 1.5.0.

Here is our job spec:

{
  "job": {
    "savagecloud-monitoring-integration": {
      "constraint": {
        "attribute": "${attr.kernel.name}",
        "value": "linux"
      },
      "datacenters": [
        "integration"
      ],
      "group": [
        {
          "savagecloud-monitoring-group": {
            "count": 2,
            "restart": {
              "attempts": 2,
              "delay": "25s",
              "interval": "2m",
              "mode": "fail"
            },
            "task": [
              {
                "savagecloud-monitoring": {
                  "config": {
                    "image": "harbor-registry.savagebeast.com/savagecloud-monitoring/savagecloud-monitoring:master-latest",
                    "port_map": {
                      "web": 3000
                    }
                  },
                  "driver": "docker",
                  "env": {
                    "COLO": "integration"
                  },
                  "resources": {
                    "cpu": 300,
                    "memory": 100,
                    "network": {
                      "mbits": 1,
                      "port": {
                        "web": {}
                      }
                    }
                  },
                  "service": [
                    {
                      "canary_tags": [
                        "project--savagecloud-monitoring",
                        "docker_tag--master-latest",
                        "nomad_job--${NOMAD_JOB_NAME}",
                        "nomad-alloc--${NOMAD_ALLOC_ID}",
                        "nomad_task--${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}",
                        "canary--true"
                      ],
                      "check": {
                        "interval": "10s",
                        "name": "${NOMAD_TASK_NAME} https",
                        "path": "/",
                        "protocol": "https",
                        "timeout": "2s",
                        "type": "http"
                      },
                      "name": "savagecloud-monitoring",
                      "port": "web",
                      "tags": [
                        "env--integration",
                        "integration",
                        "urlprefix-savagecloud-monitoring.docker.integration.savagebeast.com/",
                        "probe-check-https--savagecloud-monitoring.docker.integration.savagebeast.com",
                        "project--savagecloud-monitoring",
                        "docker_tag--master-latest",
                        "nomad_job--${NOMAD_JOB_NAME}",
                        "nomad-alloc--${NOMAD_ALLOC_ID}",
                        "nomad_task--${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}"
                      ]
                    }
                  ],
                  "template": [
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.certificate }}\n{{ range .Data.ca_chain }}{{ . }}\n{{ end }}{{ end }}",
                      "destination": "/secrets/service.consul.crt",
                      "splay": "1h"
                    },
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.private_key }}{{ end }}",
                      "destination": "/secrets/service.consul.key",
                      "splay": "1h"
                    }
                  ],
                  "vault": {
                    "policies": [
                      "savagecloud-monitoring"
                    ]
                  }
                }
              }
            ]
          }
        }
      ],
      "meta": {
        "BRANCH": "master",
        "DEPLOY_MODE": "ci",
        "DOCKER_TAG": "master-latest",
        "GIT_REPO": "ssh://git@bitbucket.savagebeast.com:2222/sysad/savagecloud-monitoring.git",
        "LEVANT_FILE": "integration.yml",
        "NOMAD_FILE": "nomad.hcl",
        "PROJECT": "savagecloud-monitoring"
      },
      "namespace": "savagecloud-monitoring",
      "region": "integration",
      "type": "service",
      "update": {
        "auto_revert": false,
        "healthy_deadline": "30s",
        "max_parallel": 1,
        "min_healthy_time": "2s",
        "stagger": "30s"
      }
    }
  }
}

And here are the relevant client logs:

Mar 02 22:33:00 dc6-docker4 nomad[11706]: {"@level":"info","@message":"created container","@module":"client.driver_mgr.docker","@timestamp":"2023-03-02T22:33:00.670572-08:00","container_id":"ace752d3b334879e7563bd236e20b37285d0ccd53ccb3d12ea0e6c0932bfc41e","driver":"docker"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:01.271926-08:00","duration":805120,"method":"GET","path":"/v1/agent/self"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:01.287578-08:00","duration":11104645,"method":"GET","path":"/v1/node/06aceb1e-74db-4e5f-3282-87e3ecacb7d1/allocations"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"info","@message":"started container","@module":"client.driver_mgr.docker","@timestamp":"2023-03-02T22:33:01.382445-08:00","container_id":"ace752d3b334879e7563bd236e20b37285d0ccd53ccb3d12ea0e6c0932bfc41e","driver":"docker"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"warn","@message":"plugin configured with a nil SecureConfig","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2023-03-02T22:33:01.382544-08:00","driver":"docker"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"starting plugin","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2023-03-02T22:33:01.382560-08:00","args":["/usr/local/bin/nomad","docker_logger"],"driver":"docker","path":"/usr/local/bin/nomad"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"plugin started","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2023-03-02T22:33:01.383340-08:00","driver":"docker","path":"/usr/local/bin/nomad","pid":17976}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"waiting for RPC address","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2023-03-02T22:33:01.383414-08:00","driver":"docker","path":"/usr/local/bin/nomad"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"plugin address","@module":"docker_logger","@timestamp":"2023-03-02T22:33:01.404340-08:00","address":"/tmp/plugin2392900290","driver":"docker","network":"unix","timestamp":"2023-03-02T22:33:01.404-0800"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"using plugin","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2023-03-02T22:33:01.404509-08:00","driver":"docker","version":2}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"using client connection initialized from environment","@module":"docker_logger","@timestamp":"2023-03-02T22:33:01.405622-08:00","driver":"docker","timestamp":"2023-03-02T22:33:01.405-0800"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-03-02T22:33:01.457554-08:00","alloc_id":"2cc0bf7a-6397-678b-d96c-850612662efd","failed":false,"msg":"Task started by client","task":"savagecloud-monitoring","type":"Started"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:01.483314-08:00","duration":1000090,"method":"GET","path":"/v1/allocation/2cc0bf7a-6397-678b-d96c-850612662efd"}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"sync complete","@module":"consul.sync","@timestamp":"2023-03-02T22:33:01.635170-08:00","deregistered_checks":0,"deregistered_services":0,"registered_checks":0,"registered_services":1}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2023-03-02T22:33:01.719459-08:00","filtered":22,"index":35241501,"pulled":0,"total":22}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2023-03-02T22:33:01.719650-08:00","added":0,"ignored":22,"removed":0,"updated":0}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2023-03-02T22:33:01.719727-08:00","added":0,"errors":0,"ignored":22,"removed":0,"updated":0}
Mar 02 22:33:01 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:01.730833-08:00","duration":6612904209,"method":"GET","path":"/v1/node/06aceb1e-74db-4e5f-3282-87e3ecacb7d1/allocations?index=35241494"}
Mar 02 22:33:02 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:02.550210-08:00","duration":3937498,"method":"GET","path":"/v1/metrics?format=prometheus"}
Mar 02 22:33:05 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:05.577322-08:00","duration":2151289,"method":"GET","path":"/v1/agent/health?type=client"}
Mar 02 22:33:15 dc6-docker4 nomad[11706]: {"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-03-02T22:33:15.578316-08:00","duration":161490,"method":"GET","path":"/v1/agent/health?type=client"}
Mar 02 22:33:21 dc6-docker4 nomad[11706]: {"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-03-02T22:33:21.715708-08:00","alloc_id":"2cc0bf7a-6397-678b-d96c-850612662efd","failed":false,"msg":"Task not running for min_healthy_time of 2s by healthy_deadline of 30s","task":"savagecloud-monitoring","type":"Alloc Unhealthy"}

As far as I can tell, the Consul health checks are passing the entire time without issue (from Consul's point of view). I'm not seeing anything in our Consul logs regarding this alloc's health checks.

@gmichalec-pandora

gmichalec-pandora commented Mar 3, 2023

Interesting: I updated my job spec to remove the deprecated task-level network resources and move them into a group-level network block, and now the job is working as expected. We knew the old syntax was deprecated, but it kept working correctly up until this version. It might be nice to add a note about this to the upgrade guide.

Updated (working) job spec:

{
  "job": {
    "savagecloud-monitoring-integration": {
      "constraint": {
        "attribute": "${attr.kernel.name}",
        "value": "linux"
      },
      "datacenters": [
        "integration"
      ],
      "group": [
        {
          "savagecloud-monitoring-group": {
            "count": 2,
            "network": {
              "port": [
                {
                  "web": {
                    "to": "3000"
                  }
                }
              ]
            },
            "restart": {
              "attempts": 2,
              "delay": "25s",
              "interval": "2m",
              "mode": "fail"
            },
            "service": [
              {
                "canary_tags": [
                  "project--savagecloud-monitoring",
                  "docker_tag--master-latest",
                  "nomad_job--${NOMAD_JOB_NAME}",
                  "nomad-alloc--${NOMAD_ALLOC_ID}",
                  "canary--true"
                ],
                "check": {
                  "interval": "10s",
                  "name": "savagecloud-monitoring https",
                  "path": "/",
                  "protocol": "https",
                  "timeout": "2s",
                  "type": "http"
                },
                "name": "savagecloud-monitoring",
                "port": "web",
                "tags": [
                  "env--integration",
                  "integration",
                  "urlprefix-savagecloud-monitoring.docker.integration.savagebeast.com/",
                  "probe-check-https--savagecloud-monitoring.docker.integration.savagebeast.com",
                  "project--savagecloud-monitoring",
                  "docker_tag--master-latest",
                  "nomad_job--${NOMAD_JOB_NAME}",
                  "nomad-alloc--${NOMAD_ALLOC_ID}"
                ]
              }
            ],
            "task": [
              {
                "savagecloud-monitoring": {
                  "config": {
                    "image": "harbor-registry.savagebeast.com/savagecloud-monitoring/savagecloud-monitoring:master-latest",
                    "ports": [
                      "web"
                    ]
                  },
                  "driver": "docker",
                  "env": {
                    "COLO": "integration"
                  },
                  "resources": {
                    "cpu": 300,
                    "memory": 100
                  },
                  "template": [
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.certificate }}\n{{ range .Data.ca_chain }}{{ . }}\n{{ end }}{{ end }}",
                      "destination": "/secrets/service.consul.crt",
                      "splay": "1h"
                    },
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.private_key }}{{ end }}",
                      "destination": "/secrets/service.consul.key",
                      "splay": "1h"
                    }
                  ],
                  "vault": {
                    "policies": [
                      "savagecloud-monitoring"
                    ]
                  }
                }
              }
            ]
          }
        }
      ],
      "meta": {
        "BRANCH": "master",
        "DEPLOY_MODE": "ci",
        "DOCKER_TAG": "master-latest",
        "GIT_REPO": "ssh://git@bitbucket.savagebeast.com:2222/sysad/savagecloud-monitoring.git",
        "LEVANT_FILE": "integration.yml",
        "NOMAD_FILE": "nomad.hcl",
        "PROJECT": "savagecloud-monitoring"
      },
      "namespace": "savagecloud-monitoring",
      "region": "integration",
      "type": "service",
      "update": {
        "auto_revert": false,
        "healthy_deadline": "30s",
        "max_parallel": 1,
        "min_healthy_time": "2s",
        "stagger": "30s"
      }
    }
  }
}
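
For readers more familiar with HCL than the JSON form, the change amounts to roughly the sketch below. This is a hedged reconstruction of the diff between the two JSON specs above, not the exact file used; service, check, template, and vault details are abridged:

group "savagecloud-monitoring-group" {
  # group-level network block replaces the deprecated task-level resources.network
  network {
    port "web" {
      to = 3000
    }
  }

  task "savagecloud-monitoring" {
    driver = "docker"

    config {
      image = "harbor-registry.savagebeast.com/savagecloud-monitoring/savagecloud-monitoring:master-latest"
      ports = ["web"]  # replaces the old port_map in the Docker driver config
    }

    resources {
      cpu    = 300
      memory = 100
      # no network block here any more
    }
  }
}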

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 3, 2023
@lgfa29
Contributor

lgfa29 commented Mar 8, 2023

Hi @chenjpu and @gmichalec-pandora 👋

I have not been able to reproduce this yet. Could you describe how you performed the upgrade? Did you update servers before clients? And which version of Consul are you running?

@lgfa29
Contributor

lgfa29 commented Mar 8, 2023

Ah! I think this may be related to #16382. I didn't notice that you also have a dynamic service name, name = "${NOMAD_JOB_NAME}".

@gmichalec-pandora do you happen to also have a dynamic service name? Or was the network block update enough to fix the problem for you?
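
For context, the interpolated name in the first job spec on this issue looks like the sketch below; the service block is abridged and the literal alternative is only an illustration:

service {
  # dynamic, interpolated at run time (as in chenjpu's job above):
  name = "${NOMAD_JOB_NAME}"

  # a literal name contains no runtime interpolation:
  # name = "srv-third"
}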

@gmichalec-pandora

Oh, yes! I just tested by leaving the service/network config at the task level but removing the ${NOMAD_TASK_NAME} from the check name in my service config, and the deploy worked fine! Good eye on identifying that as the issue!

@gmichalec-pandora

To be clear, my service name was not dynamic, but the check name was:

@@ -59,7 +59,7 @@ job "[[.job.name]]" {
         tags = ["env--[[.tags.env]]", "[[.tags.env]]", "urlprefix-savagecloud-monitoring.docker.[[.job.region]].savagebeast.com/", "probe-check-https--savagecloud-monitoring.docker.[[.job.region]].savagebeast.com"]
         port = "web"
         check {
-          name = "${NOMAD_TASK_NAME} https"
+          name = "savagecloud-monitoring https"
           type = "http"
           interval = "10s"
           timeout = "2s"

@lgfa29
Contributor

lgfa29 commented Mar 9, 2023

Ah nice, thanks for the confirmation. I have a PR up to fix this problem 👍
