only one task group updates with rolling deploy #3000

Closed
goedelsoup opened this issue Aug 9, 2017 · 11 comments

@goedelsoup

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu 16.04

Issue

The following job only updates one task of the three deployed:

job "tricorder" {

	meta {
		git_sha = "4cd5c40d4cf0a36bda3479bd859118afcf35e19b2"
	}

	datacenters = [ "us-east-1" ]

	type = "service"

	update {
		max_parallel = 1
	}

	group "web" {

		count = 3

		task "api" {
			driver = "docker"

			env {
				APPLICATION_ENV = "staging"
				JAVA_OPTS = "-Dlogback.configurationFile=/etc/cota/logback.xml"
			}

			config {
				image = "cotalabs/tricorder:1.1.0-SNAPSHOT"
				force_pull = true

				volumes = [ 
					"new/tricorder.conf:/etc/cota/tricorder/tricorder.conf",
					"new/secrets.conf:/etc/cota/secrets.conf",
					"new/logback.xml:/etc/cota/logback.xml"
				]

				port_map {
					http = 9090
				}
			}

			artifact {
				source = "###/tricorder.ctmpl"
			}

			template {
				source = "local/tricorder.ctmpl"
				destination = "new/tricorder.conf"
			}

			artifact {
				source = "###/secrets.ctmpl"
			}

			template {
				source = "local/secrets.ctmpl"
				destination = "new/secrets.conf"
			}

			artifact {
				source = "###/files/logback.xml"
			}

			vault {
				policies = [ "tricorder" ]
			}

			resources {
				cpu    = 500 # MHz
				memory = 2048 # MB

				network {
					mbits = 20

					# Dynamic port allocation
					port "http" {}
				}
			}

			service {
				port = "http"

				tags = [ "urlprefix-###/" ]

				check {
					type     = "http"
					path     = "/status"
					interval = "5s"
					timeout  = "1s"
				}
			}
		}
	}
}

Since this is our staging environment, we keep updating the same snapshot build and force the image pull via force_pull. To force a change in the job definition, we tried injecting the Git commit SHA into a meta field.

The job status shows the following (only one allocation is bumped to the new job version unless we first stop the job):

ID            = tricorder
Name          = tricorder
Submit Date   = 08/09/17 18:04:06 EDT
Type          = service
Priority      = 50
Datacenters   = us-east-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         3        2       47        0

Latest Deployment
ID          = 0e4f72c9
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
web         3        1       0        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
91cb9c24  ec2f14d6  web         86       run      running   08/09/17 18:04:07 EDT
3f0bba0f  ec24e95b  web         85       run      running   08/09/17 18:03:25 EDT
74250188  ec22785b  web         85       stop     complete  08/09/17 18:03:25 EDT
7886d641  ec2bbab7  web         85       run      running   08/09/17 18:03:25 EDT
4943df44  ec2f14d6  web         83       stop     complete  08/09/17 18:00:50 EDT
680340c4  ec24e95b  web         82       stop     complete  08/09/17 17:59:46 EDT
9358cb66  ec2bbab7  web         82       stop     complete  08/09/17 17:59:46 EDT
f79fc41f  ec22785b  web         82       stop     complete  08/09/17 17:59:46 EDT
a9262422  ec24e95b  web         80       stop     complete  08/09/17 17:58:43 EDT
c55eeda6  ec24e95b  web         79       stop     complete  08/09/17 17:31:59 EDT
a36db08e  ec22785b  web         78       stop     complete  08/09/17 17:16:04 EDT
c292c2d8  ec22785b  web         77       stop     complete  08/09/17 16:13:17 EDT
d6b7ac1b  ec2bbab7  web         77       stop     complete  08/09/17 16:13:17 EDT
3f2b7e81  ec2f14d6  web         77       stop     complete  08/09/17 16:13:17 EDT
@goedelsoup
Author

I'm noticing now that these tasks are never marked as healthy when inspected from the Nomad client; however, the service checks are passing in Consul.

@dadgar
Contributor

dadgar commented Aug 10, 2017

Hey, only one would be deployed at a time because of MaxParallel = 1.
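
For reference, here is a sketch of how the update stanza from the job above might look with its health-related settings written out. The values other than max_parallel are assumptions (they are intended to mirror the documented Nomad 0.6 defaults, but check the update stanza documentation for your version). With max_parallel = 1, one allocation is replaced at a time, and the deployment does not move on to the next one until the replaced allocation is reported healthy.

```hcl
update {
	# From the job in this issue: replace one allocation at a time.
	max_parallel = 1

	# Assumed values, written out for illustration only.
	health_check     = "checks" # health is derived from the service checks registered in Consul
	min_healthy_time = "10s"    # how long an allocation must stay healthy before it counts
	healthy_deadline = "5m"     # after this, an allocation that is not healthy is marked unhealthy
	auto_revert      = false    # do not roll back to the last stable version automatically
}
```

If the allocation never reports healthy, as in the status output above (Placed 1, Healthy 0), the deployment stays at one placed allocation.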

Do you have any interpolations in your check? That might mean you are getting hit by this: #2984

If not, you can change your client's log level to TRACE, and information about why it hasn't transitioned to healthy will be printed. Nomad v0.6.1 will emit events to make debugging this easier.
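
A minimal sketch of the log level change described above, assuming the client agent is started from a configuration file. The client stanza shown is illustrative, and the rest of your existing agent configuration would stay as it is.

```hcl
# Hypothetical client agent configuration file, passed to
# `nomad agent -config=<file>`. Only log_level is the relevant change.
log_level = "TRACE"

client {
	enabled = true
}
```

The same level can also be set on the command line with the agent's -log-level flag instead of editing the file.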

@goedelsoup
Author

No interpolation in the status check. Simply an unauthenticated test against a health endpoint of an HTTP service.

Switching to the TRACE log level didn't produce much that was meaningful, mostly periodic Consul KV lookups. Any guidance on what I should be looking for?

@dadgar
Contributor

dadgar commented Aug 10, 2017

@goedelsoup
Author

Resolved. We missed a provisioning process, so our clients were still on v0.5.6, and I don't think the logging came in until v0.6.0. Since this appears to be a compatibility break between 0.5 and 0.6, it'd be great if the 0.6 server emitted warn logs when this scenario occurs. Thanks for the guidance on working through this.

@dadgar
Contributor

dadgar commented Aug 10, 2017

Glad you figured it out @goedelsoup

@dadgar dadgar closed this as completed Aug 10, 2017
@tino

tino commented Aug 10, 2017

I'm running into the same thing with a server and two clients, all on 0.6.0. If I understand correctly, I'm hitting #2984, right?

check {
    type = "script"
    # Direct curl somehow doesn't work?
    command = "/bin/bash"
    args = ["-c", "curl -sf -H 'Host: mysite.nl' ${NOMAD_ADDR_http}/status/"]
    interval = "10s"
    timeout = "2s"
}

When are patches usually released?

@dadgar
Contributor

dadgar commented Aug 10, 2017

@tino 0.6.1 will be coming out within 2 weeks. I will push a binary to that issue that resolves this for those who are hitting it.

@tino

tino commented Aug 10, 2017

Thanks. What do you mean by "push a binary to that issue"?

@dadgar
Contributor

dadgar commented Aug 10, 2017 via email

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022