
Old script health checks come back when upgrading a cluster #6815

Open
cwalsh opened this issue Dec 6, 2019 · 10 comments

Comments

@cwalsh

cwalsh commented Dec 6, 2019

When upgrading a Nomad client in a cluster from 0.9.4 to 0.10.1, we see allocations end up with two script checks registered against them in Consul: the new one, which passes, and an old one, which fails. Allocations running or started on the 0.9.4 nodes in the cluster continue to function normally.

Nomad version

All server nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Some client nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Remaining client nodes: Nomad v0.9.4 (a81aa84)

Operating system and Environment details

Debian Jessie

Issue

Old script-based health checks come back, a problem similar to #5824. It seems to only happen with script checks.

Reproduction steps

1. Start with a cluster running 0.9.4, with a number of jobs with script health checks (a minimal sketch of such a job is shown below).
2. Upgrade the server nodes to 0.10.1.
3. Drain a client node.
4. Upgrade Nomad on the drained client node to 0.10.1.
5. Restart the Nomad job to allow it to run on the newly upgraded node.
6. Wait a few minutes and watch the old script health checks get re-registered against the new allocations.
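A minimal sketch of the kind of job we run (job name, image, and check script are hypothetical, not our actual configuration):

job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }

      resources {
        network {
          port "http" {}
        }
      }

      service {
        name = "example-app"
        port = "http"

        check {
          type     = "script"
          command  = "/usr/local/bin/healthcheck.sh"
          args     = ["${NOMAD_PORT_http}"]
          interval = "30s"
          timeout  = "10s"
        }
      }
    }
  }
}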

I can't replicate this in development mode with a single combined server/client, but it reliably happens on our clusters.

@tgross tgross added this to the near-term milestone Dec 6, 2019
@tgross
Member

tgross commented Dec 6, 2019

Thanks for reporting this @cwalsh! There were some changes to the script check registration process in 0.10.0 to support task group services and Consul Connect, so that seems like a likely candidate to start our investigation into this bug.

@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@chrisboulton

Just as a quick update here (I work with @cwalsh): this doesn't appear to be specific to an upgrade. It appears that under some conditions we're yet to identify, Nomad registers a new duplicate check in Consul with a different check ID. Nomad stops updating the existing check, which causes its TTL to expire and the service to go critical. I believe at this point we've ruled out the upgrade, because we've performed the usual node drain, data directory purge dance.

The only thing I've been able to think of is that under some conditions a new check ID is generated, maybe based on a hash of the check that is missing a couple of properties or has different values from what was used to determine the check ID earlier. We haven't been able to trace it back to a specific check configuration.

@chrisboulton

Actually, I just noticed #6836, #6916, and #7185 -- all related. We'll retest and see where we get to.

@jimmybigcommerce

This is still an issue in 0.10.4

@chrisboulton

@tgross (sorry for the direct @), Jimmy is also part of my team and has spent the most time on this of late, even with Nomad v0.10.4. I don't think this bug is directly related to upgrades anymore, as newly scheduled containers exhibit the same problem even if you reset a node. Unfortunately we don't have any great reproduction instructions yet -- do you have any suggestions on where we might go from here or what might be useful?

@tgross
Member

tgross commented Mar 17, 2020

Can you provide an example jobspec where this happens?

@jimmybigcommerce

Thanks Tim - I emailed an example Jobspec to the nomad-oss-debug email address, also added a little more context. Many thanks in advance!

@tgross
Member

tgross commented Mar 26, 2020

Thanks @jimmybigcommerce, that's a big help. I've had a look at the jobspec JSON you sent and I've pulled out a redacted version and turned it into HCL which I think illustrates the issue:

service {
  port = "myport"

  check {
    type    = "script"
    command = "/usr/lib/check_http"

    args = [
      "-I",
      "localhost",
      "-p",
      "${NOMAD_PORT_myport}",
      "container",
    ]

    interval = "30s"
    timeout  = "15s"
  }
}

As @chrisboulton pointed out above, we use a hash of the service and check to register the IDs:

"maybe based on a hash of the check that is missing a couple of properties or has different values from what was used to determine the check ID earlier."

In 0.10.0 we made changes to support group-level services, which included a rework of script checks so that they could be defined at the group level but executed inside a task container. But there was a bug where services/checks that used interpolation would be registered before interpolation and then executed after interpolation; I fixed that in 0.10.2 (#6586). However, I missed adding the interpolation to script checks, so that was fixed in 0.10.4 (#6916).

However, there's still a lurking gotcha. Now that interpolation happens before registration, checks whose interpolated values vary on a per-allocation basis will end up registering different checks for the same service.

Which is exactly the case we have here with the NOMAD_PORT_myport interpolation in the script check: each allocation can potentially get a different NOMAD_PORT_myport and so ends up registering a different check.
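For illustration (the port values below are hypothetical dynamic port assignments, not taken from the real job), the interpolated check for two allocations of the same job could end up looking like:

# allocation A
check {
  type     = "script"
  command  = "/usr/lib/check_http"
  args     = ["-I", "localhost", "-p", "23712", "container"]
  interval = "30s"
  timeout  = "15s"
}

# allocation B
check {
  type     = "script"
  command  = "/usr/lib/check_http"
  args     = ["-I", "localhost", "-p", "29541", "container"]
  interval = "30s"
  timeout  = "15s"
}

Because the check ID is derived from a hash over the service and check definition, these register as two different checks in Consul.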

We definitely should fix this and make it more intuitive... I think I want to pull in some colleagues (maybe @nickethier and @schmichael?) for their thoughts, as this section of code seems to be a bit of a bug farm.

In the meantime, as a workaround, note that the script check is executed inside the task container, so you should be able to point the script check directly at the port the application is listening on, rather than at ${NOMAD_PORT_myport}.
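A sketch of that change, assuming (hypothetically) that the application listens on port 8080 inside the container:

check {
  type    = "script"
  command = "/usr/lib/check_http"

  # Check the application's in-container port directly (8080 is an assumed
  # value here) instead of ${NOMAD_PORT_myport}, so the check definition is
  # identical across allocations.
  args = [
    "-I",
    "localhost",
    "-p",
    "8080",
    "container",
  ]

  interval = "30s"
  timeout  = "15s"
}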

@chrisboulton

@tgross Thanks a bunch for tracking this down! The explanation makes a lot of sense, especially after the couple of passes over the code in this area I've made of late.

@cwalsh: Tim's workaround here is probably relatively trivial for us to make on our end, and it's a good callout given we know the actual value of the port we need to check against. A change there on our side would let us forge ahead with the Nomad upgrade. Do we want to go down that path for now? I have a feeling the reason we're checking against the mapped port is purely a leftover from days gone by.

@jimmybigcommerce

Thank you @tgross greatly appreciated!

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from the unscheduled milestone Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021