
Old script health checks come back when upgrading a cluster #6815

Open
cwalsh opened this issue Dec 6, 2019 · 10 comments

Comments

@cwalsh

cwalsh commented Dec 6, 2019

When upgrading a Nomad client in a cluster from 0.9.4 to 0.10.1, we see allocations end up with two script checks registered against them in Consul: the new one, which passes, and an old one, which fails. Allocations running or started on the 0.9.4 nodes in the cluster continue to function normally.

Nomad version

All server nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Some client nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Remaining client nodes: Nomad v0.9.4 (a81aa84)

Operating system and Environment details

Debian Jessie

Issue

Old script-based health checks come back, a problem similar to #5824. It seems to only happen with script checks.

Reproduction steps

1. Start with a cluster running 0.9.4, with a number of jobs with script health checks (a minimal sketch of such a job is shown below).
2. Upgrade the server nodes to 0.10.1.
3. Drain a client node.
4. Upgrade Nomad on the drained client node to 0.10.1.
5. Restart the Nomad job to allow it to run on the newly upgraded node.
6. Wait a few minutes and watch the old script health checks get re-registered against the new allocations.
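A minimal sketch of the kind of job we run (job name, image, and check script are hypothetical, not our actual configuration):

job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }

      resources {
        network {
          port "http" {}
        }
      }

      service {
        name = "example-app"
        port = "http"

        check {
          type     = "script"
          command  = "/usr/local/bin/healthcheck.sh"
          args     = ["${NOMAD_PORT_http}"]
          interval = "30s"
          timeout  = "10s"
        }
      }
    }
  }
}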

I can't replicate this in development mode with a single combined server/client, but it reliably happens on our clusters.

@tgross tgross added this to the near-term milestone Dec 6, 2019
@tgross
Member

tgross commented Dec 6, 2019

Thanks for reporting this @cwalsh! There were some changes to the script check registration process in 0.10.0 to support task group services and Consul Connect, so that seems like a likely candidate to start our investigation into this bug.

@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@chrisboulton

Just as a quick update here (I work with @cwalsh): this doesn't appear to be specific to an upgrade. It appears that under some conditions we're yet to identify, Nomad registers a new duplicate check in Consul with a different check ID. Nomad stops updating the existing check, which causes its TTL to expire and the service to go critical. I believe at this point we've ruled out the upgrade, because we've performed the usual node drain, data directory purge dance.

The only thing I've been able to think of is that under some conditions a new check ID is generated, maybe based on a hash of the check that is missing a couple of properties or has different values from what was used to determine the check ID earlier. We haven't been able to trace it back to a specific check configuration.

@chrisboulton

Actually, I just noticed #6836, #6916, and #7185 -- all related. We'll retest and see where we get to.

@jimmybigcommerce

This is still an issue in 0.10.4

@chrisboulton

@tgross (sorry for the direct @), Jimmy is also part of my team and has spent the most time on this of late, even with Nomad v0.10.4. I don't think this bug is directly related to upgrades anymore, as newly scheduled containers exhibit the same problem even if you reset a node. Unfortunately we don't have any great reproduction instructions yet -- do you have any suggestions on where we might go from here or what might be useful?

@tgross
Member

tgross commented Mar 17, 2020

Can you provide an example jobspec where this happens?

@jimmybigcommerce

Thanks Tim - I emailed an example Jobspec to the nomad-oss-debug email address, also added a little more context. Many thanks in advance!

@tgross
Member

tgross commented Mar 26, 2020

Thanks @jimmybigcommerce, that's a big help. I've had a look at the jobspec JSON you sent and I've pulled out a redacted version and turned it into HCL which I think illustrates the issue:

service {
  port = "myport"

  check {
    type    = "script"
    command = "/usr/lib/check_http"

    args = [
      "-I",
      "localhost",
      "-p",
      "${NOMAD_PORT_myport}",
      "container",
    ]

    interval = "30s"
    timeout  = "15s"
  }
}

As @chrisboulton pointed out above, we use a hash of the service and check to register the IDs:

"maybe based on a hash of the check that is missing a couple of properties or has different values from what was used to determine the check ID earlier."

In 0.10.0 we made changes to support group-level services, which included a rework of script checks so that they could be defined at the group level but executed inside a task container. But there was a bug where services/checks that used interpolation would be registered before interpolation and then executed after interpolation; I fixed that in 0.10.2 (#6586). However, I missed adding the interpolation to script checks, so that was fixed in 0.10.4 (#6916).

However, there's still a lurking gotcha. Now that interpolation happens before registration, checks whose interpolated values vary on a per-allocation basis will end up registering different checks for the same service.

Which is exactly the case we have here with the NOMAD_PORT_myport interpolation in the script check: each allocation can potentially get a different NOMAD_PORT_myport and so ends up registering a different check.
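For illustration (the port values below are hypothetical dynamic port assignments, not taken from the real job), the interpolated check for two allocations of the same job could end up looking like:

# allocation A
check {
  type     = "script"
  command  = "/usr/lib/check_http"
  args     = ["-I", "localhost", "-p", "23712", "container"]
  interval = "30s"
  timeout  = "15s"
}

# allocation B
check {
  type     = "script"
  command  = "/usr/lib/check_http"
  args     = ["-I", "localhost", "-p", "29541", "container"]
  interval = "30s"
  timeout  = "15s"
}

Because the check ID is derived from a hash over the service and check definition, these register as two different checks in Consul.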

We definitely should fix this and make it more intuitive... I think I want to pull in some colleagues (maybe @nickethier and @schmichael?) for their thoughts, as this section of code seems to be a bit of a bug farm.

In the meantime, as a workaround, note that the script check is executed inside the task container, so you should be able to point the script check directly at the port the application is listening on, rather than at ${NOMAD_PORT_myport}.
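A sketch of that change, assuming (hypothetically) that the application listens on port 8080 inside the container:

check {
  type    = "script"
  command = "/usr/lib/check_http"

  # Check the application's in-container port directly (8080 is an assumed
  # value here) instead of ${NOMAD_PORT_myport}, so the check definition is
  # identical across allocations.
  args = [
    "-I",
    "localhost",
    "-p",
    "8080",
    "container",
  ]

  interval = "30s"
  timeout  = "15s"
}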

@chrisboulton

@tgross Thanks a bunch for tracking this down! The explanation makes a lot of sense, especially after the couple of passes over the code in this area I've made of late.

@cwalsh: Tim's workaround here is probably relatively trivial for us to make on our end, and it's a good callout given we know the actual value of the port we need to check against. A change there on our side would let us forge ahead with the Nomad upgrade. Do we want to go down that path for now? I have a feeling the reason we're checking against the mapped port is purely a leftover from days gone by.

@jimmybigcommerce

Thank you @tgross greatly appreciated!

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from the unscheduled milestone Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021