
Templates referencing Consul service pools stop receiving updates #11558

Closed
dpn opened this issue Nov 22, 2021 · 4 comments

dpn commented Nov 22, 2021

Nomad version

/ # nomad version
Nomad v1.0.6 (592cd4565bf726408a03482da2c9fd8a3a1015cf)

Operating system and Environment details

RHEL 7.7 running on colocated bare-metal hosts and VMs, as well as on AWS EC2 instances.

Issue

When a template references a Consul service, the Consul Template integration sometimes stops re-rendering when updates occur to the service pool.

We use this pattern all over the place in our cluster and have seen reports of this 4 times across the past few years of running Nomad: once 2 years ago, and now 3 times in the past 2 months. It seems to be a very rare occurrence, but we didn't have visibility into this exact issue until recently, so with this new alerting we may be able to catch it more often and provide better diagnostic information.

Reproduction steps

Let's say we have a job with a count of 5, allocations with short IDs 00000000 through 00000004, and that consul-service-a ends up being the service affected by this bug.

  1. Put one of the consul-service-a pool members into maintenance mode (a sketch of the maintenance-mode commands is included after this list). Note that all allocations properly render the remaining service pool members to disk and are sent SIGHUP.
  2. Remove maintenance mode from the consul-service-a pool member. Note that all allocations properly render the updated service pool to disk and are sent SIGHUP.
  3. Wait. We're unsure what the underlying trigger is that leads to this behavior.
  4. Eventually notice that allocation 00000000 is no longer logging "Template re-rendered" messages in the allocation logs in the UI, although its sibling allocations continue to do so.
  5. Inspect the rendered file for allocation 00000000 and note that the node entries for service consul-service-a do not match what is rendered for allocations 00000001-00000004.
  6. While in this broken state, put a member of consul-service-b into maintenance mode. Note that all allocations correctly render out the node changes for the consul-service-b service pool. The rendered nodes for consul-service-a are still out of sync between allocation 00000000 and the rest at this point.
  7. Repeat the previous step with consul-service-c, noting the same behavior.
  8. Bounce the Nomad agent on the host running allocation 00000000. When the agent comes back up, the node entries for all three Consul services are in sync across all of the allocations.

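For steps 1 and 2, maintenance mode on a service instance can be toggled via the local Consul agent along these lines (the service ID and reason here are illustrative; use whatever ID the service is registered under on that node):

# Put the local instance of consul-service-a into maintenance mode
consul maint -enable -service=consul-service-a -reason="repro for #11558"

# Equivalent call against the local agent's HTTP API
curl -X PUT "http://127.0.0.1:8500/v1/agent/service/maintenance/consul-service-a?enable=true"

# Take it back out of maintenance mode
consul maint -disable -service=consul-service-a
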
Expected Result

The Consul template integration continues to render out Consul service pool changes for all of the services referenced in the template.

Actual Result

Updates for one of the services in the affected allocation stop until the local Nomad agent is bounced.

Job file (if appropriate)

Abridged, but boils down to something like:

~~~~~ 8< ~~~~~~

group "group" {
  count = 5

  template {
    change_mode = "signal"
    change_signal = "SIGHUP"
    data = <<EOH
    {{ $nodes := service "consul-service-a" }}
    {{ if $nodes }}
      {{ range $nodes }}{{ .Address }}{{ end }}
    {{ end }}

    {{ $nodes := service "consul-service-b" }}
    {{ if $nodes }}
      {{ range $nodes }}{{ .Address }}{{ end }}
    {{ end }}

    {{ $nodes := service "consul-service-c" }}
    {{ if $nodes }}
      {{ range $nodes }}{{ .Address }}{{ end }}
    {{ end }}
    EOH

    destination = "local/file"
  }
}

~~~~~ 8< ~~~~~~

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

We didn't have visibility into this occurring when the incident that triggered this bug report happened. I'm guessing logs from that time would have been the most interesting, but we won't be able to gather them until we see another reproduction. At least with our increased visibility into this issue we should have a better shot at capturing these. Would anything else besides logs be useful for us to gather?

Otherwise, on working agents you'll see this log message during the breakage:

Nov 22 14:39:24 a-nomad-client.some.dc nomad[3280]: 2021/11/22 14:39:24.334287 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/552216c9-3c59-bf41-166f-104a671c0815/group/local/file"

On the broken agents, you won't.
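
For reference, a quick way to check whether a given agent is still rendering, assuming the Nomad client logs to journald under a nomad unit (as in the line above; the unit name is an assumption):

# Look for recent template renders from the Nomad client
journalctl -u nomad --since "1 hour ago" | grep 'rendered "(dynamic)"'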

lgfa29 (Contributor) commented Nov 25, 2021

Thanks for the report @dpn.

I can't think of anything off the top of my head that could be causing this problem, so we would need time to investigate it further. As you mentioned, this happens sporadically, which makes it harder to reproduce.

Checking the Consul logs the next time this happens may be helpful as well. It could be that the local Consul agent is losing connection with its peers.
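
A couple of ways to capture those Consul agent logs the next time this reproduces (a sketch; the journald unit name is an assumption about your setup):

# Stream logs from the local Consul agent at debug level
consul monitor -log-level=debug

# Or pull recent agent logs from journald
journalctl -u consul --since "2 hours ago"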

tgross (Member) commented Jan 10, 2022

We just landed some improvements to template configuration in #11606 that will let you fine-tune waiting, stale, and retry behaviors. That'll ship in the next scheduled version of Nomad and may help out once we've got more detailed logs from an incident to work with.

Going to mark this as waiting for reply to see if @dpn is able to produce logs, but otherwise hopefully 1.2.4 will provide them with the knobs to fix the underlying problem.
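
For anyone finding this later, a rough sketch of the kind of client agent configuration #11606 exposes (option names should be verified against the Nomad 1.2.4 agent configuration docs; the values are illustrative, not recommendations):

# Nomad client agent configuration -- sketch only; verify option names against
# the 1.2.4 docs. Values are illustrative, not tuning advice.
client {
  template {
    # How stale a Consul read is allowed to be before falling back to the leader.
    max_stale = "300s"

    # Quiescence timers: how long the template runner waits for data to settle
    # before re-rendering.
    wait {
      min = "5s"
      max = "4m"
    }

    # Retry behavior for failed Consul queries from the template runner.
    consul_retry {
      attempts    = 12
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}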

@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jan 10, 2022
dpn (Author) commented Mar 17, 2022

Thanks guys. As usual, apologies for the delayed response. We have built alerting around this issue and have not seen it reproduce since the initial report, so I think we're good to close this issue for now, as I no longer have logs to provide from the original incident. If we see this again I will be sure to capture them and update this issue. Thanks again!

@dpn dpn closed this as completed Mar 17, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 17, 2022
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022