Fail to place allocations on clients running Consul 1.13.8 #17302

Closed
josh-m-sharpe opened this issue May 24, 2023 · 4 comments · Fixed by #17349
Labels: stage/accepted, theme/consul, type/bug

Comments

@josh-m-sharpe

Our stable versions:
consul: 1.13.2
nomad: 1.3.5

I decided to start upgrading, and figured I'd go to the last patch release of 1.13.x and 1.3.x respectively.
Attempted version combination:
consul: 1.13.8
nomad: 1.3.14 (I started this process before 1.3.15 dropped)

When I launch nomad with these versions, I see this output in /var/log/messages:

nomad: 2023-05-24T16:50:05.821Z [WARN]  client.fingerprint_mgr.consul: unable to fingerprint consul: attribute=consul.grpc
nomad: 2023-05-24T16:50:05.821Z [WARN]  client.fingerprint_mgr.consul: unable to fingerprint consul: attribute=consul.sku

This seems to prevent nomad from scheduling containers on that client. It only happened once, and it took out my site, so I'm a bit reluctant to "test" that part again. I'm super interested in understanding and eliminating this warning, though. It all seems odd: the consul agent joins the cluster and looks healthy. The nomad agent joins as well but just shows up as an empty client, and nothing gets provisioned there.

I can't find much on the internet about what those attributes are and/or what nomad is looking for. So I don't have much to go on.

By trial and error I've sorta determined that with consul 1.13.7 this issue doesn't show up, so by downgrading to:
Consul: 1.13.7
Nomad: 1.3.14
...I sorta have a new stable set of versions.

I saw this release: https://github.com/hashicorp/nomad/releases/tag/v1.3.10 which mentions "consul: add client configuration for grpc_ca_file", so I attempted nomad 1.3.9 with consul 1.13.8, and that still produces these warnings.

Before I go try all the versions of nomad to figure out what's going on: is there anything I've missed? I didn't see much in any of the release notes between Nomad 1.3.5 and 1.3.15. Thanks!

erulabs commented May 24, 2023

We're seeing the same thing. It occurred when we upgraded Nomad from 1.5.5 to 1.5.6 and Consul from 1.13.7 to 1.13.8.

Given @josh-m-sharpe's versions, I suspect the issue here is Consul 1.13.8 rather than Nomad. Will test this shortly.

My intuition suggests hashicorp/consul#17270 is the source of the breakage as the initial message is "unable to fingerprint consul: attribute=consul.grpc"

edit: opened issue with consul because this appears to be an issue on their side

josh-m-sharpe (Author) commented May 24, 2023

@erulabs thanks for the confirmation, but I found that consul 1.14.7 (latest version) works for us. There was a minor config change needed to launch that, and everything's working. I did our staging cluster this afternoon and am doing prod in the morning. So I guess I'm past whatever this issue is.

erulabs commented May 25, 2023

@josh-m-sharpe awesome - I'll look at going to consul 1.14 as well. Rolling back to 1.13.7 and keeping Nomad 1.5.6 works properly as well.

lgfa29 (Contributor) commented May 29, 2023

Hi @josh-m-sharpe and @erulabs 👋

Thank you for the report. Upon further investigation I found that the problem is a breaking API change in Consul: the version value returned by the /v1/agent/self endpoint has an extra line break at the end.

$ curl -s http://localhost:8500/v1/agent/self | jq '.DebugConfig.Version'
"1.13.8\n"

I'm not sure why this happened, but I opened hashicorp/consul#17503 in the Consul repo.
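
The curl check above can also be scripted. The following is only a minimal sketch: it assumes a local agent on the default address from the curl command, and the helper names (`extract_version`, `has_stray_whitespace`, `fetch_agent_self`) are hypothetical, not part of Consul or Nomad.

```python
import json
import urllib.request

# Assumption: a local Consul agent on the same address used in the curl command.
CONSUL_SELF_URL = "http://localhost:8500/v1/agent/self"


def extract_version(agent_self: dict) -> str:
    """Return DebugConfig.Version exactly as the agent reported it."""
    return agent_self["DebugConfig"]["Version"]


def has_stray_whitespace(version: str) -> bool:
    """True when the version string has leading or trailing whitespace."""
    return version != version.strip()


def fetch_agent_self(url: str = CONSUL_SELF_URL) -> dict:
    """Fetch and decode the /v1/agent/self response (needs a running agent)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

Against a running agent, `has_stray_whitespace(extract_version(fetch_agent_self()))` should return `True` for a response like the one shown above and `False` for a clean version string.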

This extra line break causes the Nomad Consul fingerprint to fail when it tries to parse the version in the gRPC and SKU detectors, which matches the consul.grpc and consul.sku warnings reported above. I opened #17349 to prevent problems like this from affecting Nomad in the future.
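
To illustrate why a single trailing line break is enough to break a strict version parser, here is a small sketch in Python. It is not Nomad's actual Go code, and the trimming shown is one possible defensive approach assumed for illustration, not a description of what #17349 does. The names `parse_strict` and `parse_lenient` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical strict pattern: exactly MAJOR.MINOR.PATCH and nothing else.
_STRICT_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_strict(raw: str) -> Optional[Tuple[int, int, int]]:
    """Parse a version, rejecting any extra characters (including '\\n')."""
    match = _STRICT_VERSION.fullmatch(raw)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def parse_lenient(raw: str) -> Optional[Tuple[int, int, int]]:
    """Trim surrounding whitespace first, then parse strictly."""
    return parse_strict(raw.strip())
```

With these definitions, `parse_strict("1.13.8\n")` returns `None`, while `parse_lenient("1.13.8\n")` returns `(1, 13, 8)`.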

As far as I can tell this is the only version of Consul with this broken version return value, so other versions seem safe to upgrade to.

Apologies for the headache during the upgrade process.

lgfa29 added the theme/consul and stage/accepted labels May 29, 2023
lgfa29 self-assigned this May 29, 2023
lgfa29 changed the title from "issue upgrading along the 1.3.x path - prevents nomad allocations" to "Fail to place allocations on clients running Consul 1.13.8" May 30, 2023
lgfa29 pinned this issue May 30, 2023
jamesnyika unpinned this issue Jun 14, 2023