consul connect is extremely unreliable at startup #9307

Closed
kneufeld opened this issue Nov 10, 2020 · 34 comments · Fixed by #23381

@kneufeld
Contributor

I've just had multiple multi-hour outages in the past two days because consul connect proxies would not start.

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

That's it, no logs because it never actually started. Granted, this is a complicated job with 5 groups and proxies between them, but it used to work. It seems that as soon as I had multiple jobs (built from a template) the proxies started to act up. Proxy names are all unique, based on job name and group.

All I can do is repeatedly hit restart until the gods smile on me and the proxy finally starts.

Nomad version

Started at 0.11.6 and desperately tried to upgrade to 0.12.7 to no avail.

Nomad v0.12.7 (6147cb578794cb2d0c35d68fe1791728a09bb081)

Operating system and Environment details

Ubuntu 18.04 and Ubuntu 20.04

Reproduction steps

Make a job with multiple groups and multiple connect sidecars between them (but not circular). The first instance of that job will probably work. Make 10 more.

After investing a lot of time and energy in Nomad I'm afraid I'll have to drop it, and I have no idea what I'm going to do...

@shoenig
Member

shoenig commented Nov 10, 2020

Hi @kneufeld sorry you're having trouble with this.

Seeing the exit status 1 error message for creating bootstrap configuration indicates something went wrong when Nomad execs consul to retrieve the Envoy bootstrap configuration from Consul. Did you happen to capture Consul logs while this is happening? They may shed some light on what's going on. What version of Consul are you using?
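
For reference, that bootstrap step is essentially a shell-out to the consul binary. The following is only a minimal sketch of that kind of call (not Nomad's actual envoy_bootstrap hook, and the service ID shown is a hypothetical placeholder):

package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

func main() {
    // Hypothetical Consul service ID for the sidecar's parent service; Nomad
    // derives the real one from the allocation, group, and service name.
    sidecarFor := "_nomad-task-<alloc-id>-group-<group>-<service>-<port>"

    // Ask the consul binary to print the Envoy bootstrap config to stdout
    // instead of launching Envoy itself.
    cmd := exec.Command("consul", "connect", "envoy", "-bootstrap", "-sidecar-for", sidecarFor)
    var stdout, stderr bytes.Buffer
    cmd.Stdout = &stdout
    cmd.Stderr = &stderr

    if err := cmd.Run(); err != nil {
        // A non-zero exit here is what surfaces as "exit status 1" in the Nomad
        // logs; the useful detail (e.g. "No sidecar proxy registered for ...")
        // is on stderr.
        fmt.Printf("envoy bootstrap failed: %v stderr=%q\n", err, stderr.String())
        return
    }
    fmt.Println(stdout.String()) // Envoy bootstrap JSON on success
}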

@kneufeld
Contributor Author

kneufeld commented Nov 10, 2020

Nov 10 08:54:21 clients-nomad1.example.com consul[6571]:     2020-11-10T08:54:21.287-0800 [ERROR] agent.proxycfg: watch error: service_id=_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-     <redacted>-proxy-redis--sidecar-proxy id=service-http-checks:_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis- error="error filling agent cache: Internal cache failure: service '_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis-' not in agent state"

This is the only thing I can find that looks interesting.

# consul version
Consul v1.8.5
Revision 1e03567d3
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

I'd say there is a race condition between groups (I'm actually pretty positive there is), but this error happens with my redis task, which is at the "bottom": everything else has a proxy connecting to it. i.e. redis is the "server" side of the proxy and does not have any "client" proxy of its own.

@shoenig
Member

shoenig commented Nov 10, 2020

Can you take a look at the Nomad agent logs when this happens? There should be a log line with the stderr output from exec-ing the consul binary, with the relevant error message. For example, something like:

2020-11-10T14:40:33.171-0600 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=8df94e24-253c-5fed-acde-63b42af168a8 task=connect-proxy-count-api error="exit status 1" stderr="This error message goes to stderr!"

@kneufeld
Contributor Author

Found these lines

Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:01.763-0800 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="exit status 1" stderr="==> No sidecar proxy registered for _nomad-task-ad06b594-d5b0-331c-6c3f-4843b2070eb8-group-housekeeper-<redacted>-centre-proxy-usher-
Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]: "
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis reason="Restart within policy" delay=11.306497887s
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.584-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.585-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher reason="Restart within policy" delay=11.306497887s

@shoenig
Member

shoenig commented Nov 16, 2020

Hi @kneufeld, the -beta3 release contains #9330, which may help alleviate the issue; could you give that a try? Basically, since 0.12 Nomad has been re-registering services with connect sidecars any time it needed to reconcile with Consul. On a busy cluster, this could mean Nomad is doing so continuously, possibly causing enough delay in the registration of new services to produce the symptoms you are seeing.

@kneufeld
Contributor Author

kneufeld commented Nov 16, 2020

That's v1.0-beta3? And I assume both servers and clients need to be upgraded?

Is there a way to restart nomad on a client without restarting the docker containers? I don't want to restart nomad if that will cause the containers to migrate because that will cause an outage on any machine not running v1. And that's assuming the problem is fixed.

Thanks for the quick turnaround time and follow-up.

@kneufeld
Contributor Author

kneufeld commented Nov 20, 2020

I migrated my clients off of Nomad and into their own VMs and started them with docker-compose.

This allowed me to upgrade my west datacenter to v1.0-beta3. My east dc still has clients on 0.12, but the servers over there have been upgraded.

First tests looked good, but then I started getting the same bootstrap errors. I may have found another piece of the puzzle.

Evaluations are insta-failing.

nomad plan job.nomad
nomad job run -check-index 0 job.nomad

Then in hashi-ui I see the following instantly:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: job-registered

followed by a never ending stream of:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: max-plan-attempts

and

Status:    blocked
StatusDescription:     created due to placement conflicts
TriggeredBy:    max-plan-attempts

but I kinda call BS on placement conflicts, since the original nomad plan passed, I have more nodes in my west dc than tasks, and there are no constraints in my job file.

Is it possible that my cluster is just corrupt somehow? If so then how can nomad be trusted?

My final hope is to migrate the rest of my clients off nomad in the east dc and re-bootstrap the entire cluster. I have zero doubts that will work at first but long term?

ps. I've run nomad stop -purge several times

@kneufeld
Contributor Author

@shoenig I got the rest of my jobs off the cluster and rebuilt it. My quick test works, tasks scale, etc.

I really don't know what to think...

@shoenig
Member

shoenig commented Nov 23, 2020

Hi @kneufeld sorry this has caused so much trouble. The Nomad v0.12 release was centered around major underlying network architecture changes, and we've been squashing followup bugs in the point releases since. Though we hope for a smooth transition, the changes were made before Nomad v1.0 specifically so we could provide the stability guarantees of a 1.0 release going forward.

It sounds like the rebuilt cluster is operating okay?

I suspect the evaluation failures may have been caused by the problem described in #9356. The fix for that isn't present in the -beta3, but should be coming in a release soon.

@kneufeld
Contributor Author

@shoenig I'm only running a single job, but I never saw any issues until I tried to run many jobs. I do think the cluster was somehow corrupted, which is its own very troubling thought.

I'm opening another bug that I saw pre-rebuild and post-rebuild with only a single job running.

@kneufeld
Contributor Author

kneufeld commented Dec 1, 2020

@shoenig I just saw the same insta-blocked deployments/evaluations with only a single job in the cluster. I shut down the leader to force an election, and once a new leader was chosen I could successfully deploy a job.

This was after stopping all servers and clients and deleting /var/lib/nomad. In other words, I rebuilt a brand-new cluster. Could there be some lingering schmoo in Consul that's breaking everything?

@kneufeld
Contributor Author

kneufeld commented Dec 1, 2020

@shoenig I just received the same

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

error after starting 3 jobs in my cluster. This bug is not remotely fixed. I emailed my job file in issue #9431

@Legogris

Legogris commented Jan 20, 2021

I'm getting the above behavior when attempting to use Connect on an existing cluster (Consul 1.9.1, Nomad 1.0.2, ACLs and TLS enabled). There are no errors in the logs about a Connect bootstrap error, though, so maybe it's a similar manifestation of something else.

FWIW, I can't get Connect to work at all (using the Counter Dashboard example), even when making sure the node is warm before submitting a single job. Nothing stands out in the logs for consul or nomad, or the envoy containers, apart from Using deprecated option ... warnings in Envoy (tried with 1.13, 1.14, 1.15).

@holtwilkins
Contributor

I've seen this error in particular alongside the generated Nomad ACL tokens. Namely, what seems to happen is that as part of creating a new allocation, Nomad spins up a Consul service token for the sidecar and then immediately tries to use it. This token hasn't quite propagated into Consul yet, so I see some 403 ACL token not found errors in the Nomad logs; as the alloc keeps retrying with the same service token, the token eventually becomes valid and the allocation succeeds.

So, it seems like the sidecar needs to wait a little bit more or have some retries built-in at an earlier phase rather than failing the whole allocation immediately and relying on the general allocation restart behaviour to sort it out.

@Legogris

Sounds different from what I am experiencing here (though I think I've absolutely seen what you mention as well before).

All checks passing, services considered healthy, no errors in either consul or nomad

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@shevagit

We're getting the same error without using ACL tokens; Connect sometimes fails, causing the tasks to fail.

"envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

@betterthanbreakfast

Hello, small follow-up:
We've got Nomad 1.0.4 with TLS, but without ACLs atm.
We've run into the same issue: the job fails during startup with "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

In our case, the cause was the service name, which contained UPPERCASE characters. After transforming them to lowercase, the problem was solved.

@Legogris

Legogris commented Apr 22, 2021

Got hit by this again now when updating an existing job (the only thing that changed was a no-op change inside a generated template).

Stopping the job and starting it again:

[ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=foobar task=connect-proxy-deadbeef error="exit status 1" stderr="==> No proxy ID specified. One of -proxy-id, -sidecar-for, or -gateway is required
[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=foobar task=connect-proxy-deadbeef error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

Tried the following, which does not resolve it:

  • Stopping the job
  • Draining the client
  • Stopping nomad on the client
  • Doing a full reboot of the client instance
  • Doing a GC (making sure the stopped job is forgotten)
  • Starting nomad on the client again
  • Scheduling a new instance of the job

The same sidecar proxy (there are 4 in total in the group) keeps failing with the same errors.

Consul logs around the same time:

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.34:8300 error="rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.33:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"

@Legogris

Legogris commented Apr 22, 2021

Stopped all other jobs related to the failing sidecar container, performed a rolling restart of all consul servers and nomad servers, and rescheduled; still the same.

Observation: when draining the node, the 3 other sidecar tasks for the group containing the failing one are stuck in running despite the main tasks being successfully killed. This is the first time I've seen a drain take more than a few seconds. Only a manual stop fully stops the job.

In the "zombie" sidecar tasks I see in stderr after the drain (apart from all the deprecation warnings):

[info][upstream] [source/common/upstream/cds_api_impl.cc:64] cds: add 1 cluster(s), remove 1 cluster(s)

@Legogris

Legogris commented Apr 22, 2021

Arbitrarily changing the service name (amending intentions accordingly) made it come back up again. So it seems to be the exact same thing @flytocolors was having (the casing is most likely coincidental for them, though, as that was not a factor here).

@shoenig
Member

shoenig commented Apr 26, 2021

@Legogris are you using Ingress Gateways, and on which version of Nomad? We fixed a bug that would trigger that error in Nomad v1.0.4. If we're still running into that error with just normal sidecars we can investigate further. In particular this log line (with DEBUG enabled) should shed some light on what's happening.

@shoenig
Member

shoenig commented Apr 26, 2021

I don't think this would apply to anyone using Nomad, but we can also do #10452, which addresses a case that would trigger the aforementioned error: more than one sidecar being registered per parent service. (You'd have to be doing something interesting, like an out-of-band registration.)

@Legogris

@shoenig Do you have some other reason to believe that uppercase service names are indeed a problem? I had the exact same behavior as @flytocolors, except with no uppercase, and in my case the problem was resolved by adding a 2 at the end of the service name.

We're running v1.0.4, just "normal" sidecars (no ingress gateways)

@shoenig
Member

shoenig commented Apr 30, 2021

@Legogris Sorry for the confusion, but no: the case sensitivity thing is just one of at least 4 causes identified so far for connect startup problems:

  • delays in Consul ACL token propagation before Nomad uses the token
  • delays in Consul service registration propagation before the sidecar bootstraps
  • case-sensitive (uppercase) service names
  • more than one sidecar registered per parent service (#10452)

Those last two have definite fixes, and for the first two we're trying #10453, which increases the allowable wait time on Consul from 6 to 60 seconds. If that still doesn't work then we might need some help from Consul on creating a better way to coordinate on propagation.

The behavior you're seeing could also involve one of those first two points, but this log line suggests there's an insufficient Consul agent ACL at play, eventually leading to envoy bootstrapping failing.

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"

@leanst-daniel

leanst-daniel commented Jun 4, 2021

I've deployed a small environment in my lab to test our brand new deployments.
The configuration only contains the very basics and no ACLs, but I also ran into this issue:

Jun 04 10:48:46 nomad1 consul[420]:     2021-06-04T10:48:46.827Z [INFO]  agent: Synced check: check=service:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy:2
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.247Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.247Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [INFO]  agent: Deregistered service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.263Z [INFO]  agent: Deregistered service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.264Z [WARN]  agent: Failed to deregister service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy error="Service "_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy" does not exist"
Jun 04 10:49:07 nomad1 consul[420]:     2021-06-04T10:49:07.159Z [INFO]  agent: Synced service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001
Jun 04 10:49:07 nomad1 consul[420]:     2021-06-04T10:49:07.174Z [INFO]  agent: Synced service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.809Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [INFO]  agent: Deregistered service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.814Z [INFO]  agent: Deregistered service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.815Z [WARN]  agent: Failed to deregister service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy error="Service "_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy" does not exist"
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.822Z [INFO]  agent: Synced service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.839Z [INFO]  agent: Synced service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy
Jun 04 10:49:10 nomad1 consul[420]:     2021-06-04T10:49:10.300Z [INFO]  agent: Synced check: check=service:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-pro

Running on Nomad 1.1.0 and Consul 1.9.5

Thanks!

@jhitt25

jhitt25 commented Feb 21, 2022

Not to "me too" an issue, but im getting this sporadically as well. We're on consul 1.11.2 and nomad 1.2.4, using ACLs for both. Our cluster is running just fine and i can even stop/start/restart existing jobs, but I have a new job that is refusing to start right now. This job has been running off and on for the past week as i tweak things for our environment, but as of this morning it is consistently failing with this exact scenario.

We're testing a 2-node Keycloak cluster in Nomad/Consul Connect with a Traefik frontend. The goal is to have 3 ports: service, health check, and jgroups (for the JBoss clustering). The jgroups port is outside Connect, and the health check is a manual 'expose' because it runs on a different port than the service. This had worked fine for going on 2 weeks, but this morning it started in with the same issue described here. I don't know if some facet of what we're doing adds any new information, but I figured it couldn't hurt.

I've attached our job file along with excerpts of the consul and nomad logs as well, in case they help.

nomad-job.txt
consul-log.txt

@jhitt25

jhitt25 commented Feb 22, 2022

As an update to my addition, I was able to get past the issue by moving my non-connect'ed service (for jgroups) from the group level into the task level.

@schemmerling

schemmerling commented Mar 19, 2022

Well, I will intentionally leave a "me too" here. Note the following error message:
Mar 19 16:51:28 TEST sh[17233]: 2022-03-19T16:51:28.860-0600 [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=global service_id=global id=service-http-checks:hello error="error filling agent cache: Internal cache failure: service 'hello' not in agent state"
Consul v1.11.4

@tgross
Member

tgross commented May 10, 2024

Leaving a note here based on a conversation I had with @schmichael and some folks from Consul recently. We may be running into the issue fixed for the Consul K8s plugin here: hashicorp/consul-k8s#887. Effectively, we're creating the token and then the agent may have a stale view of the world when we use that token. Unfortunately that PR isn't something we can just drop into the Nomad code base because the architecture is a good bit different, but it's a promising lead for when this issue gets revisited next.

@tgross tgross self-assigned this Jun 14, 2024
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Jun 14, 2024
@tgross
Member

tgross commented Jun 14, 2024

I've got an early prototype branch b-consul-token-read-self-on-login that covers the approach described above. I'll be working on getting that completed to cover the legacy workflow next, and then figuring out a way that we can validate that the fix is useful.

Internal ref: https://hashicorp.atlassian.net/browse/NET-10051

tgross added a commit that referenced this issue Jun 18, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
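
In code, the preflight check described in that commit message amounts to roughly the following sketch against the github.com/hashicorp/consul/api client. The timeout and retry interval here are hypothetical, and Nomad's actual implementation (including the node-metadata configuration) differs:

package main

import (
    "fmt"
    "time"

    "github.com/hashicorp/consul/api"
)

// waitForTokenReplication polls the local Consul agent until the freshly
// created ACL token can be read back. The read authenticates with the new
// token itself and uses stale consistency, so a not-yet-replicated token
// fails fast without the failed read being negatively cached.
func waitForTokenReplication(client *api.Client, secretID string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        _, _, err := client.ACL().TokenReadSelf(&api.QueryOptions{
            Token:      secretID, // the just-created service/sidecar token
            AllowStale: true,     // stale read, as described in the commit message
        })
        if err == nil {
            return nil // token is visible to the agent; safe to run the Envoy bootstrap
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("token not replicated after %s: %w", timeout, err)
        }
        time.Sleep(500 * time.Millisecond) // hypothetical retry interval
    }
}
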
@tgross tgross closed this as completed in df67e74 Jun 27, 2024
@tgross
Member

tgross commented Jun 27, 2024

I've just merged #23381, which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above and beyond the ones we've identified here.

@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024