
Services not unregistered #16616

Closed
dani opened this issue Mar 23, 2023 · 59 comments · Fixed by #16905

Comments

@dani

dani commented Mar 23, 2023

Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded. Old service versions then appear as failed, e.g.

(screenshot)

Environment:

  • Nomad 1.5.2 (with ACL), using the prebuilt binary
  • Consul 1.15.1 (with ACL), using the pre-built binary
  • Alma Linux 8

I haven't yet found a pattern that reproduces it 100% of the time.

@jrasell
Member

jrasell commented Mar 23, 2023

Hi @dani, do you have any logs from the clients that were running the allocations whose services should have been deregistered? If you do and can pass them along, I can take a look through them and see if I can identify anything useful. Any other information that could help us reproduce this would also be great.

@jrasell jrasell self-assigned this Mar 23, 2023
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 23, 2023
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 23, 2023
@shoenig
Member

shoenig commented Mar 23, 2023

1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files, which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag marking the services as deregistered after the PostRun() allocrunner hook runs, preventing further attempts at deregistration.

Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.

@dani
Author

dani commented Mar 23, 2023

I was just able to trigger it on my Prometheus job:

  • I have a version of prom running, with 2 instances, and 2 services registered on Consul
  • I change a config file and run the job again; Nomad starts the rolling update
  • In the end, I still have 2 instances running, two working services, and two failed services (corresponding to the two previous ones)

Here are my system logs from this rolling update:

nomad_unreg.txt

@jrasell
Member

jrasell commented Mar 24, 2023

Hi @dani, I've not been able to reproduce this locally yet; are you able to share the jobspec (or a redacted version) you are using, and what exactly is being changed before you register the new version? Thanks.

@dani
Author

dani commented Mar 24, 2023

OK, this particular job file is quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).

@martdah

martdah commented Mar 24, 2023

I have seen the same issue; I've even reproduced it using the counter demo app.
The issue only happens for me when ACLs are enabled.
nomad: 1.5.2-1
consul: 1.15.1-1
ubuntu: 20.04

Deploy the demo app, add an additional tag, and re-deploy, and you now have two instances registered in Consul.
Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times; some requests will return "connection reset by peer" because Consul is now returning services that are "complete" in Nomad.
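
A rough sketch of that sequence with the CLI; the jobspec filename, task name, and upstream name below assume the standard Connect counter demo (count-api / count-dashboard) and are only illustrative, and the loop assumes curl is available in the image:

# deploy the demo, then add an extra tag to a service block and redeploy
nomad job run countdash.nomad
# ...edit countdash.nomad: add a tag to the count-api service stanza...
nomad job run countdash.nomad

# from inside the downstream (dashboard) task, hit the upstream repeatedly;
# stale instances surface as "connection reset by peer"
nomad alloc exec -task dashboard <alloc-id> /bin/sh -c \
  'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" "http://${NOMAD_UPSTREAM_ADDR_count_api}"; done'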

hope this helps.

I have also reverted my lab to 1.5.1-1

@chenjpu

chenjpu commented Mar 24, 2023

I had the same problem, and the Nomad native service discovery (NSD) provider has the same problem as well.

@ngcmac

ngcmac commented Mar 25, 2023

Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes.
It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).

Reverted to Nomad 1.4.5.

Regards.

@jrasell
Member

jrasell commented Mar 27, 2023

Hi everyone and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so will update here once we have anything more.

@tgross
Member

tgross commented Mar 31, 2023

Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: #16739

@jrasell
Member

jrasell commented Apr 3, 2023

Hi everyone, we are continuing to look into this, and while we were able to reproduce it in one scenario, I wanted to gather some more information.

For those who have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or the agent config? This is a requirement in Consul v1.15.0 and later.

It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
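
For reference, the CLI and HTTP API ways of setting the agent token look roughly like this (addresses and token values are placeholders):

# CLI (needs an ACL token with acl:write, e.g. via CONSUL_HTTP_TOKEN)
consul acl set-agent-token agent "<agent-token-uuid>"

# HTTP API equivalent
curl -s --header "X-Consul-Token: <token-with-acl-write>" -X PUT \
  "http://127.0.0.1:8500/v1/agent/token/agent" \
  -d '{"Token": "<agent-token-uuid>"}'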

@dani
Author

dani commented Apr 3, 2023

In my case, I set the token in the config file, like

acl {
  enabled = true
  enable_token_persistence = true
  default_policy = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)

@ngcmac

ngcmac commented Apr 5, 2023

We are also setting it via consul config:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

@suikast42
Contributor

This issue is still present in Nomad v1.5.3.

@CarelvanHeerden

CarelvanHeerden commented Apr 13, 2023

An observation from my side.

I created this bash script to clean up the services that were not deregistered from Consul, as an interim solution.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500"  # no trailing slash; the API paths below begin with /v1
CONSUL_TOKEN="XXXX"

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

This is still on Nomad 1.5.2.
Upgrading to 1.5.3 now to test again.

@suikast42
Contributor

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue does not belong to Consul. If you restart the Nomad service, the dead services disappear from both Nomad and Consul.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug.

@rgruyters

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent acl tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

@suikast42
Contributor

suikast42 commented Apr 16, 2023

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent acl tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  },
}

I don't have ACLs enabled. I don't think this issue is related to the ACL system.

@bfqrst

bfqrst commented Apr 18, 2023

Not sure if related, but I keep experiencing this without ACLs being turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3...

@icyleaf

icyleaf commented Apr 19, 2023

I had this issue with the nomad service provider without ACLs; the context and details are in #16890.

ubuntu 22.04.2 LTS
nomad 1.5.3
docker 23.0.3

@fredwangwang
Contributor

fredwangwang commented Apr 20, 2023

Encountered this as well; I am able to reproduce it quite reliably with the following sequence:

  1. Restart Alloc

(screenshot)

  2. Immediately after, Stop Alloc (a CLI equivalent is sketched at the end of this comment)

(screenshot)

Using Nomad 1.4.7.

(screenshot)

The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.

Suspecting it could be related to #16289, but I haven't confirmed that yet.

Update: downgraded clients to 1.4.6, and I don't (seem to) see this issue anymore using the above steps.
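
For reference, the same sequence via the CLI is roughly (the alloc ID is a placeholder):

nomad alloc restart <alloc-id>   # restart the allocation
nomad alloc stop <alloc-id>      # then stop it immediately afterwards
# the stale service entries should now be visible in the Consul catalog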

@tgross
Member

tgross commented May 14, 2024

Another way to repro is to do systemctl restart nomad on client(s) - When I run this, service templates are getting messed up.

@blmhemu can you clarify this? What does "messed up" mean here? (Note that #19542 is unrelated, so if the API output is fine, you should take that over to #19542 and it'll get resolved over there.) Never mind, I see what you mean in #18203.

@tgross
Member

tgross commented May 14, 2024

Thanks @linuxoid69! So that looks like it's the Consul agent token configuration. The allowed configuration for that changed a while back, so that reinforces what was described in #16616 (comment).

@tgross
Member

tgross commented May 15, 2024

Hi folks, just a quick update... I still don't have a reproduction for this issue. But by revisiting the code, I see a few places where it's possible to drop deregistrations and places where we could be ensuring data integrity between allocations and their service registrations.

  • When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services (ref node_endpoint.go#L704-L718). This makes it possible for the leader to successfully write the node update to Raft without removing services.

  • When the allocation is stopped, we deregister the service in the alloc runner's PreKill hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload ignores any errors. The RemoveWorkload method in the service client for NSD doesn't retry if there's any issue either; it only logs. The PreKill hook interface doesn't allow errors to block the shutdown of an allocation, but what we can do is give RemoveWorkload a chance to retry recoverable errors from transient network issues (a rough sketch of that retry idea follows this list).

  • When an allocation is marked terminal from the client in UpdateAllocsFromClient, we have an opportunity to ensure data integrity by deregistering services for that allocation. This doesn't quite ensure correct behavior on its own, as we're supposed to be deregistering before the allocation is terminal. But it at least prevents the services from sitting around.

  • When an allocation is deleted (during eval garbage collection, see state_store.go#L3617-L3632), we have the same opportunity to ensure data integrity.
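
As a rough illustration of that retry-on-recoverable-errors idea (not Nomad's internal Go code), here is a bash sketch reusing the Consul catalog deregister call from the cleanup script earlier in this thread; CONSUL_HTTP_ADDR and CONSUL_TOKEN are assumed to be set:

# sketch only: retry a deregistration a few times with backoff so a transient
# network error doesn't leave a stale registration behind
deregister_with_retry() {
  local node="$1" service_id="$2"
  local attempts=5 delay=1
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT \
         "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" \
         -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}" >/dev/null; then
      return 0
    fi
    sleep "${delay}"
    delay=$((delay * 2))
  done
  echo "giving up on ${service_id} after ${attempts} attempts" >&2
  return 1
}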

Without a reliable reproduction that covers everyone's reports I can't guarantee that fixing the above problems will close out this issue for Nomad services forever. But these are at least plausibly involved. I'm going to work up a patch or patch series for these, and should have those up for review in the next few days assuming all goes well.

Thanks again everyone for your patience with this issue!

tgross added a commit that referenced this issue May 15, 2024
This changeset fixes three potential data integrity issues between allocations
and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and
  Consul tokens (for the pre-Workload Identity workflows) after we've written
  the node update to Raft. This is unavoidably non-transactional because the
  Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But
  we've unnecessarily mirrored this same behavior to deregister Nomad
  services. This makes it possible for the leader to successfully write the node
  update to Raft without removing services.

  To address this, move the delete into the same Raft transaction. One minor
  caveat with this approach is the upgrade path: if the leader is upgraded first
  and a node is marked down during this window, older followers will have stale
  information until they are also upgraded. This is unavoidable without
  requiring the leader to unconditionally make an extra Raft write for every
  down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in
  data integrity for stale reads seems like a reasonable tradeoff.

* When an allocation is marked client-terminal from the client in
  `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by
  deregistering services for that allocation.

* When an allocation is deleted during eval garbage collection, we have an
  opportunity to ensure data integrity by deregistering services for that
  allocation. This is a cheap no-op if the allocation has been previously marked
  client-terminal.

This changeset does not address client-side retries for the originally reported
issue, which will be done in a separate PR.

Ref: #16616
@tgross
Member

tgross commented May 15, 2024

I've broken out the data integrity fixes to #20590, and I'll do the client-side work as a separate PR.

tgross pushed a commit that referenced this issue May 15, 2024
When the allocation is stopped, we deregister the service in the alloc runner's
`PreKill` hook. This ensures we delete the service registration and wait for the
shutdown delay before shutting down the tasks, so that workloads can drain their
connections. However, the call to remove the workload only logs errors and never
retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so
that transient errors give us an extra opportunity to deregister the service
before the tasks are stopped, before we need to fall back to the data integrity
improvements implemented in #20590.

Ref: #16616
@tgross
Member

tgross commented May 15, 2024

Client-side work is up in #20596

@tgross
Member

tgross commented May 16, 2024

Ok folks, #20596 and #20590 have both been merged and will ship in the upcoming Nomad 1.8.0 (and supported backport versions). I'm going to close this issue.

"But I saw it again!"

If you see services left behind after an allocation has stopped from Nomad 1.8.0 or beyond, please let us know. We may move reports to a new issue in order to properly triage. Make sure to include the following:

  • Which service provider you're seeing the problem with ("consul" or "nomad")
  • Which version of Nomad you're using
  • Which version of Consul you're using (if using the "consul" service provider)
  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider)
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
  • Client logs for the allocation as it's shutting down
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)

@tgross tgross closed this as completed May 16, 2024
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done May 16, 2024
@natemollica-nm

@tgross Looks like we have an indication of this again in Nomad v1.8.1.

  • Which service provider you're seeing the problem with ("consul" or "nomad")
    • Consul
  • Which version of Nomad you're using
    • v1.8.1+ent
  • Which version of Consul you're using (if using the "consul" service provider)
    • v1.19.0
  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider)
    Nomad Agent Consul Block
    consul {
       address = "127.0.0.1:8500"
       token   = "<CONSUL BOOTSTRAP TOKEN>"
     }
    Consul Agent Config
    ui = true
    log_level = "INFO"
    data_dir = "/opt/hashicorp/consul/data"
    bind_addr = "0.0.0.0"
    client_addr = "0.0.0.0"
    advertise_addr = "x.x.x.18"
    retry_join = ["project_name=dev-nomad provider=gce tag_value=dev-dev1-dc1-nomad-auto-join"]
    datacenter = "dc1"
    
    ## set up logging
    log_file = "/var/log/consul/"  # service creates a log file under this dir
    log_rotate_bytes = 1073741824  # 1GB
    log_rotate_max_files = 1  # keep 1 additional file jic for troubleshooting
    log_json = true
    log_level = "INFO"
    enable_syslog = false
    
    acl {
       enabled = true
       default_policy = "deny"
       down_policy = "extend-cache"
    }
    
    connect {
     enabled = true
    }
    ports {
     grpc = 8502
    }
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
    • Deploy Nomad job with 3 instance service count using Consul as service provider
    • Update the Job or Remove the Job
    • Redeploy the Nomad job and observe Consul's service catalog: the older instances were never properly deregistered (a quick catalog query for this is sketched at the end of this comment).
  • Nomad Client logs for the allocation as it's shutting down
    • Will send separately via Slack and ZD
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)
    • Sending separately via Slack and ZD

Will work on getting a reproduction on the side to deliver if needed; just let me know in Slack. Wanted to post here to keep track.
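
A quick way to check for the lingering instances after such a redeploy (address, token, and service name are placeholders):

curl -s --header "X-Consul-Token: <token>" \
  "http://127.0.0.1:8500/v1/health/service/<service-name>" |
  jq '.[] | {ServiceID: .Service.ID, Node: .Node.Node, Checks: [.Checks[].Status]}'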

@ngcmac

ngcmac commented Jul 4, 2024

Hi @tgross ,

We also still have this issue in our CI environment (1 server nomad + consul + vault + 3 clients).
We started to observe an increase in this after we enabled Workload Identity (WI) for Consul.

  • Which service provider you're seeing the problem with ("consul" or "nomad"): Consul

  • Which version of Nomad you're using: 1.8.1

  • Which version of Consul you're using (if using the "consul" service provider): 1.18.1

  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider):

    • Nomad:
    {
      "consul": {
        "address": "127.0.0.1:8501",
        "grpc_address": "127.0.0.1:8503",
        "auto_advertise": true,
        "server_auto_join": true,
        "client_auto_join": true,
        "token": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
        "ssl": true,
        "ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "cert_file": "/opt/consul/certs/consul-server-cert.pem",
        "key_file": "/opt/consul/certs/consul-server-key.pem",
        "grpc_ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "verify_ssl": false
      }
    }
    
    • Consul (default config + acl):
    "node_name": "<node_name>",
    "addresses":{
      "http": "0.0.0.0",
      "https": "0.0.0.0",
      "dns": "0.0.0.0",
      "grpc_tls": "0.0.0.0"
    },
    "data_dir": "/opt/consul/data",
    "log_level": "INFO",
    "datacenter": "dc1",
    "bind_addr": "<ip_addr>",
    "advertise_addr": "<ip_addr>",
    "ports": {
      "server": 8300,
      "serf_lan": 8301,
      "serf_wan": 8302,
      "http": 8500,
      "https": 8501,
      "grpc_tls": 8503,
      "dns": 8600,
      "sidecar_min_port": 21000,
      "sidecar_max_port": 21255,
      "expose_min_port": 21500,
      "expose_max_port":  21755
    },
    "server": false,
    "retry_join": ["<consul_server>"],
    "rejoin_after_leave": true,
    "connect": {
      "enabled": true
    },
    "encrypt": "<some_secret>",
    "tls": {
      "defaults": {
        "ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "cert_file": "/opt/consul/certs/consul-server-cert.pem",
        "key_file": "/opt/consul/certs/consul-server-key.pem",
        "verify_incoming": true,
        "verify_outgoing": true
      },
      "https": {
        "verify_incoming": true
      },
      "internal_rpc": {
        "verify_server_hostname": true,
        "verify_incoming": true
      },
      "grpc": {
        "verify_incoming": false
      }
    },
    "auto_encrypt": {
      "tls": true
    },
    "limits": {
      "http_max_conns_per_client": 8000,
      "https_handshake_timeout": "15s",
      "rpc_max_conns_per_client": 400,
      "rpc_handshake_timeout": "15s"
    }
    
    "acl": {
      "enabled": true,
      "tokens": {
        "agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      },
      "default_policy": "deny",
      "down_policy": "extend-cache",
      "enable_token_persistence": true
    }
    
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")

    • A Nomad job is submitted and the service is registered in Consul during startup in an unhealthy state. For some reason (missing variables, secrets, a code problem, etc.), the allocation becomes unhealthy on Nomad and restarts. The allocation keeps restarting according to the restart policy. When the allowed attempts are exceeded, Nomad "kills" the job but is unable to deregister the service from Consul. We can see "ACL not found" errors on the Consul client. We also see an increase in the nomad.client.consul.sync_failure metric (it was 0 before). I need to use a management token to manually deregister it, or restart the Nomad node where the allocation was started in the first place and got unhealthy (a sketch of the manual deregistration is at the end of this comment). This keeps repeating over and over while the alloc restarts (and may happen on different nodes, since we use "SchedulerAlgorithm": "spread").
  • Client logs for the allocation as it's shutting down (in this case a new deploy of an existing job is submitted with 3 allowed restarts)

Nomad:
2024-07-04T08:13:06.785415+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.785Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Received msg="Task received by client" failed=false
2024-07-04T08:13:06.820716+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.820Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Task Setup" msg="Building Task Directory" failed=false
2024-07-04T08:13:06.888908+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.888Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
2024-07-04T08:13:09.227594+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.227Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2024-07-04T08:13:09.228156+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.228Z [WARN]  agent: (clients) disabling consul SSL verification
2024-07-04T08:13:09.228795+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.228Z [INFO]  agent: (runner) creating watcher
2024-07-04T08:13:09.229282+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.229Z [INFO]  agent: (runner) starting
2024-07-04T08:13:09.250898+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.250Z [WARN]  agent: (runner) watching 186 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.291654+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.291Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logback.xml"
2024-07-04T08:13:09.316109+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.315Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logrotate"
2024-07-04T08:13:09.344981+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.344Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/envs"
2024-07-04T08:13:09.351655+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.351Z [WARN]  agent: (runner) watching 187 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.380496+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.380Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/backoffice-parking-consumer.conf"
2024-07-04T08:13:09.494995+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.494Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/main.properties"
2024-07-04T08:13:09.674300+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.674Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:13:16.995273+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:16.995Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.440496+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:17.439Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.493295+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:17.493Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:13:36.986712+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:36.986Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:13:36.993466+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:36.993Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2596806
2024-07-04T08:13:37.006807+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:37.006Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.272860626s
2024-07-04T08:13:37.007018+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:37.006Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.272860626s" failed=fals
2024-07-04T08:14:12.446927+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:12.446Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:14:12.907417+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:12.907Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.405087+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:13.404Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.463133+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:13.463Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:14:32.409339+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.408Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:14:32.416647+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.416Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2604407
2024-07-04T08:14:32.428570+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.428Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.718148757s
2024-07-04T08:14:32.428710+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.428Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.718148757s" failed=false
2024-07-04T08:15:08.255694+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:08.254Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:15:08.761774+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:08.761Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.206826+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:09.206Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.257945+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:09.257Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:15:27.751406+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.751Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:15:27.758280+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.758Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2611952
2024-07-04T08:15:27.769543+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.769Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=32.650043057s
2024-07-04T08:15:27.769613+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.769Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 32.650043057s" failed=false
2024-07-04T08:16:00.454924+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:00.454Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:16:00.873403+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:00.872Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.344882+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:01.344Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.406204+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:01.406Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:16:19.587339+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.587Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:16:19.593708+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.593Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2618891
2024-07-04T08:16:19.604943+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.604Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\""
2024-07-04T08:16:19.605031+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.604Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Not Restarting" msg="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\"" failed=true
2024-07-04T08:16:19.614604+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.614Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer plugin=/opt/nomad/bin/nomad id=2595171
2024-07-04T08:16:19.614846+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.614Z [INFO]  agent: (runner) stopping
2024-07-04T08:16:19.616711+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.616Z [INFO]  client.gc: marking allocation for GC: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e
2024-07-04T08:16:19.616802+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.616Z [INFO]  agent: (runner) received finish
2024-07-04T08:16:19.639835+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.637Z [WARN]  consul.sync: failed to update services in Consul: error="Unexpected response code: 403 (ACL not found)"
Consul:
2024-07-04T08:13:06.832454+00:00 <node_name> consul[4075309]: 2024-07-04T08:13:06.832Z [INFO]  agent: Synced service: service=_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer-
2024-07-04T08:16:19.637477+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:19.639781+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:20.643069+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:20.643185+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:22.649259+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:22.649380+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:25.657196+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:25.657348+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:29.664423+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:29.664523+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:34.670396+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:34.670535+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:40.676407+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:40.676589+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:47.682542+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.681Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:47.682691+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.682Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:55.689246+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:55.689333+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)
  • Nothing with "log_level": "INFO"
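
For reference, the manual workaround mentioned in the sequence of events above looks roughly like this (the agent address and management token are placeholders; the service ID is the one from the logs):

curl -s --header "X-Consul-Token: <management-token>" -X PUT \
  "http://127.0.0.1:8500/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer-"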

Thanks

@tgross
Member

tgross commented Jul 8, 2024

@ngcmac I'm going to continue the investigation for the case where we're getting "ACL not found" with Workload Identities in #23494. @natemollica-nm is going to do some follow-up on his report and we may or may not adjust #23494 to cover that case as well depending on the outcome.
