
Services not unregistered #16616

Closed
dani opened this issue Mar 23, 2023 · 59 comments · Fixed by #16905

Comments

@dani

dani commented Mar 23, 2023

Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded. Old service versions then appear as failed, e.g.

(screenshot)

Environment:

  • Nomad 1.5.2 (with ACL), using the prebuilt binary
  • Consul 1.15.1 (with ACL), using the pre-built binary
  • Alma Linux 8

I haven't yet found a pattern that reproduces it 100% of the time.

@jrasell
Member

jrasell commented Mar 23, 2023

Hi @dani, do you have any logs from the clients that were running the allocations whose services should have been deregistered? If you do and can pass them along, I can take a look through them and see if I can identify anything useful. Any other information that could help us reproduce this would also be great.

@jrasell jrasell self-assigned this Mar 23, 2023
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 23, 2023
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 23, 2023
@shoenig
Member

shoenig commented Mar 23, 2023

1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files, which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag marking the services as deregistered after the PostRun() allocrunner hook runs, preventing further attempts at deregistration.

Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.

@dani
Author

dani commented Mar 23, 2023

I was just able to trigger it on my Prometheus job:

  • I have a version of prom running, with 2 instances, and 2 services registered on Consul
  • I change a config file and run the job again; Nomad starts the rolling update
  • In the end, I still have 2 instances running, two working services, and two failed services (corresponding to the two previous ones)

Here are my system logs from this rolling update:

nomad_unreg.txt

@jrasell
Member

jrasell commented Mar 24, 2023

Hi @dani, I've not been able to reproduce this locally yet; are you able to share the jobspec (or a redacted version) you are using, and what exactly is being changed before you register the new version? Thanks.

@dani
Author

dani commented Mar 24, 2023

OK, this particular job file is quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).

@martdah

martdah commented Mar 24, 2023

I have seen the same issue; I've even reproduced it using the counter demo app.
The issue only happens for me when ACLs are enabled.
nomad: 1.5.2-1
consul: 1.15.1-1
ubuntu: 20.04

Deploy the demo app, add an additional tag, and re-deploy, and you now have two instances registered in Consul.
Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times; some requests will return "connection reset by peer" because Consul is now returning services that are "complete" in Nomad.
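
A rough sketch of that sequence with the CLI; the jobspec filename, task name, and upstream name below assume the standard Connect counter demo (count-api / count-dashboard) and are only illustrative, and the loop assumes curl is available in the image:

# deploy the demo, then add an extra tag to a service block and redeploy
nomad job run countdash.nomad
# ...edit countdash.nomad: add a tag to the count-api service stanza...
nomad job run countdash.nomad

# from inside the downstream (dashboard) task, hit the upstream repeatedly;
# stale instances surface as "connection reset by peer"
nomad alloc exec -task dashboard <alloc-id> /bin/sh -c \
  'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" "http://${NOMAD_UPSTREAM_ADDR_count_api}"; done'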

hope this helps.

I have also reverted my lab to 1.5.1-1

@chenjpu

chenjpu commented Mar 24, 2023

I had the same problem, and the Nomad native service discovery (NSD) provider has the same problem as well.

@ngcmac

ngcmac commented Mar 25, 2023

Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes.
It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).

Reverted to Nomad 1.4.5.

Regards.

@jrasell
Member

jrasell commented Mar 27, 2023

Hi everyone and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so will update here once we have anything more.

@tgross
Member

tgross commented Mar 31, 2023

Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: #16739

@jrasell
Member

jrasell commented Apr 3, 2023

Hi everyone, we are continuing to look into this, and while we were able to reproduce it in one scenario, I wanted to gather some more information.

For those who have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or the agent config? This is a requirement in Consul v1.15.0 and later.

It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
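
For reference, the CLI and HTTP API ways of setting the agent token look roughly like this (addresses and token values are placeholders):

# CLI (needs an ACL token with acl:write, e.g. via CONSUL_HTTP_TOKEN)
consul acl set-agent-token agent "<agent-token-uuid>"

# HTTP API equivalent
curl -s --header "X-Consul-Token: <token-with-acl-write>" -X PUT \
  "http://127.0.0.1:8500/v1/agent/token/agent" \
  -d '{"Token": "<agent-token-uuid>"}'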

@dani
Author

dani commented Apr 3, 2023

In my case, I set the token in the config file, like

acl {
  enabled = true
  enable_token_persistence = true
  default_policy = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)

@ngcmac

ngcmac commented Apr 5, 2023

We are also setting it via consul config:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

@suikast42
Contributor

This issue is still present in Nomad v1.5.3.

@CarelvanHeerden

CarelvanHeerden commented Apr 13, 2023

An observation from my side.

I created this bash script to clean up the services that were not deregistered from Consul, as an interim solution.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500"  # no trailing slash; the API paths below begin with /v1
CONSUL_TOKEN="XXXX"

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

This is still on Nomad 1.5.2.
Upgrading to 1.5.3 now to test again.

@suikast42
Contributor

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue does not belong to Consul. If you restart the Nomad service, the dead services disappear from both Nomad and Consul.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug.

@rgruyters

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent acl tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

@suikast42
Contributor

suikast42 commented Apr 16, 2023

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent acl tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  },
}

I don't have ACLs enabled. I don't think this issue is related to the ACL system.

@bfqrst

bfqrst commented Apr 18, 2023

Not sure if related, but I keep experiencing this without ACLs being turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3...

@icyleaf

icyleaf commented Apr 19, 2023

I had this issue with the nomad service provider without ACLs; the context and details are in #16890.

ubuntu 22.04.2 LTS
nomad 1.5.3
docker 23.0.3

@fredwangwang
Contributor

fredwangwang commented Apr 20, 2023

Encountered this as well; I am able to reproduce it quite reliably with the following sequence:

  1. Restart Alloc

(screenshot)

  2. Immediately after, Stop Alloc (a CLI equivalent is sketched at the end of this comment)

(screenshot)

Using Nomad 1.4.7.

(screenshot)

The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.

Suspecting it could be related to #16289, but I haven't confirmed that yet.

Update: downgraded clients to 1.4.6, and I don't (seem to) see this issue anymore using the above steps.
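
For reference, the same sequence via the CLI is roughly (the alloc ID is a placeholder):

nomad alloc restart <alloc-id>   # restart the allocation
nomad alloc stop <alloc-id>      # then stop it immediately afterwards
# the stale service entries should now be visible in the Consul catalog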

@tgross
Member

tgross commented May 14, 2024

Another way to repro is to do systemctl restart nomad on client(s) - When I run this, service templates are getting messed up.

@blmhemu can you clarify this? What does "messed up" mean here? (Note that #19542 is unrelated, so if the API output is fine, you should take that over to #19542 and it'll get resolved over there.) Never mind, I see what you mean in #18203.

@tgross
Member

tgross commented May 14, 2024

Thanks @linuxoid69! So that looks like it's the Consul agent token configuration. The allowed configuration for that changed a while back, so that reinforces what was described in #16616 (comment).

@tgross
Member

tgross commented May 15, 2024

Hi folks, just a quick update... I still don't have a reproduction for this issue. But by revisiting the code, I see a few places where it's possible to drop deregistrations and places where we could be ensuring data integrity between allocations and their service registrations.

  • When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services (ref node_endpoint.go#L704-L718). This makes it possible for the leader to successfully write the node update to Raft without removing services.

  • When the allocation is stopped, we deregister the service in the alloc runner's PreKill hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload ignores any errors. The RemoveWorkload method in the service client for NSD doesn't retry if there's any issue either; it only logs. The PreKill hook interface doesn't allow errors to block the shutdown of an allocation, but what we can do is give RemoveWorkload a chance to retry recoverable errors from transient network issues (a rough sketch of that retry idea follows this list).

  • When an allocation is marked terminal from the client in UpdateAllocsFromClient, we have an opportunity to ensure data integrity by deregistering services for that allocation. This doesn't quite ensure correct behavior on its own, as we're supposed to be deregistering before the allocation is terminal. But it at least prevents the services from sitting around.

  • When an allocation is deleted (during eval garbage collection, see state_store.go#L3617-L3632), we have the same opportunity to ensure data integrity.
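
As a rough illustration of that retry-on-recoverable-errors idea (not Nomad's internal Go code), here is a bash sketch reusing the Consul catalog deregister call from the cleanup script earlier in this thread; CONSUL_HTTP_ADDR and CONSUL_TOKEN are assumed to be set:

# sketch only: retry a deregistration a few times with backoff so a transient
# network error doesn't leave a stale registration behind
deregister_with_retry() {
  local node="$1" service_id="$2"
  local attempts=5 delay=1
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT \
         "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" \
         -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}" >/dev/null; then
      return 0
    fi
    sleep "${delay}"
    delay=$((delay * 2))
  done
  echo "giving up on ${service_id} after ${attempts} attempts" >&2
  return 1
}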

Without a reliable reproduction that covers everyone's reports I can't guarantee that fixing the above problems will close out this issue for Nomad services forever. But these are at least plausibly involved. I'm going to work up a patch or patch series for these, and should have those up for review in the next few days assuming all goes well.

Thanks again everyone for your patience with this issue!

tgross added a commit that referenced this issue May 15, 2024
This changeset fixes three potential data integrity issues between allocations
and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and
  Consul tokens (for the pre-Workload Identity workflows) after we've written
  the node update to Raft. This is unavoidably non-transactional because the
  Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But
  we've unnecessarily mirrored this same behavior to deregister Nomad
  services. This makes it possible for the leader to successfully write the node
  update to Raft without removing services.

  To address this, move the delete into the same Raft transaction. One minor
  caveat with this approach is the upgrade path: if the leader is upgraded first
  and a node is marked down during this window, older followers will have stale
  information until they are also upgraded. This is unavoidable without
  requiring the leader to unconditionally make an extra Raft write for every
  down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in
  data integrity for stale reads seems like a reasonable tradeoff.

* When an allocation is marked client-terminal from the client in
  `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by
  deregistering services for that allocation.

* When an allocation is deleted during eval garbage collection, we have an
  opportunity to ensure data integrity by deregistering services for that
  allocation. This is a cheap no-op if the allocation has been previously marked
  client-terminal.

This changeset does not address client-side retries for the originally reported
issue, which will be done in a separate PR.

Ref: #16616
@tgross
Member

tgross commented May 15, 2024

I've broken out the data integrity fixes to #20590, and I'll do the client-side work as a separate PR.

tgross pushed a commit that referenced this issue May 15, 2024
When the allocation is stopped, we deregister the service in the alloc runner's
`PreKill` hook. This ensures we delete the service registration and wait for the
shutdown delay before shutting down the tasks, so that workloads can drain their
connections. However, the call to remove the workload only logs errors and never
retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so
that transient errors give us an extra opportunity to deregister the service
before the tasks are stopped, before we need to fall back to the data integrity
improvements implemented in #20590.

Ref: #16616
@tgross
Member

tgross commented May 15, 2024

Client-side work is up in #20596

@tgross
Member

tgross commented May 16, 2024

Ok folks, #20596 and #20590 have both been merged and will ship in the upcoming Nomad 1.8.0 (and supported backport versions). I'm going to close this issue.

"But I saw it again!"

If you see services left behind after an allocation has stopped from Nomad 1.8.0 or beyond, please let us know. We may move reports to a new issue in order to properly triage. Make sure to include the following:

  • Which service provider you're seeing the problem with ("consul" or "nomad")
  • Which version of Nomad you're using
  • Which version of Consul you're using (if using the "consul" service provider)
  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider)
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
  • Client logs for the allocation as it's shutting down
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)

@tgross tgross closed this as completed May 16, 2024
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done May 16, 2024
@natemollica-nm

@tgross Looks like we have an indication of this again in Nomad v1.8.1.

  • Which service provider you're seeing the problem with ("consul" or "nomad")
    • Consul
  • Which version of Nomad you're using
    • v1.8.1+ent
  • Which version of Consul you're using (if using the "consul" service provider)
    • v1.19.0
  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider)
    Nomad Agent Consul Block
    consul {
       address = "127.0.0.1:8500"
       token   = "<CONSUL BOOTSTRAP TOKEN>"
     }
    Consul Agent Config
    ui = true
    log_level = "INFO"
    data_dir = "/opt/hashicorp/consul/data"
    bind_addr = "0.0.0.0"
    client_addr = "0.0.0.0"
    advertise_addr = "x.x.x.18"
    retry_join = ["project_name=dev-nomad provider=gce tag_value=dev-dev1-dc1-nomad-auto-join"]
    datacenter = "dc1"
    
    ## set up logging
    log_file = "/var/log/consul/"  # service creates a log file under this dir
    log_rotate_bytes = 1073741824  # 1GB
    log_rotate_max_files = 1  # keep 1 additional file jic for troubleshooting
    log_json = true
    log_level = "INFO"
    enable_syslog = false
    
    acl {
       enabled = true
       default_policy = "deny"
       down_policy = "extend-cache"
    }
    
    connect {
     enabled = true
    }
    ports {
     grpc = 8502
    }
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")
    • Deploy Nomad job with 3 instance service count using Consul as service provider
    • Update the Job or Remove the Job
    • Redeploy the Nomad job and observe Consul's service catalog: the older instances were never properly deregistered (a quick catalog query for this is sketched at the end of this comment).
  • Nomad Client logs for the allocation as it's shutting down
    • Will send separately via Slack and ZD
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)
    • Sending separately via Slack and ZD

Will work on getting a reproduction on the side to deliver if needed; just let me know in Slack. Wanted to post here to keep track.
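
A quick way to check for the lingering instances after such a redeploy (address, token, and service name are placeholders):

curl -s --header "X-Consul-Token: <token>" \
  "http://127.0.0.1:8500/v1/health/service/<service-name>" |
  jq '.[] | {ServiceID: .Service.ID, Node: .Node.Node, Checks: [.Checks[].Status]}'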

@ngcmac

ngcmac commented Jul 4, 2024

Hi @tgross ,

We also still have this issue in our CI environment (1 server nomad + consul + vault + 3 clients).
We started to observe an increase in this after we enabled Workload Identity (WI) for Consul.

  • Which service provider you're seeing the problem with ("consul" or "nomad"): Consul

  • Which version of Nomad you're using: 1.8.1

  • Which version of Consul you're using (if using the "consul" service provider): 1.18.1

  • Your Nomad agent's consul config block and Consul agent's config (if using the "consul" service provider):

    • Nomad:
    {
      "consul": {
        "address": "127.0.0.1:8501",
        "grpc_address": "127.0.0.1:8503",
        "auto_advertise": true,
        "server_auto_join": true,
        "client_auto_join": true,
        "token": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
        "ssl": true,
        "ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "cert_file": "/opt/consul/certs/consul-server-cert.pem",
        "key_file": "/opt/consul/certs/consul-server-key.pem",
        "grpc_ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "verify_ssl": false
      }
    }
    
    • Consul (default config + acl):
    "node_name": "<node_name>",
    "addresses":{
      "http": "0.0.0.0",
      "https": "0.0.0.0",
      "dns": "0.0.0.0",
      "grpc_tls": "0.0.0.0"
    },
    "data_dir": "/opt/consul/data",
    "log_level": "INFO",
    "datacenter": "dc1",
    "bind_addr": "<ip_addr>",
    "advertise_addr": "<ip_addr>",
    "ports": {
      "server": 8300,
      "serf_lan": 8301,
      "serf_wan": 8302,
      "http": 8500,
      "https": 8501,
      "grpc_tls": 8503,
      "dns": 8600,
      "sidecar_min_port": 21000,
      "sidecar_max_port": 21255,
      "expose_min_port": 21500,
      "expose_max_port":  21755
    },
    "server": false,
    "retry_join": ["<consul_server>"],
    "rejoin_after_leave": true,
    "connect": {
      "enabled": true
    },
    "encrypt": "<some_secret>",
    "tls": {
      "defaults": {
        "ca_file": "/opt/consul/certs/consul-ca-cert.pem",
        "cert_file": "/opt/consul/certs/consul-server-cert.pem",
        "key_file": "/opt/consul/certs/consul-server-key.pem",
        "verify_incoming": true,
        "verify_outgoing": true
      },
      "https": {
        "verify_incoming": true
      },
      "internal_rpc": {
        "verify_server_hostname": true,
        "verify_incoming": true
      },
      "grpc": {
        "verify_incoming": false
      }
    },
    "auto_encrypt": {
      "tls": true
    },
    "limits": {
      "http_max_conns_per_client": 8000,
      "https_handshake_timeout": "15s",
      "rpc_max_conns_per_client": 400,
      "rpc_handshake_timeout": "15s"
    }
    
    "acl": {
      "enabled": true,
      "tokens": {
        "agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      },
      "default_policy": "deny",
      "down_policy": "extend-cache",
      "enable_token_persistence": true
    }
    
  • The sequence of events as much as you know (ex. "the Nomad agent was restarted, then I submitted a new version of the job and the allocation was replaced")

    • A Nomad job is submitted and the service is registered in Consul during startup in an unhealthy state. For some reason (missing variables, secrets, a code problem, etc.), the allocation becomes unhealthy on Nomad and restarts. The allocation keeps restarting according to the restart policy. When the allowed attempts are exceeded, Nomad "kills" the job but is unable to deregister the service from Consul. We can see "ACL not found" errors on the Consul client. We also see an increase in the nomad.client.consul.sync_failure metric (it was 0 before). I need to use a management token to manually deregister it, or restart the Nomad node where the allocation was started in the first place and got unhealthy (a sketch of the manual deregistration is at the end of this comment). This keeps repeating over and over while the alloc restarts (and may happen on different nodes, since we use "SchedulerAlgorithm": "spread").
  • Client logs for the allocation as it's shutting down (in this case a new deploy of an existing job is submitted with 3 allowed restarts)

Nomad:
2024-07-04T08:13:06.785415+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.785Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Received msg="Task received by client" failed=false
2024-07-04T08:13:06.820716+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.820Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Task Setup" msg="Building Task Directory" failed=false
2024-07-04T08:13:06.888908+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:06.888Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
2024-07-04T08:13:09.227594+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.227Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2024-07-04T08:13:09.228156+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.228Z [WARN]  agent: (clients) disabling consul SSL verification
2024-07-04T08:13:09.228795+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.228Z [INFO]  agent: (runner) creating watcher
2024-07-04T08:13:09.229282+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.229Z [INFO]  agent: (runner) starting
2024-07-04T08:13:09.250898+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.250Z [WARN]  agent: (runner) watching 186 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.291654+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.291Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logback.xml"
2024-07-04T08:13:09.316109+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.315Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/logrotate"
2024-07-04T08:13:09.344981+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.344Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/envs"
2024-07-04T08:13:09.351655+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.351Z [WARN]  agent: (runner) watching 187 dependencies - watching this many dependencies could DDoS your servers
2024-07-04T08:13:09.380496+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.380Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/backoffice-parking-consumer.conf"
2024-07-04T08:13:09.494995+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.494Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/30f0b098-79ae-c813-9e38-4957fd2f336e/backoffice-parking-consumer/local/main.properties"
2024-07-04T08:13:09.674300+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:09.674Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:13:16.995273+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:16.995Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.440496+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:17.439Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=70f26f726a384f5be291ccd03e6cc5a73d48fd40f640ae622d168821e30415fe
2024-07-04T08:13:17.493295+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:17.493Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:13:36.986712+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:36.986Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:13:36.993466+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:36.993Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2596806
2024-07-04T08:13:37.006807+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:37.006Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.272860626s
2024-07-04T08:13:37.007018+00:00 <node_name> nomad[4008547]:     2024-07-04T08:13:37.006Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.272860626s" failed=fals
2024-07-04T08:14:12.446927+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:12.446Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:14:12.907417+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:12.907Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.405087+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:13.404Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=dbbab1efe68e172a7e0f56319fad010db22b626ce93f730dd11487e963ed6c8a
2024-07-04T08:14:13.463133+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:13.463Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:14:32.409339+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.408Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:14:32.416647+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.416Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2604407
2024-07-04T08:14:32.428570+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.428Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=35.718148757s
2024-07-04T08:14:32.428710+00:00 <node_name> nomad[4008547]:     2024-07-04T08:14:32.428Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 35.718148757s" failed=false
2024-07-04T08:15:08.255694+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:08.254Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:15:08.761774+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:08.761Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.206826+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:09.206Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=68ba996f95abdf00048c15a0d06cb327341915ed2217cfac09684309772d5cc2
2024-07-04T08:15:09.257945+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:09.257Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:15:27.751406+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.751Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:15:27.758280+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.758Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2611952
2024-07-04T08:15:27.769543+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.769Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Restart within policy" delay=32.650043057s
2024-07-04T08:15:27.769613+00:00 <node_name> nomad[4008547]:     2024-07-04T08:15:27.769Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Restarting msg="Task restarting in 32.650043057s" failed=false
2024-07-04T08:16:00.454924+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:00.454Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Driver msg="Downloading image" failed=false
2024-07-04T08:16:00.873403+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:00.872Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.344882+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:01.344Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=895827c73be9faa3e76bf68ca615c445307ca3cf8b6f0db1d94baeee20fdd920
2024-07-04T08:16:01.406204+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:01.406Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Started msg="Task started by client" failed=false
2024-07-04T08:16:19.587339+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.587Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type=Terminated msg="Exit Code: 0" failed=false
2024-07-04T08:16:19.593708+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.593Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/opt/nomad/bin/nomad id=2618891
2024-07-04T08:16:19.604943+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.604Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer reason="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\""
2024-07-04T08:16:19.605031+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.604Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer type="Not Restarting" msg="Exceeded allowed attempts 3 in interval 30m0s and mode is \"fail\"" failed=true
2024-07-04T08:16:19.614604+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.614Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e task=backoffice-parking-consumer plugin=/opt/nomad/bin/nomad id=2595171
2024-07-04T08:16:19.614846+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.614Z [INFO]  agent: (runner) stopping
2024-07-04T08:16:19.616711+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.616Z [INFO]  client.gc: marking allocation for GC: alloc_id=30f0b098-79ae-c813-9e38-4957fd2f336e
2024-07-04T08:16:19.616802+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.616Z [INFO]  agent: (runner) received finish
2024-07-04T08:16:19.639835+00:00 <node_name> nomad[4008547]:     2024-07-04T08:16:19.637Z [WARN]  consul.sync: failed to update services in Consul: error="Unexpected response code: 403 (ACL not found)"
Consul:
2024-07-04T08:13:06.832454+00:00 <node_name> consul[4075309]: 2024-07-04T08:13:06.832Z [INFO]  agent: Synced service: service=_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer-
2024-07-04T08:16:19.637477+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:19.639781+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:19.637Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:20.643069+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:20.643185+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:20.642Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:22.649259+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:22.649380+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:22.648Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:25.657196+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:25.657348+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:25.656Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:29.664423+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:29.664523+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:29.664Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:34.670396+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:34.670535+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:34.669Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:40.676407+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:40.676589+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:40.675Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:47.682542+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.681Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:47.682691+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:47.682Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
2024-07-04T08:16:55.689246+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=<server_ip>:8300 error="rpc error making call: token does not exist: ACL not found"
2024-07-04T08:16:55.689333+00:00 <node_name> consul[4075309]: 2024-07-04T08:16:55.688Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer- from=127.0.0.1:35434 error="ACL not found"
  • Any server logs that include errors around the deregistration process or other triggering event (ex. if the node was marked "down", provide any logs you have around that event too)
  • Nothing with "log_level": "INFO"
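
For reference, the manual workaround mentioned in the sequence of events above looks roughly like this (the agent address and management token are placeholders; the service ID is the one from the logs):

curl -s --header "X-Consul-Token: <management-token>" -X PUT \
  "http://127.0.0.1:8500/v1/agent/service/deregister/_nomad-task-30f0b098-79ae-c813-9e38-4957fd2f336e-group-backoffice-parking-consumer-backoffice-parking-consumer-"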

Thanks

@tgross
Member

tgross commented Jul 8, 2024

@ngcmac I'm going to continue the investigation for the case where we're getting "ACL not found" with Workload Identities in #23494. @natemollica-nm is going to do some follow-up on his report and we may or may not adjust #23494 to cover that case as well depending on the outcome.
