
Services not unregistered from Consul #17079

Closed
suikast42 opened this issue May 4, 2023 · 44 comments

@suikast42
Contributor

suikast42 commented May 4, 2023

The issue #16616 is marked as solved, but I have tested with Nomad 1.5.1 - 1.5.4 and the same issue is present in all of these versions.

Dead alloc with a different port than the previous alloc
image

Dead alloc with the same port as the previous alloc
image

If I run the script below, the unhealthy dead allocs disappear from Consul but are registered again by Nomad after a few seconds.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500" # no trailing slash; the request paths below already start with /v1
CONSUL_TOKEN=XXXX

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done
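
A companion check (a hedged sketch, not part of the original script): the leftover registrations all follow the _nomad-task-<alloc-id>-... naming visible in the logs below, so the local Consul agent can be queried for them directly.

#!/bin/bash
# List Nomad-registered services still known to the local Consul agent.
# Assumes the same CONSUL_TOKEN as above and an agent reachable on 127.0.0.1:8500.
curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" \
  "http://127.0.0.1:8500/v1/agent/services" \
  | jq -r 'keys[] | select(startswith("_nomad-task-"))'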

EDIT: I stopped the job with nomad job stop --purge. Even then, Nomad registers all the dead allocs again.
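
For reference, the purge step is a single command (shown here as a sketch, assuming the affected job is observability, as in the job spec further down):

# Stop the job and purge it from Nomad's state
nomad job stop -purge observability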

All allocs are no longer present after purging the job

image

Nomad still tries to check the dead alloc

image

Edit 2:

Restarting the Nomad service on the worker after boot is a workaround, but definitely not one for production.
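
Spelled out, that workaround is simply (assuming Nomad runs as the systemd unit nomad.service, as the logs suggest):

# Restart the Nomad client agent on the affected worker
sudo systemctl restart nomad.service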

@shoenig shoenig self-assigned this May 4, 2023
@suikast42
Contributor Author

suikast42 commented May 4, 2023

Here is a full log of a zombie alloc

Common labels: {"host_hostname":"worker-01","host_id":"ceacb99587e34bcc840bc7a7cc0d4453","host_name":"worker-01","ingress":"ingress.logs.journald","job_type":"daemon","labels_host_id":"ceacb99587e34bcc840bc7a7cc0d4453","labels_host_name":"worker-01","labels_ingress":"ingress.logs.journald","labels_job_type":"daemon","labels_used_grok":"TsLevelMsg","namespace":"NotDefined","processError":"false","stack":"NotDefined","task":"NotDefined","task_group":"NotDefined","used_grok":"TsLevelMsg"}
Line limit: "2100 (87 returned)"
Total bytes processed: "2.87  MB"


2023-05-04T22:55:42+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:55:22+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:54:22+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:54:02+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:53:02+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:52:42+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:51:42+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:51:22+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:50:21+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:50:01+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:49:01+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:48:41+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:47:41+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:47:21+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:46:21+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:46:01+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/usr/local/bin/nomad pid=15192      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner: task run loop exiting: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.gc: marking allocation for GC: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo err="rpc error: code = Unavailable desc = error reading from server: EOF"      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: not restarting task: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo reason="Exceeded allowed attempts 1 in interval 1h0m0s and mode is \"fail\""      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type="Not Restarting" msg="Exceeded allowed attempts 1 in interval 1h0m0s and mode is \"fail\"" failed=true      
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false      
2023-05-04T22:45:01+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type="Restart Signaled" msg="healthcheck: check \"health\" unhealthy" failed=false      
2023-05-04T22:45:01+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:44:41+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Started msg="Task started by client" failed=false      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: binding directories: driver=docker task_name=tempo binds="[]string{\"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc:/alloc\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local:/local\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/secrets:/secrets\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml:/config/tempo.yaml\"}"      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: setting container name: driver=docker task_name=tempo container_name=tempo-c6e58956-f87d-7750-6008-2a0336c02835      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=tempo labels="map[com.github.logunifier.application.name:tempo com.github.logunifier.application.pattern.key:logfmt com.github.logunifier.application.version:2.1.1 com.hashicorp.nomad.alloc_id:c6e58956-f87d-7750-6008-2a0336c02835 com.hashicorp.nomad.job_id:observability com.hashicorp.nomad.job_name:observability com.hashicorp.nomad.namespace:default com.hashicorp.nomad.node_id:2b58d2b0-22e5-eab1-066c-5c2c1cdfc1da com.hashicorp.nomad.node_name:worker-01 com.hashicorp.nomad.task_group_name:tempo com.hashicorp.nomad.task_name:tempo]"      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo @module=logmon path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stdout.fifo timestamp=2023-05-04T20:43:51.611Z      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stderr.fifo @module=logmon timestamp=2023-05-04T20:43:51.611Z      
2023-05-04T22:43:51+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo      
2023-05-04T22:43:46+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Restarting msg="Task restarting in 5.56084366s" failed=false      
2023-05-04T22:43:46+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: restarting task: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo reason="Restart within policy" delay=5.56084366s      
2023-05-04T22:43:46+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false      
2023-05-04T22:43:41+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-grpc-otlp_grpc      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-jaeger-jaeger      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-http-otlp_http      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-zipkin-zipkin      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-tempo      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Deregistered service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-tempo      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Deregistered service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-http-otlp_http      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Deregistered service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-zipkin-zipkin      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Deregistered service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-jaeger-jaeger      
2023-05-04T22:43:40+02:00	[consul.service 💻 worker-01] [✅]  agent: Deregistered service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-grpc-otlp_grpc      
2023-05-04T22:43:40+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type="Restart Signaled" msg="healthcheck: check \"health\" unhealthy" failed=false      
2023-05-04T22:43:40+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:43:20+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Started msg="Task started by client" failed=false      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: setting container name: driver=docker task_name=tempo container_name=tempo-c6e58956-f87d-7750-6008-2a0336c02835      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=tempo labels="map[com.github.logunifier.application.name:tempo com.github.logunifier.application.pattern.key:logfmt com.github.logunifier.application.version:2.1.1 com.hashicorp.nomad.alloc_id:c6e58956-f87d-7750-6008-2a0336c02835 com.hashicorp.nomad.job_id:observability com.hashicorp.nomad.job_name:observability com.hashicorp.nomad.namespace:default com.hashicorp.nomad.node_id:2b58d2b0-22e5-eab1-066c-5c2c1cdfc1da com.hashicorp.nomad.node_name:worker-01 com.hashicorp.nomad.task_group_name:tempo com.hashicorp.nomad.task_name:tempo]"      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: binding directories: driver=docker task_name=tempo binds="[]string{\"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc:/alloc\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local:/local\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/secrets:/secrets\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml:/config/tempo.yaml\"}"      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stderr.fifo @module=logmon timestamp=2023-05-04T20:43:11.048Z      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo @module=logmon path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stdout.fifo timestamp=2023-05-04T20:43:11.048Z      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: restarting task: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo reason="" delay=0s      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Restarting msg="Task restarting in 0s" failed=false      
2023-05-04T22:43:11+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false      
2023-05-04T22:43:05+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type="Restart Signaled" msg="Template with change_mode restart re-rendered" failed=false      
2023-05-04T22:43:02+02:00	[nomad.service 💻 worker-01] [✅]  agent: (runner) rendered "(dynamic)" => "/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml"      
2023-05-04T22:43:02+02:00	[nomad.service 💻 worker-01] [🐞] agent: (runner) rendering "(dynamic)" => "/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml"      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Started msg="Task started by client" failed=false      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: binding directories: driver=docker task_name=tempo binds="[]string{\"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc:/alloc\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local:/local\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/secrets:/secrets\", \"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml:/config/tempo.yaml\"}"      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=tempo labels="map[com.github.logunifier.application.name:tempo com.github.logunifier.application.pattern.key:logfmt com.github.logunifier.application.version:2.1.1 com.hashicorp.nomad.alloc_id:c6e58956-f87d-7750-6008-2a0336c02835 com.hashicorp.nomad.job_id:observability com.hashicorp.nomad.job_name:observability com.hashicorp.nomad.namespace:default com.hashicorp.nomad.node_id:2b58d2b0-22e5-eab1-066c-5c2c1cdfc1da com.hashicorp.nomad.node_name:worker-01 com.hashicorp.nomad.task_group_name:tempo com.hashicorp.nomad.task_name:tempo]"      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.driver_mgr.docker: setting container name: driver=docker task_name=tempo container_name=tempo-c6e58956-f87d-7750-6008-2a0336c02835      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [✅]  agent: (runner) rendered "(dynamic)" => "/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml"      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] agent: (runner) rendering "(dynamic)" => "/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml"      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] agent: (runner) final config: {"Consul":{"Address":"127.0.0.1:8501","Namespace":"","Auth":{"Enabled":false,"Username":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem","CaPath":"","Cert":"/etc/opt/certs/consul/consul.pem","Enabled":true,"Key":"/etc/opt/certs/consul/consul-key.pem","ServerName":"","Verify":true},"Token":"","TokenFile":"","Transport":{"CustomDialer":null,"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":13,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000,"BlockQueryWaitTime":60000000000},"DefaultDelims":{"Left":null,"Right":null},"Exec":{"Command":[],"Enabled":false,"Env":{"Denylist":[],"Custom":[],"Pristine":false,"Allowlist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","FileLog":{"LogFilePath":"","LogRotateBytes":0,"LogRotateDuration":86400000000000,"LogRotateMaxFiles":0},"MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0","Name":"consul-template"},"Templates":[{"Backup":false,"Command":[],"CommandTimeout":30000000000,"Contents":"multitenancy_enabled: false\n\nserver:\n  http_listen_port: 3200\n\ndistributor:\n  receivers:                           # this configuration will listen on all ports and protocols that tempo is capable of.\n    jaeger:                            # the receives all come from the OpenTelemetry collector.  more configuration information can\n      protocols:                       # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver\n        thrift_http:                   #\n        grpc:                          # for a production deployment you should only enable the receivers you need!\n        thrift_binary:\n        thrift_compact:\n    zipkin:\n    otlp:\n      protocols:\n        http:\n        grpc:\n    opencensus:\n\ningester:\n  trace_idle_period: 10s               # the length of time after a trace has not received spans to consider it complete and flush it\n  max_block_bytes: 1_000_000           # cut the head block when it hits this size or ...\n  max_block_duration: 5m               #   this much time passes\n\ncompactor:\n  compaction:\n    compaction_window: 1h              # blocks in this time window will be compacted together\n    max_block_bytes: 100_000_000       # maximum size of compacted blocks\n    block_retention: 24h               # Duration to keep blocks 1d\n\nmetrics_generator:\n  registry:\n    external_labels:\n      source: tempo\n      cluster: nomadder1\n  storage:\n    path: /data/generator/wal\n    remote_write:\n++- range service \"mimir\" ++\n      - url: http://++.Name++.service.consul:++.Port++/api/v1/push\n        send_exemplars: true\n        headers:\n          x-scope-orgid: 1\n++- end ++\n\nstorage:\n  trace:\n    backend: local                     # backend configuration to use\n    block:\n      bloom_filter_false_positive: .05 # bloom filter false positive rate.  
lower values create larger filters but fewer false positives\n    wal:\n      path: /data/wal             # where to store the the wal locally\n    local:\n      path: /data/blocks\n    pool:\n      max_workers: 100                 # worker pool determines the number of parallel requests to the object store backend\n      queue_depth: 10000\n\nquery_frontend:\n  search:\n    # how to define year here ? define 5 years\n    max_duration: 43800h\n\noverrides:\n  metrics_generator_processors: [service-graphs, span-metrics]","CreateDestDirs":true,"Destination":"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo/local/tempo.yaml","ErrMissingKey":false,"ErrFatal":true,"Exec":{"Command":[],"Enabled":false,"Env":{"Denylist":[],"Custom":[],"Pristine":false,"Allowlist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"User":null,"Uid":null,"Group":null,"Gid":null,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"++","RightDelim":"++","FunctionDenylist":["plugin","writeToFile"],"SandboxPath":"/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/tempo"}],"TemplateErrFatal":null,"Vault":{"Address":"","Enabled":false,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":true,"Key":"","ServerName":"","Verify":true},"Transport":{"CustomDialer":null,"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":13,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false,"DefaultLeaseDuration":300000000000,"LeaseRenewalThreshold":0.9,"K8SAuthRoleName":"","K8SServiceAccountTokenPath":"/run/secrets/kubernetes.io/serviceaccount/token","K8SServiceAccountToken":"","K8SServiceMountPath":"kubernetes"},"Nomad":{"Address":"","Enabled":true,"Namespace":"default","SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"AuthUsername":"","AuthPassword":"","Transport":{"CustomDialer":{},"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":13,"TLSHandshakeTimeout":10000000000},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true}},"Wait":{"Enabled":false,"Min":0,"Max":0},"Once":false,"ParseOnly":false,"BlockQueryWaitTime":60000000000}      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo @module=logmon path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stderr.fifo timestamp=2023-05-04T20:42:20.349Z      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo @module=logmon path=/opt/services/core/nomad/data/alloc/c6e58956-f87d-7750-6008-2a0336c02835/alloc/logs/.tempo.stdout.fifo timestamp=2023-05-04T20:42:20.348Z      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo network=unix @module=logmon address=/tmp/plugin3016923748 timestamp=2023-05-04T20:42:20.346Z      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo version=2      
2023-05-04T22:42:20+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-grpc-otlp_grpc      
2023-05-04T22:42:20+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-jaeger-jaeger      
2023-05-04T22:42:20+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-otlp-http-otlp_http      
2023-05-04T22:42:20+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-zipkin-zipkin      
2023-05-04T22:42:20+02:00	[consul.service 💻 worker-01] [✅]  agent: Synced service: service=_nomad-task-c6e58956-f87d-7750-6008-2a0336c02835-group-tempo-tempo-tempo      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/usr/local/bin/nomad args=["/usr/local/bin/nomad", "logmon"]      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/usr/local/bin/nomad      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo path=/usr/local/bin/nomad pid=15192      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type="Task Setup" msg="Building Task Directory" failed=false      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo      
2023-05-04T22:42:20+02:00	[nomad.service 💻 worker-01] [🐞] client.alloc_runner.runner_hook: received result from CNI: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 result="{\"Interfaces\":{\"eth0\":{\"IPConfigs\":[{\"IP\":\"172.26.73.193\",\"Gateway\":\"172.26.64.1\"}],\"Mac\":\"9a:48:86:86:5f:55\",\"Sandbox\":\"/var/run/docker/netns/14d92a9468e4\"},\"nomad\":{\"IPConfigs\":null,\"Mac\":\"b6:c6:09:46:cb:55\",\"Sandbox\":\"\"},\"veth0509915d\":{\"IPConfigs\":null,\"Mac\":\"9a:a1:fb:4d:c9:87\",\"Sandbox\":\"\"}},\"DNS\":[{}],\"Routes\":[{\"dst\":\"0.0.0.0/0\"}]}"      
2023-05-04T22:42:18+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Received msg="Task received by client" failed=false      

@suikast42
Contributor Author

I updated to Nomad 1.5.5, but the bug is still present.

suikast42 pushed a commit to suikast42/nomadder that referenced this issue May 6, 2023
@suikast42
Contributor Author

Ok, I have put in a little effort to get to the bottom of this.

I enabled drain_on_shutdown on the agent node with:

  drain_on_shutdown {
    deadline           = "2m"
    force              = false
    ignore_system_jobs = true
  }

But that doesn't make the situation better. My expectation is that drain_on_shutdown drains all allocs and then shuts down the agent, but it does not honor the 2m deadline.

My second approach is to call a script via a systemd ExecStop hook.

ExecStop={{nomad_conf_dir}}/nomad_node_drain.sh

nomad_node_drain.sh

#!/bin/bash
# Drain this node unless the "notdrain" marker file is present.
if [ ! -f "/home/{{ansible_user}}/notdrain" ]; then
  nomad node drain -enable -self -deadline "2m" -m "Node shutdown" -yes \
    -address=https://localhost:4646 \
    -ca-cert=/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem \
    -client-cert=/etc/opt/certs/nomad/nomad.pem \
    -client-key=/etc/opt/certs/nomad/nomad-key.pem
fi

With that approach I have never seen the dead allocations. But I must start the master node first; otherwise the issue is the same.

The approach of draining the node via the script comes with the caveat that the node is not eligible after boot.

My workaround for making the node eligible again is described here:

#17093
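
(For illustration only, the author's actual workaround is described in the linked issue: re-enabling scheduling eligibility on a drained node usually comes down to a command like the one below, with the same TLS flags as the drain script.)

# Mark the local node as eligible for scheduling again after boot
nomad node eligibility -enable -self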

@tgross
Member

tgross commented May 8, 2023

@suikast42 still the wrong issue, I think. Please pay attention to where you're posting.

@suikast42
Contributor Author

> @suikast42 still the wrong issue, I think. Please pay attention to where you're posting.

IMHO not. This solves my problem in particular. Which issue are you suggesting?

@shoenig
Member

shoenig commented May 8, 2023

It is interesting in those logs that we see

(reverse chronological)

2023-05-04T22:55:42+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: restarting due to unhealthy check: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo      
2023-05-04T22:55:22+02:00	[nomad.service 💻 worker-01] [🐞] watch.checks: check became unhealthy. Will restart if check doesn't become healthy: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 check=health task=group-tempo time_limit=20s
...
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.alloc_runner.task_runner: Task event: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835 task=tempo type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false      
...
2023-05-04T22:45:06+02:00	[nomad.service 💻 worker-01] [✅]  client.gc: marking allocation for GC: alloc_id=c6e58956-f87d-7750-6008-2a0336c02835

with nothing following the Killing task event for ~10 minutes, which makes me wonder if a hook is getting stuck.

@suikast42 any chance you can produce a goroutine dump of the Nomad process when one of them gets into this state? And can you provide as much of a real job spec as you can?

@suikast42
Contributor Author

> And can you provide as much of a real job spec as you can?

If you tell me how, sure ✌

@shoenig
Member

shoenig commented May 8, 2023

You can just use nomad job inspect <jobID> to dump the JSON representation of the job.

For the goroutine dump, just find the nomad process and send SIGQUIT

➜ ps -ef | grep nomad
...
root      129259  129258  1 15:35 pts/5    00:00:03 nomad agent -config=...
➜ sudo kill -SIGQUIT 129259

The standard out/err logs of the Nomad client should then contain a whole bunch of goroutine stack trace information. That should at least help us know if a hook is stuck waiting on something.
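
If the client runs under systemd (as the nomad.service prefix in the logs above suggests), one way to capture both the signal and the resulting dump is via the journal; a sketch, assuming the unit is named nomad.service:

# Send SIGQUIT only to the process whose command line is 'nomad agent ...'
sudo pkill -QUIT -f 'nomad agent'
# Pull the goroutine stack traces out of the journal
journalctl -u nomad.service --since "5 min ago" > nomad-goroutines.txt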

@suikast42
Contributor Author

{
    "Job": {
        "Affinities": null,
        "AllAtOnce": false,
        "Constraints": null,
        "ConsulNamespace": "",
        "ConsulToken": "",
        "CreateIndex": 97,
        "Datacenters": [
            "nomadder1"
        ],
        "DispatchIdempotencyToken": "",
        "Dispatched": false,
        "ID": "observability",
        "JobModifyIndex": 97,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 404,
        "Multiregion": null,
        "Name": "observability",
        "Namespace": "default",
        "NomadTokenID": "",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "global",
        "Reschedule": null,
        "Spreads": null,
        "Stable": true,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1683559547429931764,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "grafana",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "ui",
                                "To": 3000,
                                "Value": 0
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "connect-proxy-grafana",
                                "To": -1,
                                "Value": 0
                            }
                        ],
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": null
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "health",
                                "OnUpdate": "require_healthy",
                                "Path": "/healthz",
                                "PortLabel": "ui",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": {
                            "Gateway": null,
                            "Native": false,
                            "SidecarService": {
                                "DisableDefaultTCPCheck": false,
                                "Meta": null,
                                "Port": "",
                                "Proxy": null,
                                "Tags": null
                            },
                            "SidecarTask": {
                                "Config": {
                                    "labels": [
                                        {
                                            "com.github.logunifier.application.pattern.key": "envoy"
                                        }
                                    ]
                                },
                                "Driver": "",
                                "Env": null,
                                "KillSignal": "",
                                "KillTimeout": 5000000000,
                                "LogConfig": {
                                    "Disabled": false,
                                    "Enabled": null,
                                    "MaxFileSizeMB": 10,
                                    "MaxFiles": 10
                                },
                                "Meta": null,
                                "Name": "",
                                "Resources": {
                                    "CPU": 100,
                                    "Cores": 0,
                                    "Devices": null,
                                    "DiskMB": 0,
                                    "IOPS": 0,
                                    "MemoryMB": 300,
                                    "MemoryMaxMB": 0,
                                    "Networks": null
                                },
                                "ShutdownDelay": 0,
                                "User": ""
                            }
                        },
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "grafana",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "3000",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": [
                            "traefik.enable=true",
                            "traefik.consulcatalog.connect=true",
                            "traefik.http.routers.grafana.tls=true",
                            "traefik.http.routers.grafana.rule=Host(`grafana.cloud.private`)"
                        ],
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "ports": [
                                "ui"
                            ],
                            "image": "registry.cloud.private/stack/observability/grafana:9.5.1.0",
                            "labels": [
                                {
                                    "com.github.logunifier.application.pattern.key": "logfmt",
                                    "com.github.logunifier.application.version": "9.5.1.0",
                                    "com.github.logunifier.application.name": "grafana"
                                }
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "GF_AUTH_OAUTH_AUTO_LOGIN": "true",
                            "GF_PATHS_CONFIG": "/etc/grafana/grafana2.ini",
                            "GF_PATHS_PLUGINS": "/data/grafana/plugins",
                            "GF_SERVER_DOMAIN": "grafana.cloud.private",
                            "GF_SERVER_ROOT_URL": "https://grafana.cloud.private",
                            "GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH": "contains(realm_access.roles[*], 'admin') && 'GrafanaAdmin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
                        },
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "grafana",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "MemoryMaxMB": 4096,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "${NOMAD_SECRETS_DIR}/env.vars",
                                "EmbeddedTmpl": "          {{ with nomadVar \"nomad/jobs/observability\" }}\n            GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET    = {{.keycloak_secret_observability_grafana}}\n          {{ end }}\n",
                                "Envvars": true,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/var/lib/grafana",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_grafana_volume"
                            }
                        ]
                    },
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "labels": [
                                {
                                    "com.github.logunifier.application.pattern.key": "envoy"
                                }
                            ],
                            "image": "${meta.connect.sidecar_image}",
                            "args": [
                                "-c",
                                "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
                                "-l",
                                "${meta.connect.log_level}",
                                "--concurrency",
                                "${meta.connect.proxy_concurrency}",
                                "--disable-hot-restart"
                            ]
                        },
                        "Constraints": [
                            {
                                "LTarget": "${attr.consul.version}",
                                "Operand": "semver",
                                "RTarget": ">= 1.8.0"
                            },
                            {
                                "LTarget": "${attr.consul.grpc}",
                                "Operand": ">",
                                "RTarget": "0"
                            }
                        ],
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "connect-proxy:grafana",
                        "Leader": false,
                        "Lifecycle": {
                            "Hook": "prestart",
                            "Sidecar": true
                        },
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "connect-proxy-grafana",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 300,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "stack_observability_grafana_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_grafana_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_grafana_volume",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "mimir",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": null,
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "api",
                                "To": 9009,
                                "Value": 9009
                            }
                        ]
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "health",
                                "OnUpdate": "require_healthy",
                                "Path": "/ready",
                                "PortLabel": "api",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "mimir",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "api",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "args": [
                                "-config.file",
                                "/config/mimir.yaml",
                                "-config.expand-env",
                                "true"
                            ],
                            "image": "registry.cloud.private/grafana/mimir:2.8.0",
                            "labels": [
                                {
                                    "com.github.logunifier.application.version": "2.8.0",
                                    "com.github.logunifier.application.name": "mimir",
                                    "com.github.logunifier.application.pattern.key": "logfmt"
                                }
                            ],
                            "ports": [
                                "api"
                            ],
                            "volumes": [
                                "local/mimir.yml:/config/mimir.yaml"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "JAEGER_ENDPOINT": "http://tempo-jaeger.service.consul:14268/api/traces?format=jaeger.thrift",
                            "JAEGER_REPORTER_LOG_SPANS": "true",
                            "JAEGER_SAMPLER_PARAM": "1",
                            "JAEGER_SAMPLER_TYPE": "const",
                            "JAEGER_TRACEID_128BIT": "true"
                        },
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "mimir",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "MemoryMaxMB": 32768,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "local/mimir.yml",
                                "EmbeddedTmpl": "\n# Test ++ env \"NOMAD_ALLOC_NAME\"  ++\n# Do not use this configuration in production.\n# It is for demonstration purposes only.\n\n# Run Mimir in single process mode, with all components running in 1 process.\ntarget: all,alertmanager,overrides-exporter\n# Disable tendency support.\nmultitenancy_enabled: false\n\nserver:\n  http_listen_port: 9009\n  log_level: debug\n  # Configure the server to allow messages up to 100MB.\n  grpc_server_max_recv_msg_size: 104857600\n  grpc_server_max_send_msg_size: 104857600\n  grpc_server_max_concurrent_streams: 1000\n\nblocks_storage:\n  backend: filesystem\n  bucket_store:\n    sync_dir: /data/tsdb-sync\n   #ignore_blocks_within: 10h # default 10h\n  filesystem:\n    dir: /data/blocks\n  tsdb:\n    dir: /data/tsdb\n    # Note that changing this requires changes to some other parameters like\n    # -querier.query-store-after,\n    # -querier.query-ingesters-within and\n    # -blocks-storage.bucket-store.ignore-blocks-within.\n    # retention_period: 24h # default 24h\nquerier:\n  # query_ingesters_within: 13h # default 13h\n  #query_store_after: 12h #default 12h\nruler_storage:\n  backend: filesystem\n  filesystem:\n    dir: /data/rules\n\nalertmanager_storage:\n  backend: filesystem\n  filesystem:\n    dir: /data/alarms\n\nfrontend:\n  grpc_client_config:\n    grpc_compression: snappy\n\nfrontend_worker:\n  grpc_client_config:\n    grpc_compression: snappy\n\ningester_client:\n  grpc_client_config:\n    grpc_compression: snappy\n\nquery_scheduler:\n  grpc_client_config:\n    grpc_compression: snappy\n\nalertmanager:\n  data_dir: /data/alertmanager\n#  retention: 120h\n  sharding_ring:\n    replication_factor: 1\n  alertmanager_client:\n    grpc_compression: snappy\n\nruler:\n  query_frontend:\n    grpc_client_config:\n      grpc_compression: snappy\n\ncompactor:\n#  compaction_interval: 1h # default 1h\n#  deletion_delay: 12h # default 12h\n  max_closing_blocks_concurrency: 2\n  max_opening_blocks_concurrency: 4\n  symbols_flushers_concurrency: 4\n  data_dir: /data/compactor\n  sharding_ring:\n    kvstore:\n      store: memberlist\n\n\ningester:\n  ring:\n    replication_factor: 1\n\nstore_gateway:\n  sharding_ring:\n    replication_factor: 1\n\nlimits:\n  # Limit queries to 5 years. You can override this on a per-tenant basis.\n  max_total_query_length: 43680h\n  max_label_names_per_series: 42\n  # Allow ingestion of out-of-order samples up to 2 hours since the latest received sample for the series.\n  out_of_order_time_window: 1d\n  # delete old blocks from long-term storage.\n  # Delete from storage metrics data older than 1d.\n  compactor_blocks_retention_period: 1d\n  ingestion_rate: 100000",
                                "Envvars": false,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "++",
                                "Perms": "0644",
                                "RightDelim": "++",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/data",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_mimir_volume"
                            }
                        ]
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "stack_observability_mimir_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_mimir_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_mimir_volume",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "loki",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": null,
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "http",
                                "To": 3100,
                                "Value": 3100
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "cli",
                                "To": 7946,
                                "Value": 7946
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "grpc",
                                "To": 9095,
                                "Value": 9005
                            }
                        ]
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "health",
                                "OnUpdate": "require_healthy",
                                "Path": "/ready",
                                "PortLabel": "http",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "loki",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "http",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": [
                            "prometheus",
                            "prometheus:server_id=${NOMAD_ALLOC_NAME}",
                            "prometheus:version=2.9.16"
                        ],
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "labels": [
                                {
                                    "com.github.logunifier.application.name": "loki",
                                    "com.github.logunifier.application.pattern.key": "logfmt",
                                    "com.github.logunifier.application.version": "2.8.2"
                                }
                            ],
                            "ports": [
                                "http",
                                "cli",
                                "grpc"
                            ],
                            "volumes": [
                                "local/loki.yaml:/config/loki.yaml"
                            ],
                            "args": [
                                "-config.file",
                                "/config/loki.yaml",
                                "-config.expand-env",
                                "true",
                                "-print-config-stderr",
                                "true"
                            ],
                            "image": "registry.cloud.private/grafana/loki:2.8.2"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "JAEGER_SAMPLER_PARAM": "1",
                            "JAEGER_SAMPLER_TYPE": "const",
                            "JAEGER_TRACEID_128BIT": "true",
                            "JAEGER_ENDPOINT": "http://tempo-jaeger.service.consul:14268/api/traces?format=jaeger.thrift",
                            "JAEGER_REPORTER_LOG_SPANS": "true"
                        },
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "loki",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "MemoryMaxMB": 32768,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "local/loki.yaml",
                                "EmbeddedTmpl": "auth_enabled: false\n\nserver:\n  #default 3100\n  http_listen_port: 3100\n  #default 9005\n  #grpc_listen_port: 9005\n  # Max gRPC message size that can be received\n  # CLI flag: -server.grpc-max-recv-msg-size-bytes\n  #default 4194304 -> 4MB\n  grpc_server_max_recv_msg_size: 419430400\n\n  # Max gRPC message size that can be sent\n  # CLI flag: -server.grpc-max-send-msg-size-bytes\n  #default 4194304 -> 4MB\n  grpc_server_max_send_msg_size:  419430400\n\n  # Limit on the number of concurrent streams for gRPC calls (0 = unlimited)\n  # CLI flag: -server.grpc-max-concurrent-streams\n  grpc_server_max_concurrent_streams:  100\n\n  # Log only messages with the given severity or above. Supported values [debug,\n  # info, warn, error]\n  # CLI flag: -log.level\n  log_level: \"warn\"\ningester:\n  wal:\n    enabled: true\n    dir: /data/wal\n  lifecycler:\n    address: 127.0.0.1\n    ring:\n      kvstore:\n        store: memberlist\n      replication_factor: 1\n    final_sleep: 0s\n  chunk_idle_period: 5m\n  chunk_retain_period: 30s\n  chunk_encoding: snappy\n\nruler:\n  evaluation_interval : 1m\n  poll_interval: 1m\n  storage:\n    type: local\n    local:\n      directory: /data/rules\n  rule_path: /data/scratch\n++- range  $index, $service := service \"mimir\" -++\n++- if eq $index 0 ++\n  alertmanager_url: http://++$service.Name++.service.consul:++ $service.Port ++/alertmanager\n++- end ++\n++- end ++\n\n  ring:\n    kvstore:\n      store: memberlist\n  enable_api: true\n  enable_alertmanager_v2: true\n\ncompactor:\n  working_directory: /data/retention\n  shared_store: filesystem\n  compaction_interval: 10m\n  retention_enabled: true\n  retention_delete_delay: 2h\n  retention_delete_worker_count: 150\n\nschema_config:\n  configs:\n    - from: 2023-03-01\n      store: boltdb-shipper\n      object_store: filesystem\n      schema: v12\n      index:\n        prefix: index_\n        period: 24h\n\nstorage_config:\n  boltdb_shipper:\n    active_index_directory: /data/index\n    cache_location: /data/index-cache\n    shared_store: filesystem\n  filesystem:\n    directory: /data/chunks\n  index_queries_cache_config:\n    enable_fifocache: false\n    embedded_cache:\n      max_size_mb: 4096\n      enabled: true\nquerier:\n  multi_tenant_queries_enabled: false\n  max_concurrent: 4096\n  query_store_only: false\n\nquery_scheduler:\n  max_outstanding_requests_per_tenant: 10000\n\nquery_range:\n  cache_results: true\n  results_cache:\n    cache:\n      enable_fifocache: false\n      embedded_cache:\n        enabled: true\n\nchunk_store_config:\n  chunk_cache_config:\n    enable_fifocache: false\n    embedded_cache:\n      max_size_mb: 4096\n      enabled: true\n  write_dedupe_cache_config:\n    enable_fifocache: false\n    embedded_cache:\n      max_size_mb: 4096\n      enabled: true\n\ndistributor:\n  ring:\n    kvstore:\n      store: memberlist\n\ntable_manager:\n  retention_deletes_enabled: true\n  retention_period: 24h\n\nlimits_config:\n  ingestion_rate_mb: 64\n  ingestion_burst_size_mb: 8\n  max_label_name_length: 4096\n  max_label_value_length: 8092\n  enforce_metric_name: false\n  # Loki will reject any log lines that have already been processed and will not index them again\n  reject_old_samples: false\n  # 5y\n  reject_old_samples_max_age: 43800h\n  # The limit to length of chunk store queries. 
0 to disable.\n  # 5y\n  max_query_length: 43800h\n  # Maximum number of log entries that will be returned for a query.\n  max_entries_limit_per_query: 20000\n  # Limit the maximum of unique series that is returned by a metric query.\n  max_query_series: 100000\n  # Maximum number of queries that will be scheduled in parallel by the frontend.\n  max_query_parallelism: 64\n  split_queries_by_interval: 24h\n  # Alter the log line timestamp during ingestion when the timestamp is the same as the\n  # previous entry for the same stream. When enabled, if a log line in a push request has\n  # the same timestamp as the previous line for the same stream, one nanosecond is added\n  # to the log line. This will preserve the received order of log lines with the exact\n  # same timestamp when they are queried, by slightly altering their stored timestamp.\n  # NOTE: This is imperfect, because Loki accepts out of order writes, and another push\n  # request for the same stream could contain duplicate timestamps to existing\n  # entries and they will not be incremented.\n  # CLI flag: -validation.increment-duplicate-timestamps\n  increment_duplicate_timestamp: true\n  #Log data retention for all\n  retention_period: 24h\n  # Comment this out for fine grained retention\n#  retention_stream:\n#  - selector: '{namespace=\"dev\"}'\n#    priority: 1\n#    period: 24h\n  # Comment this out for having overrides\n#  per_tenant_override_config: /etc/overrides.yaml",
                                "Envvars": false,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "++",
                                "Perms": "0644",
                                "RightDelim": "++",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/data",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_loki_volume"
                            }
                        ]
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "stack_observability_loki_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_loki_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_loki_volume",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "tempo",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": null,
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "jaeger",
                                "To": 14268,
                                "Value": 14268
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "tempo",
                                "To": 3200,
                                "Value": 3200
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "otlp_grpc",
                                "To": 4317,
                                "Value": 4317
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "otlp_http",
                                "To": 4318,
                                "Value": 4318
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "zipkin",
                                "To": 9411,
                                "Value": 9411
                            }
                        ]
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "health",
                                "OnUpdate": "require_healthy",
                                "Path": "/ready",
                                "PortLabel": "tempo",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "tempo",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "tempo",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": null,
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "tempo_zipkin_check",
                                "OnUpdate": "require_healthy",
                                "Path": "",
                                "PortLabel": "",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 1000000000,
                                "Type": "tcp"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "tempo-zipkin",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "zipkin",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": null,
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "tempo_jaeger_check",
                                "OnUpdate": "require_healthy",
                                "Path": "",
                                "PortLabel": "",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 1000000000,
                                "Type": "tcp"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "tempo-jaeger",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "jaeger",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": null,
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "tempo_otlp_grpc_check",
                                "OnUpdate": "require_healthy",
                                "Path": "",
                                "PortLabel": "",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 1000000000,
                                "Type": "tcp"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "tempo-otlp-grpc",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "otlp_grpc",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": null,
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "tempo_otlp_http_check",
                                "OnUpdate": "require_healthy",
                                "Path": "",
                                "PortLabel": "",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 1000000000,
                                "Type": "tcp"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "tempo-otlp-http",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "otlp_http",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "args": [
                                "-config.file",
                                "/config/tempo.yaml",
                                "-config.expand-env",
                                "true"
                            ],
                            "image": "registry.cloud.private/grafana/tempo:2.1.1",
                            "labels": [
                                {
                                    "com.github.logunifier.application.version": "2.1.1",
                                    "com.github.logunifier.application.name": "tempo",
                                    "com.github.logunifier.application.pattern.key": "logfmt"
                                }
                            ],
                            "ports": [
                                "jaeger",
                                "tempo",
                                "otlp_grpc",
                                "otlp_http",
                                "zipkin"
                            ],
                            "volumes": [
                                "local/tempo.yaml:/config/tempo.yaml"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "JAEGER_SAMPLER_TYPE": "const",
                            "JAEGER_TRACEID_128BIT": "true",
                            "JAEGER_REPORTER_LOG_SPANS": "true",
                            "JAEGER_SAMPLER_PARAM": "1"
                        },
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "tempo",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "MemoryMaxMB": 32768,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "local/tempo.yaml",
                                "EmbeddedTmpl": "multitenancy_enabled: false\n\nserver:\n  http_listen_port: 3200\n\ndistributor:\n  receivers:                           # this configuration will listen on all ports and protocols that tempo is capable of.\n    jaeger:                            # the receives all come from the OpenTelemetry collector.  more configuration information can\n      protocols:                       # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver\n        thrift_http:                   #\n        grpc:                          # for a production deployment you should only enable the receivers you need!\n        thrift_binary:\n        thrift_compact:\n    zipkin:\n    otlp:\n      protocols:\n        http:\n        grpc:\n    opencensus:\n\ningester:\n  trace_idle_period: 10s               # the length of time after a trace has not received spans to consider it complete and flush it\n  max_block_bytes: 1_000_000           # cut the head block when it hits this size or ...\n  max_block_duration: 5m               #   this much time passes\n\ncompactor:\n  compaction:\n    compaction_window: 1h              # blocks in this time window will be compacted together\n    max_block_bytes: 100_000_000       # maximum size of compacted blocks\n    block_retention: 24h               # Duration to keep blocks 1d\n\nmetrics_generator:\n  registry:\n    external_labels:\n      source: tempo\n      cluster: nomadder1\n  storage:\n    path: /data/generator/wal\n    remote_write:\n++- range service \"mimir\" ++\n      - url: http://++.Name++.service.consul:++.Port++/api/v1/push\n        send_exemplars: true\n        headers:\n          x-scope-orgid: 1\n++- end ++\n\nstorage:\n  trace:\n    backend: local                     # backend configuration to use\n    block:\n      bloom_filter_false_positive: .05 # bloom filter false positive rate.  lower values create larger filters but fewer false positives\n    wal:\n      path: /data/wal             # where to store the the wal locally\n    local:\n      path: /data/blocks\n    pool:\n      max_workers: 100                 # worker pool determines the number of parallel requests to the object store backend\n      queue_depth: 10000\n\nquery_frontend:\n  search:\n    # how to define year here ? define 5 years\n    max_duration: 43800h\n\noverrides:\n  metrics_generator_processors: [service-graphs, span-metrics]",
                                "Envvars": false,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "++",
                                "Perms": "0644",
                                "RightDelim": "++",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/data",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_tempo_volume"
                            }
                        ]
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "stack_observability_tempo_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_tempo_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_tempo_volume",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "nats",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "http",
                                "To": 8222,
                                "Value": 0
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "cluster",
                                "To": 6222,
                                "Value": 0
                            },
                            {
                                "HostNetwork": "default",
                                "Label": "prometheus-exporter",
                                "To": 7777,
                                "Value": 0
                            }
                        ],
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "client",
                                "To": 4222,
                                "Value": 4222
                            }
                        ]
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "service: \"nats\" check",
                                "OnUpdate": "require_healthy",
                                "Path": "/healthz",
                                "PortLabel": "http",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "nats",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "client",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 5000000000,
                                "Method": "",
                                "Name": "service: \"nats-prometheus-exporter\" check",
                                "OnUpdate": "require_healthy",
                                "Path": "/metrics",
                                "PortLabel": "prometheus-exporter",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "nats-prometheus-exporter",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "prometheus-exporter",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": [
                            "prometheus",
                            "prometheus:server_id=${NOMAD_ALLOC_NAME}",
                            "prometheus:version=2.9.16"
                        ],
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "args": [
                                "-varz",
                                "-channelz",
                                "-connz",
                                "-gatewayz",
                                "-leafz",
                                "-serverz",
                                "-subz",
                                "-jsz=all",
                                "-use_internal_server_id",
                                "http://localhost:${NOMAD_PORT_http}"
                            ],
                            "image": "registry.cloud.private/natsio/prometheus-nats-exporter:0.11.0",
                            "labels": [
                                {
                                    "com.github.logunifier.application.name": "prometheus-nats-exporter",
                                    "com.github.logunifier.application.pattern.key": "tslevelmsg",
                                    "com.github.logunifier.application.version": "0.11.0.0"
                                }
                            ],
                            "ports": [
                                "prometheus_exporter"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": {
                            "Hook": "poststart",
                            "Sidecar": true
                        },
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "nats-prometheus-exporter",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 300,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    },
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "ports": [
                                "client",
                                "http",
                                "cluster"
                            ],
                            "volumes": [
                                "local/nats.conf:/config/nats.conf"
                            ],
                            "args": [
                                "-c",
                                "/config/nats.conf",
                                "-js"
                            ],
                            "image": "registry.cloud.private/nats:2.9.16-alpine",
                            "labels": [
                                {
                                    "com.github.logunifier.application.pattern.key": "tslevelmsg",
                                    "com.github.logunifier.application.version": "2.9.16",
                                    "com.github.logunifier.application.name": "nats"
                                }
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "nats",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "MemoryMaxMB": 32768,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "local/nats.conf",
                                "EmbeddedTmpl": "# Client port of ++ env \"NOMAD_PORT_client\" ++ on all interfaces\nport: ++ env \"NOMAD_PORT_client\" ++\n\n# HTTP monitoring port\nmonitor_port: ++ env \"NOMAD_PORT_http\" ++\nserver_name: \"++ env \"NOMAD_ALLOC_NAME\" ++\"\n#If true enable protocol trace log messages. Excludes the system account.\ntrace: false\n#If true enable protocol trace log messages. Includes the system account.\ntrace_verbose: false\n#if true enable debug log messages\ndebug: false\nhttp_port: ++ env \"NOMAD_PORT_http\" ++\n#http: nats.service.consul:++ env \"NOMAD_PORT_http\" ++\n\njetstream {\n  store_dir: /data/jetstream\n\n  # 1GB\n  max_memory_store: 2G\n\n  # 10GB\n  max_file_store: 10G\n}\n",
                                "Envvars": false,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "++",
                                "Perms": "0644",
                                "RightDelim": "++",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/data/jetstream",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_nats_volume"
                            }
                        ]
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "stack_observability_nats_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_nats_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_nats_volume",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "grafana-agent",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "server",
                                "To": 0,
                                "Value": 0
                            }
                        ],
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": null
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "service: \"grafana-agent-health\" check",
                                "OnUpdate": "require_healthy",
                                "Path": "/-/healthy",
                                "PortLabel": "server",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "grafana-agent-health",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "server",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    },
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 5
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "service: \"grafana-agent-ready\" check",
                                "OnUpdate": "require_healthy",
                                "Path": "/-/ready",
                                "PortLabel": "server",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "grafana-agent-ready",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "server",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "labels": [
                                {
                                    "com.github.logunifier.application.version": "0.33.1",
                                    "com.github.logunifier.application.name": "grafana_agent",
                                    "com.github.logunifier.application.pattern.key": "logfmt"
                                }
                            ],
                            "volumes": [
                                "local/agent.yaml:/config/agent.yaml"
                            ],
                            "args": [
                                "-config.file",
                                "/config/agent.yaml",
                                "-server.http.address",
                                ":${NOMAD_HOST_PORT_server}"
                            ],
                            "image": "registry.cloud.private/grafana/agent:v0.33.1"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "grafana-agent",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 64,
                            "MemoryMaxMB": 2048,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 1,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeScript": null,
                                "ChangeSignal": "",
                                "DestPath": "local/agent.yaml",
                                "EmbeddedTmpl": "server:\n  log_level: info\n\nmetrics:\n  wal_directory: \"/data/wal\"\n  global:\n    scrape_interval: 5s\n    remote_write:\n++- range service \"mimir\" ++\n      - url: http://++.Name++.service.consul:++.Port++/api/v1/push\n++- end ++\n  configs:\n    - name: integrations\n      scrape_configs:\n        - job_name: integrations/traefik\n          scheme: http\n          metrics_path: '/metrics'\n          static_configs:\n          - targets:\n
  - ingress.cloud.private:8081\n        # grab all metric endpoints with stadanrd /metrics endpoint\n        - job_name: \"integrations/consul_sd\"\n          consul_sd_configs:\n            - server: \"consul.service.consul:8501\"\n              tags: [\"prometheus\"]\n              tls_config:\n                insecure_skip_verify: true\n                ca_file: \"/certs/ca/ca.crt\"\n                cert_file: \"/certs/consul/consul.pem\"\n                key_file: \"/certs/consul/consul-key.pem\"\n              datacenter: \"nomadder1\"\n              scheme: https\n          relabel_configs:\n            - source_labels: [__meta_consul_node]\n              target_label: instance\n            - source_labels: [__meta_consul_service]\n              target_label: service\n#            - source_labels: [__meta_consul_tags]\n#              separator:     ','\n#              regex:         'prometheus:([^=]+)=([^,]+)'\n#              target_label:  '$${1}'\n#              replacement:   '$${2}'\n
 - source_labels: [__meta_consul_tags]\n              separator:     ','\n              regex:         '.*,prometheus:server_id=([^,]+),.*'\n              target_label:  'server_id'\n              replacement:   '$${1}'\n            - source_labels: [__meta_consul_tags]\n              separator:     ','\n              regex:         '.*,prometheus:version=([^,]+),.*'\n              target_label:  'version'\n              replacement:   '$${1}'\n            - source_labels: ['__meta_consul_tags']\n              target_label: 'labels'\n              regex: '(.+)'\n              replacement: '$${1}'\n              action: 'keep'\n #           - action: replace\n #             replacement: '1'\n #
 target_label: 'test'\n          metric_relabel_configs:\n            - action: labeldrop\n
   regex: 'exported_.*'\n\n\n        - job_name: \"integrations/consul_sd_minio\"\n          metrics_path: \"/minio/v2/metrics/cluster\"\n          consul_sd_configs:\n            - server: \"consul.service.consul:8501\"\n              tags: [\"prometheus_minio\"]\n              tls_config:\n
   insecure_skip_verify: true\n                ca_file: \"/certs/ca/ca.crt\"\n                cert_file: \"/certs/consul/consul.pem\"\n                key_file: \"/certs/consul/consul-key.pem\"\n
     datacenter: \"nomadder1\"\n              scheme: https\n          relabel_configs:\n            - source_labels: [__meta_consul_node]\n              target_label: instance\n            - source_labels: [__meta_consul_service]\n              target_label: service\n#            - source_labels: [__meta_consul_tags]\n#              separator:     ','\n#              regex:         'prometheus:([^=]+)=([^,]+)'\n#              target_label:  '$${1}'\n#              replacement:   '$${2}'\n            - source_labels: [__meta_consul_tags]\n              separator:     ','\n              regex:         '.*,prometheus:server=([^,]+),.*'\n              target_label:  'server'\n              replacement:   '$${1}'\n            - source_labels: [__meta_consul_tags]\n              separator:     ','\n
   regex:         '.*,prometheus:version=([^,]+),.*'\n              target_label:  'version'\n              replacement:   '$${1}'\n            - source_labels: ['__meta_consul_tags']\n              target_label: 'labels'\n              regex: '(.+)'\n              replacement: '$${1}'\n              action: 'keep'\n#            - action: replace\n#              replacement: '38'\n#              target_label: 'test'\n          metric_relabel_configs:\n            - action: labeldrop\n              regex: 'exported_.*'",
                                "Envvars": false,
                                "ErrMissingKey": false,
                                "Gid": null,
                                "LeftDelim": "++",
                                "Perms": "0644",
                                "RightDelim": "++",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "Uid": null,
                                "VaultGrace": 0,
                                "Wait": null
                            }
                        ],
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": [
                            {
                                "Destination": "/data/wal",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "stack_observability_grafana_agent_volume"
                            },
                            {
                                "Destination": "/certs/ca",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "ca_certs"
                            },
                            {
                                "Destination": "/certs/consul",
                                "PropagationMode": "private",
                                "ReadOnly": false,
                                "Volume": "cert_consul"
                            }
                        ]
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": {
                    "cert_consul": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "cert_consul",
                        "PerAlloc": false,
                        "ReadOnly": true,
                        "Source": "cert_consul",
                        "Type": "host"
                    },
                    "stack_observability_grafana_agent_volume": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "stack_observability_grafana_agent_volume",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "stack_observability_grafana_agent_volume",
                        "Type": "host"
                    },
                    "ca_certs": {
                        "AccessMode": "",
                        "AttachmentMode": "",
                        "MountOptions": null,
                        "Name": "ca_certs",
                        "PerAlloc": false,
                        "ReadOnly": true,
                        "Source": "ca_cert",
                        "Type": "host"
                    }
                }
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.consul.version}",
                        "Operand": "semver",
                        "RTarget": ">= 1.7.0"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "MaxClientDisconnect": null,
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "logunifier",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "health",
                                "To": 3000,
                                "Value": 0
                            }
                        ],
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "bridge",
                        "ReservedPorts": null
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 10000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 5000000000,
                    "Interval": 3600000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": [
                    {
                        "Address": "",
                        "AddressMode": "auto",
                        "CanaryMeta": null,
                        "CanaryTags": null,
                        "CheckRestart": null,
                        "Checks": [
                            {
                                "AddressMode": "",
                                "Advertise": "",
                                "Args": null,
                                "Body": "",
                                "CheckRestart": {
                                    "Grace": 60000000000,
                                    "IgnoreWarnings": false,
                                    "Limit": 3
                                },
                                "Command": "",
                                "Expose": false,
                                "FailuresBeforeCritical": 0,
                                "GRPCService": "",
                                "GRPCUseTLS": false,
                                "Header": null,
                                "InitialStatus": "",
                                "Interval": 10000000000,
                                "Method": "",
                                "Name": "service: \"logunifier-health\" check",
                                "OnUpdate": "require_healthy",
                                "Path": "/health",
                                "PortLabel": "health",
                                "Protocol": "",
                                "SuccessBeforePassing": 0,
                                "TLSSkipVerify": false,
                                "TaskName": "",
                                "Timeout": 2000000000,
                                "Type": "http"
                            }
                        ],
                        "Connect": null,
                        "EnableTagOverride": false,
                        "Meta": null,
                        "Name": "logunifier-health",
                        "OnUpdate": "require_healthy",
                        "PortLabel": "health",
                        "Provider": "consul",
                        "TaggedAddresses": null,
                        "Tags": null,
                        "TaskName": ""
                    }
                ],
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "labels": [
                                {
                                    "com.github.logunifier.application.name": "logunifier",
                                    "com.github.logunifier.application.pattern.key": "tslevelmsg",
                                    "com.github.logunifier.application.strip.ansi": "true",
                                    "com.github.logunifier.application.version": "0.1.1"
                                }
                            ],
                            "ports": [
                                "health"
                            ],
                            "args": [
                                "-loglevel",
                                "debug",
                                "-natsServers",
                                "nats.service.consul:4222",
                                "-lokiServers",
                                "loki.service.consul:9005"
                            ],
                            "image": "registry.cloud.private/suikast42/logunifier:0.1.1"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "Identity": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "Disabled": false,
                            "Enabled": null,
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "logunifier",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 64,
                            "MemoryMaxMB": 2048,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 3,
                            "Delay": 5000000000,
                            "Interval": 3600000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 3600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 30000000000
        },
        "VaultNamespace": "",
        "VaultToken": "",
        "Version": 0
    }
}

@suikast42
Copy link
Contributor Author

And here is the log

Explore-logs-2023-05-08 18_36_15.txt

@gregory112
Copy link

@suikast42 After upgrading to 1.5.5 and restarting all nodes, I can see that the zombie services get cleared from Consul. After running my ZooKeeper job for a while, though, the problem comes back. Like you, I also see a lot of services that are never deregistered, even after the job is stopped.

image

There are actually only three ZK instances, but more than 60 are registered there. As a result, DNS service discovery hands out IP addresses of instances that no longer exist, causing dependent services to fail.

@ahjohannessen
Copy link

FYI, I have the same issue. After doing nomad system gc and restarting the Consul and Nomad clients, sometimes the problem is resolved and sometimes not; sometimes I also have to restart the Consul servers.

It seems there is some bug where Nomad does not tell Consul to deregister. Very frustrating.

@suikast42
Copy link
Contributor Author

FYI, I have the same issue. After doing nomad system gc and restarting the Consul and Nomad clients, sometimes the problem is resolved and sometimes not; sometimes I also have to restart the Consul servers.

It seems there is some bug where Nomad does not tell Consul to deregister. Very frustrating.

My observation is that the start and stop order of the cluster matters.

Shutdown

  1. Stop all agent nodes
  2. Stop all master nodes (I have only one; not tested with an HA master setup)

Boot

  1. Boot the masters (at least one)
  2. Then boot the agents.

If I don't respect this order, then the "zombies" come back.
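
For anyone scripting that order, a rough sketch of what I mean, assuming systemd-managed Nomad units and SSH access (host and unit names are placeholders, not my actual setup):

AGENTS=(worker-01 worker-02)
MASTERS=(master-01)

# shutdown: agents first, then masters
for h in "${AGENTS[@]}";  do ssh "$h" sudo systemctl stop nomad; done
for h in "${MASTERS[@]}"; do ssh "$h" sudo systemctl stop nomad; done

# boot: masters first, then agents
for h in "${MASTERS[@]}"; do ssh "$h" sudo systemctl start nomad; done
for h in "${AGENTS[@]}";  do ssh "$h" sudo systemctl start nomad; done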

@shoenig
Copy link
Member

shoenig commented May 10, 2023

@suikast42 (or others), any chance you can turn on TRACE level logging on the Nomad client, and send those? @gulducat and I have each spent a few hours trying to reproduce the symptoms here but neither of us have been able to. The extra verbose logging may help us understand what we need for a reproduction.
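
For reference, trace logging can be enabled either by setting log_level = "TRACE" in the client agent config and restarting that agent, or streamed without a restart; a minimal sketch of the latter (the node ID and output path are placeholders):

nomad monitor -log-level=TRACE -node-id=<client-node-id> > nomad-client-trace.log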

@seanamos
Copy link

I'm also seeing this new behavior in 1.5.5 after upgrading from 1.4.5. Lots of dead zombie registrations left in Consul.

@suikast42
Copy link
Contributor Author

suikast42 commented May 11, 2023

@suikast42 (or others), any chance you can turn on TRACE level logging on the Nomad client, and send those? @gulducat and I have each spent a few hours trying to reproduce the symptoms here but neither of us have been able to. The extra verbose logging may help us understand what we need for a reproduction.

For sure.
Explore-logs-2023-05-11 11_42_02.txt

EDIT:
The allocation IDs of the zombie tasks:

keycloak-postgres:
857749e0-52fe-92ee-7bef-fafbe67605ee ( zombie)
27dfb19c-1e44-2e49-a689-0a4e369f7bd2 (alive)

Grafana agent is in the same group, with two services:

grafana-agent-health:
a04015b3-dc90-7f18-8bfd-c1cf7bc37eff( zombie)
9a3ae9f7-2ed3-c25c-12d9-d792452841d8 (alive)

grafana-agent-ready:
a04015b3-dc90-7f18-8bfd-c1cf7bc37eff ( zombie)
9a3ae9f7-2ed3-c25c-12d9-d792452841d8 (alive)

nats:
54969951-d541-ae97-922a-7db38096bae5 ( zombie)
4d92967c-5996-752c-1cac-6f079b2c8099 (alive)

nats-prometheus-exporter: 
54969951-d541-ae97-922a-7db38096bae5 ( zombie)
4d92967c-5996-752c-1cac-6f079b2c8099 (alive)

Zombie and alive allocs with static ports:
86dcc3e0-5f12-7d37-85c4-1d9b6c82c075  ( zombie)
b9bd1537-0bae-8c11-41b3-437a4c21df29  (alive)
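
In case it helps to map the Consul entries back to these alloc IDs, here is a small sketch that prints the alloc ID embedded in each registered instance of a service, assuming Nomad's usual _nomad-task-<alloc-id>-... service ID format (address, token and service name are placeholders):

CONSUL_HTTP_ADDR="http://consul.service.consul:8500"
CONSUL_TOKEN=XXXX
SERVICE=nats

# list every registered instance of the service, extract the alloc id from its
# ServiceID, then compare against `nomad job status` / `nomad alloc status`;
# IDs that Nomad no longer knows about are the zombies
curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" \
  "${CONSUL_HTTP_ADDR}/v1/catalog/service/${SERVICE}" |
  jq -r '.[].ServiceID' |
  sed -E 's/^_nomad-task-([0-9a-f-]{36}).*/\1/' |
  sort -u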

@seanamos
Copy link

To fill in some of the missing details from my earlier post.

Currently deployed into our testing environment:
Nomad 1.5.5
Consul 1.14.6 (plan to upgrade to 1.15.2)

I'm monitoring our environment, trying to catch it in the act to get relevant logs, but it's very much the same problem:
2 zombie instances
Screenshot 2023-05-11 at 14 08 55

Not visible from Nomad.
Screenshot 2023-05-11 at 14 10 57

Screenshot 2023-05-11 at 14 14 48

These are running in an ASG as EC2 spot instances. Terminating the EC2 instance(s) hosting the zombies gets rid of the zombie registrations; however, given time, they return, not necessarily for the same service.

@suikast42
Copy link
Contributor Author

Can confirm the same

@seanamos
Copy link

seanamos commented May 11, 2023

For me at least, stopping the affected jobs in Nomad does not deregister the zombie instances from Consul, only the instances Nomad lists within the job's allocations are deregistered.

@suikast42
Copy link
Contributor Author

For me at least, stopping the affected jobs in Nomad does not deregister the zombie instances from Consul, only the instances Nomad lists within the job's allocations are deregistered.

What about draining the node, restarting Nomad, and marking the node eligible again?

@seanamos
Copy link

seanamos commented May 12, 2023

The pattern I'm seeing for zombie instances in Consul is a task failing its health check, going into "fail" mode, and Nomad reallocating the task group. The failed tasks are never culled from Consul. Important to note: this doesn't happen 100% of the time.
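
For illustration, this pattern comes from the usual combination of check_restart with a restart stanza in "fail" mode; a rough group fragment (a sketch with made-up names and values, not our actual job spec):

group "api" {
  restart {
    attempts = 1
    interval = "1h"
    delay    = "5s"
    mode     = "fail"        # task is marked failed once the attempts are used up
  }

  service {
    name = "api"
    port = "http"

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        limit = 3            # restart the task after 3 consecutive failed checks
        grace = "60s"
      }
    }
  }
}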

Some additional info:
We are using Connect; the impacted services I've seen use a Connect sidecar, though this may just be coincidence.
TLS + ACLs are enabled.

After upgrading to Consul 1.14.x, I had to add this config to Nomad clients to get Connect sidecars working again; otherwise the Connect sidecars would fail to communicate with Consul over gRPC (probably unrelated, but thought I would mention it anyway):

consul {
  grpc_address = "127.0.0.1:8503"
  grpc_ca_file = "/opt/consul/tls/ca-cert.pem"
  # ...
}

@suikast42
Copy link
Contributor Author

The pattern I'm seeing for zombie instances in Consul is a task failing its health check, going into "fail" mode, and Nomad reallocating the task group. The failed tasks are never culled from Consul. Important to note: this doesn't happen 100% of the time.

Some additional info: We are using Connect; the impacted services I've seen use a Connect sidecar, though this may just be coincidence. TLS + ACLs are enabled.

After upgrading to Consul 1.14.x, I had to add this config to Nomad clients to get Connect sidecars working again; otherwise the Connect sidecars would fail to communicate with Consul over gRPC (probably unrelated, but thought I would mention it anyway):

consul {
  grpc_address = "127.0.0.1:8503"
  grpc_ca_file = "/opt/consul/tls/ca-cert.pem"
  # ...
}

There are some changes in the gRPC communication. I am on Consul 1.15.2.

This issue does not relate to Consul; I think the issue is on the Nomad side. You can delete all passing services from Consul, but Nomad registers the zombies again.

shoenig added a commit that referenced this issue May 15, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
shoenig added a commit that referenced this issue May 15, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
@shoenig
Copy link
Member

shoenig commented May 15, 2023

@Artanicus at the moment the most helpful thing would be for folks to build and run this branch of Nomad as their Client on an affected node, to see if it helps. No need to update servers.

This is the 1.5.5 release but with one extra commit (279061f)
https://github.com/hashicorp/nomad/tree/alloc-restart-zombie-1.5.5

The full contributing guide has the details, but basically building Nomad is:

make bootstrap
make pkg/linux_amd64/nomad
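
End to end that looks roughly like the following; the install path and systemd unit name are just examples, adjust them to your client setup:

git clone https://github.com/hashicorp/nomad.git
cd nomad
git checkout alloc-restart-zombie-1.5.5
make bootstrap
make pkg/linux_amd64/nomad

# swap the binary on the affected client and restart it
sudo systemctl stop nomad
sudo install pkg/linux_amd64/nomad /usr/local/bin/nomad
sudo systemctl start nomad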

@vincenthuynh
Copy link

Hello! 👋

We've been seeing this issue as well. It is consistently reproducible by doing the following:

  • Hit the "Restart Allocation" button in the Nomad UI for an allocation (/v1/client/allocation/<alloc_id>/restart)
  • Let allocation restart and become healthy again
  • Hit the "Stop Allocation" button in the Nomad UI for the same allocation (/v1/allocation/<alloc_id>/stop)
  • Observe the rogue service in Consul

A reliable way to clean up Consul is by restarting the Nomad agent with the rogue service.

nomad v1.4.7
consul v1.14.5
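
For the record, the same sequence can be driven from the CLI or HTTP API; a minimal sketch (the alloc ID is a placeholder and NOMAD_ADDR points at the API):

alloc=<alloc_id>

# restart the allocation (same as the UI button)
nomad alloc restart "$alloc"
# or: curl -X PUT "${NOMAD_ADDR}/v1/client/allocation/${alloc}/restart"

# wait for it to become healthy again, then stop it
nomad alloc stop "$alloc"
# or: curl -X PUT "${NOMAD_ADDR}/v1/allocation/${alloc}/stop"

# the Consul registration for this alloc is now left behind as described above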

@jinnatar
Copy link

ACK, doing some ambient testing now with a couple of nodes. Doing a systemctl restart nomad into the new binary did at least clear the previous zombie services, which wasn't always guaranteed before.

I also did some active tests with a slightly modified job that doesn't require raw_exec. Unfortunately I am not seeing the issue fixed, with the caveat that it reproduces for me just from the job's failures and its own retry/rescheduling logic, without any manual restarts of failed allocs.

repro job:

http.nomad
job "http" {
  type = "service"

  group "group" {
    network {
      mode = "host"
      port "http" {}
    }
    count = 5
    spread {
      attribute = "${node.unique.name}"
    }
    volume "httproot" {
      type = "host"
      source = "repro"
      read_only = "true"
    }

    service {
      name     = "myhttp"
      port     = "http"

      check {
        name     = "c1"
        type     = "http"
        port     = "http"
        path     = "/hi.txt"
        interval = "5s"
        timeout  = "1s"
        check_restart {
          limit = 3
          grace = "30s"
        }
      }
    }

    restart {
      attempts = 1
      delay = "10s"
      mode = "fail"
    }

    task "python" {
      driver = "exec"

      volume_mount {
        volume = "httproot"
        destination = "/srv"
        read_only = true
      }

      config {
        command = "python3"
        args    = ["-m", "http.server", "${NOMAD_PORT_http}", "--directory", "/srv"]
      }

      resources {
        cpu    = 10
        memory = 64
      }
    }
  }
}

Once I got the allocs running smoothly (more on that further down), here are some tests:

case1: Making one node with 1.5.5 (8c9fb78a) unhealthy by removing hi.txt

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
02260a00  b6d2eb73  group       3        run      running  32m ago  3m45s ago
0b298313  b156e382  group       3        run      running  32m ago  3m45s ago
1e8e3d2e  b156e382  group       3        run      running  32m ago  3m45s ago
c3ad9828  8c9fb78a  group       3        run      failed   32m ago  4s ago
e457f03c  ddcb75dc  group       3        run      running  32m ago  3m45s ago

After the retries fail, a replacement is kicked off on a healthy node:

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
eeec4a75  b6d2eb73  group       3        run      running  2m44s ago  2m15s ago
02260a00  b6d2eb73  group       3        run      running  35m9s ago  6m54s ago
0b298313  b156e382  group       3        run      running  35m9s ago  6m54s ago
1e8e3d2e  b156e382  group       3        run      running  35m9s ago  6m54s ago
c3ad9828  8c9fb78a  group       3        stop     failed   35m9s ago  2m44s ago
e457f03c  ddcb75dc  group       3        run      running  35m9s ago  6m54s ago

The failed alloc still holds a service, even though I have not attempted any restart on it:

http    10.0.10.2:27137  []    b6d2eb73  eeec4a75
http    10.0.10.2:28915  []    b6d2eb73  02260a00
http    10.0.10.3:27782  []    b156e382  1e8e3d2e
http    10.0.10.3:28067  []    b156e382  0b298313
http    10.0.10.4:20341  []    ddcb75dc  e457f03c
http    10.0.10.5:24669  []    8c9fb78a  c3ad9828

Performing an alloc restart on the failed alloc is accepted with no output produced; it does not affect the state of allocs or services.

case2: Making one node with 1.5.6-dev (b6d2eb73) unhealthy by removing hi.txt

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
eeec4a75  b6d2eb73  group       3        run      failed   10m56s ago  1s ago
02260a00  b6d2eb73  group       3        run      failed   43m21s ago  5s ago
0b298313  b156e382  group       3        run      running  43m21s ago  15m6s ago
1e8e3d2e  b156e382  group       3        run      running  43m21s ago  15m6s ago
c3ad9828  8c9fb78a  group       3        stop     failed   43m21s ago  10m56s ago
e457f03c  ddcb75dc  group       3        run      running  43m21s ago  15m6s ago

It's now getting harder to reschedule onto nodes that still have hi.txt:

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
7eb51aef  b6d2eb73  group       3        run      running  57s ago     6s ago
c8bca832  b6d2eb73  group       3        run      failed   2m45s ago   1m24s ago
8430239d  8c9fb78a  group       3        stop     failed   3m18s ago   57s ago
eeec4a75  b6d2eb73  group       3        stop     failed   14m39s ago  2m45s ago
02260a00  b6d2eb73  group       3        stop     failed   47m4s ago   3m18s ago
0b298313  b156e382  group       3        run      running  47m4s ago   18m49s ago
1e8e3d2e  b156e382  group       3        run      running  47m4s ago   18m49s ago
c3ad9828  8c9fb78a  group       3        stop     failed   47m4s ago   14m39s ago
e457f03c  ddcb75dc  group       3        run      running  47m4s ago   18m49s ago

This gives us the following service state:

http    10.0.10.2:25084  []    b6d2eb73  c8bca832 # new failure from case2 on 1.5.6-dev node
http    10.0.10.2:26777  []    b6d2eb73  7eb51aef # new failure from case2 on 1.5.6-dev node
http    10.0.10.2:27137  []    b6d2eb73  eeec4a75 # new failure from case2 on 1.5.6-dev node
http    10.0.10.2:28915  []    b6d2eb73  02260a00 # new failure from case2 on 1.5.6-dev node
http    10.0.10.3:27782  []    b156e382  1e8e3d2e # healthy service
http    10.0.10.3:28067  []    b156e382  0b298313 # healthy service
http    10.0.10.4:20341  []    ddcb75dc  e457f03c # healthy service
http    10.0.10.5:23538  []    8c9fb78a  8430239d # new failure from case2 on 1.5.5 node
http    10.0.10.5:24669  []    8c9fb78a  c3ad9828 # failure from case1

Other issues

While modifying the job for the exec driver, the initial grace period was a bit too short for the services to become healthy in time. This left a whole bunch of zombie services behind even after a stop -purge and system gc; I've omitted those services from the outputs above for brevity. I also probably triggered some other bug, so I'll leave some data below for posterity in case it affects my repros above.

After raising the grace period, 2 of the allocs were in the failed state but were somehow still considered active for scheduling purposes.

Increasing count from 4 to 5:

%> nomad plan -verbose oneoffs/zombie-repro.hcl
+/- Job: "http"                                                                                                                                 
+/- Task Group: "group" (1 create, 2 ignore, 2 in-place update)
  +/- Count: "4" => "5" (forces create)                                 
      Task: "python" 
                                                                        
Scheduler dry-run:     
- All tasks successfully allocated.                                                                                                             
                                                                                                                                                
Job Modify Index: 2072279                                               
To submit the job with version verification run:
                                                                                                                                                
nomad job run -check-index 2072279 oneoffs/zombie-repro.hcl
[...]

After a nomad stop -purge http:

%> nomad plan oneoffs/zombie-repro.hcl         
+ Job: "http"
+ Task Group: "group" (3 create, 2 ignore)
  + Task: "python" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 oneoffs/zombie-repro.hcl

[...]

Only after also running nomad system gc are the ignored allocs gone, and scheduling works as I expect:

%> nomad plan oneoffs/zombie-repro.hcl
+ Job: "http"
+ Task Group: "group" (5 create)
  + Task: "python" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 oneoffs/zombie-repro.hcl
[...]

After running them again, now with the longer grace period, all jobs manage to get healthy, but the two failed limbo allocs are still present as registered services. One of them is on a node running 1.5.5, the other on 1.5.6-dev.

I did these tests first with the nomad service provider and, midway through all the failing grace periods, switched to the consul provider. The final remaining zombies stay behind in Consul, but on the Nomad side I got far more service zombies from the initial failure loops, evenly distributed across all nodes regardless of which client version they run. The allocs that are healthy and happy move nicely between the two providers with in-place updates, but the zombies are stuck with the provider where they were created.

... So, the -purge'd and gc'd services still live on somewhere internally and are causing trouble.

@shoenig
Copy link
Member

shoenig commented May 15, 2023

Thank you @Artanicus, that is very helpful. We might ask you to run another binary in the near future, with a lot more trace logging statements, if we still can't figure out what's going on.

I've genuinely put 20 or so hours into trying to create a reproduction beyond the alloc restart case but so far I've got nothing (other than one other unrelated bug).

If you can, could you try to reproduce again with the simplest job possible (count = 1, constrained to the node with a -dev client), with trace-level logging on that client covering the timespan while the reproduction is taking place? The Client config file could also be helpful. Also, are there any logs on the Server(s) about failed RPCs coming from the affected Client?

@jinnatar
Copy link

Here's an attempt at a more hermetic and simple repro with a standalone Vagrant dev agent.

  1. Created a simple vagrant machine out of generic/debian11 with the dev binary and a private network
Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.box = "generic/debian11"
  config.vm.network "private_network", ip: "192.168.56.10"
  config.vm.synced_folder ".", "/local"
  config.vm.provision "shell", inline: <<-SHELL
    install /local/nomad /usr/bin/nomad
  SHELL
end

Version details:

vagrant@debian11:~$ nomad version
Nomad v1.5.6-dev
BuildDate 2023-05-15T13:12:54Z
Revision 279061f8ee966ee290023d90cb7469b59c90a182+CHANGES
  2. Started an agent with nomad agent -dev -bind 0.0.0.0
  3. Ran a simpler repro job
http.nomad
job "http" {
  type = "service"

  group "group" {
    network {
      mode = "host"
      port "http" {}
    }
    count = 1

    service {
      name     = "myhttp"
      port     = "http"
      provider = "nomad"

      check {
        name     = "c1"
        type     = "http"
        port     = "http"
        path     = "/hi.txt"
        interval = "5s"
        timeout  = "1s"
        check_restart {
          limit = 3
          grace = "30s"
        }
      }
    }

    restart {
      attempts = 1
      delay = "10s"
      mode = "fail"
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "${NOMAD_PORT_http}", "--directory", "/tmp"]
      }

      resources {
        cpu    = 10
        memory = 64
      }
    }
  }
}

At first I let it run for a good while without /tmp/hi.txt in place.

This produced the following state during those failures, with services not going away from failed allocs (I find it really weird that Nomad-native services have no health bit I can see):

%> nomad service info myhttp         
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:29346  []    aa21073d  6aa6966a
http    127.0.0.1:21945  []    aa21073d  7ed96d68
http    127.0.0.1:21505  []    aa21073d  cfc634e5

I then created /tmp/hi.txt, which led to a healthy allocation at about 9:09 UTC as per the logs.

State after that in a "healthy" situation:

%> nomad status http
ID            = http
Name          = http
Submit Date   = 2023-05-16T12:01:55+03:00
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
group       0       0         1        3       0         0     0

Latest Deployment
ID          = dc0bfc25
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       1        4       1        3          2023-05-16T09:19:40Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
2245de62  aa21073d  group       0        run      running  8m45s ago   8m32s ago
6aa6966a  aa21073d  group       0        stop     failed   12m5s ago   8m45s ago
7ed96d68  aa21073d  group       0        stop     failed   14m26s ago  12m5s ago
cfc634e5  aa21073d  group       0        stop     failed   16m16s ago  14m26s ago

%> nomad service info myhttp
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:27759  []    aa21073d  2245de62
http    127.0.0.1:29346  []    aa21073d  6aa6966a
http    127.0.0.1:21945  []    aa21073d  7ed96d68
http    127.0.0.1:21505  []    aa21073d  cfc634e5

I then removed /tmp/hi.txt.

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created     Modified
ebd1fc2a  aa21073d  group       0        run      failed  3m41s ago   2m20s ago
2245de62  aa21073d  group       0        stop     failed  20m5s ago   3m41s ago
6aa6966a  aa21073d  group       0        stop     failed  23m25s ago  20m5s ago
7ed96d68  aa21073d  group       0        stop     failed  25m46s ago  23m25s ago
cfc634e5  aa21073d  group       0        stop     failed  27m36s ago  25m46s ago


%> nomad service info myhttp
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:27759  []    aa21073d  2245de62
http    127.0.0.1:29346  []    aa21073d  6aa6966a
http    127.0.0.1:21945  []    aa21073d  7ed96d68
http    127.0.0.1:21505  []    aa21073d  cfc634e5
http    127.0.0.1:25815  []    aa21073d  ebd1fc2a

The logs have a lot of spam about watched checks not being found. I've included the dev agent log in full.
agent.log

Is my assumption even correct that the service entries should go away with failed allocs? I'm not used to working with Nomad-native services and a bit weirded out that there's no visible health bit. But if I were load-balancing something onto that list of services with no extra health checks, then yeah, it would be hitting dead ones constantly.

@suikast42
Copy link
Contributor Author

suikast42 commented May 16, 2023

But if I were load-balancing something onto that list of services with no extra health checks, then yeah, it would be hitting dead ones constantly.

That is true if you use dynamic port binding. If you do the health check on a static port, then the situation is the worst-case scenario: the health check is OK, but the LB delegates to a dead alloc.

@jinnatar
Copy link

jinnatar commented May 16, 2023

Here's an even simpler repro, this time starting from a healthy state. Same Vagrantfile, same agent setup.

After removing hi.txt, new allocs are attempted once the retries are exhausted, and each one creates a new service entry:

ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
6d007bc7  9c8d38ff  group       0        run      running  15s ago    14s ago
6200cf76  9c8d38ff  group       0        stop     failed   2m35s ago  15s ago
843ca52a  9c8d38ff  group       0        stop     failed   5m29s ago  2m35s ago

%> nomad service info myhttp
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:30086  []    9c8d38ff  6200cf76
http    127.0.0.1:22833  []    9c8d38ff  6d007bc7
http    127.0.0.1:23706  []    9c8d38ff  843ca52a

After nomad stop http one service is purged:

%> nomad service info myhttp
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:30086  []    9c8d38ff  6200cf76
http    127.0.0.1:23706  []    9c8d38ff  843ca52a

Even after a nomad system gc the two zombies remain.

If I perform manual service deletes:

%> nomad service info -verbose myhttp 
ID           = _nomad-task-6200cf76-f53e-ed05-511d-e99f4b805c67-group-group-myhttp-http
Service Name = myhttp
Namespace    = default
Job ID       = http
Alloc ID     = 6200cf76-f53e-ed05-511d-e99f4b805c67
Node ID      = 9c8d38ff-89c2-e00e-9817-e3e632e637f1
Datacenter   = dc1
Address      = 127.0.0.1:30086
Tags         = []

ID           = _nomad-task-843ca52a-1445-c0ca-50e8-e2ffbc96d522-group-group-myhttp-http
Service Name = myhttp
Namespace    = default
Job ID       = http
Alloc ID     = 843ca52a-1445-c0ca-50e8-e2ffbc96d522
Node ID      = 9c8d38ff-89c2-e00e-9817-e3e632e637f1
Datacenter   = dc1
Address      = 127.0.0.1:23706
Tags         = []

%> nomad service delete myhttp _nomad-task-843ca52a-1445-c0ca-50e8-e2ffbc96d522-group-group-myhttp-http
Successfully deleted service registration

The delete does seem to stick, unlike what we were seeing on the Consul side, where the entries got re-created. Even after deleting both service entries, though, there's still constant log spam of watch.checks not finding the checks it's looking for:

2023-05-16T10:15:13.104Z [WARN]  watch.checks: watched check not found: check_id=02e2d1135ea5988f3ac6e787c91ff1cf
    2023-05-16T10:15:13.104Z [WARN]  watch.checks: watched check not found: check_id=857b763c84e4c492be404624a0af7f13
    2023-05-16T10:15:14.105Z [WARN]  watch.checks: watched check not found: check_id=02e2d1135ea5988f3ac6e787c91ff1cf
    2023-05-16T10:15:14.105Z [WARN]  watch.checks: watched check not found: check_id=857b763c84e4c492be404624a0af7f13
[...]

Full log:
agent.log
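
For what it's worth, the manual deletes above can be scripted as a stopgap. This is a workaround sketch, not a fix: it assumes NOMAD_ADDR and NOMAD_TOKEN are exported, jq is installed, and "myhttp" is the example service from this repro, and it deletes any registration whose allocation is no longer running (or has already been garbage-collected):

#!/bin/bash
set -euo pipefail

# Workaround sketch: delete Nomad-native service registrations whose alloc is
# no longer running. Assumes NOMAD_ADDR and NOMAD_TOKEN are exported and jq is
# installed; "myhttp" is the example service name from this thread.
SERVICE="myhttp"

curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN:-}" "${NOMAD_ADDR}/v1/service/${SERVICE}" \
  | jq -c '.[]' \
  | while read -r reg; do
      svc_id=$(echo "$reg" | jq -r '.ID')
      alloc_id=$(echo "$reg" | jq -r '.AllocID')
      # A GC'd alloc returns a non-JSON "not found" body; treat that as stale too.
      status=$(curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN:-}" \
        "${NOMAD_ADDR}/v1/allocation/${alloc_id}" | jq -r '.ClientStatus' 2>/dev/null || echo "missing")
      if [ "$status" != "running" ]; then
        echo "Deleting stale registration ${svc_id} (alloc ${alloc_id} is ${status})"
        nomad service delete "$SERVICE" "$svc_id"
      fi
    done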

@shoenig
Member

shoenig commented May 16, 2023

Thanks again @Artanicus, I think we understand what happened now - this bug was originally fixed in 889c5aa0f7 and intended to be backported to 1.3.x, 1.4.x, and 1.5.x (that's what those backport/ labels do - they trigger automation).

Unfortunately something in our pipeline lost/forgot/ignored the 1.5.x backport, so this whole time I've been working off of main (which contains the fix), assuming it was pretty close to 1.5.5. I've started a manual backport to 1.5.x in #17212.

If you could try to reproduce again using branch manual-backport-889c5aa0f7, it should hopefully just be fixed.
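
For anyone else wanting to test an unreleased branch, roughly the following should work. This is a sketch under a few assumptions: a Go toolchain and make are available, and the repo's make bootstrap / make dev targets behave as documented, dropping a dev binary into ./bin:

# Build and run a dev binary from a specific branch (hedged sketch).
git clone https://github.com/hashicorp/nomad.git
cd nomad
git checkout manual-backport-889c5aa0f7   # or the corresponding release branch
make bootstrap   # install build tooling (assumed needed on a fresh clone)
make dev         # build a dev binary into ./bin
./bin/nomad agent -dev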

@jinnatar

The branch manual-backport-889c5aa0f7 doesn't seem to exist anymore, but since it looks like this made it into release/1.5.x, I'll try with that.

  • Same Vagrant & job setup as before, starting from a healthy state.
  • After triggering the failure, while the health check is failing and the task is in a pending state, the service remains active; but as soon as the task is marked as failed, the service is correctly unregistered.
  • As new allocs are created, a new service entry is registered for each even though it's not yet healthy, but these too are correctly cleared as soon as the alloc fails.
  • Stopping the job in the middle of a pending alloc correctly unregisters the service.

So it's at least a significant step in a good direction. With the Consul service provider I imagine the transient failing services are registered, but since they're failing health checks they won't affect downstream load balancing. With built-in Nomad services it's a bit more iffy since there seems to be no distinction between healthy and unhealthy services, but I believe that's a wholly separate bug/feature to debate and not in scope here. :-)
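
For the Consul-provider case, that filtering is exactly what Consul's health-filtered query gives you: a consumer or load balancer that only asks for passing instances never sees the failing ones. A rough sketch, assuming CONSUL_HTTP_ADDR and CONSUL_TOKEN are set and using "whoami" as a placeholder service name:

# Only return instances whose checks are passing.
curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" \
  "${CONSUL_HTTP_ADDR}/v1/health/service/whoami?passing" \
  | jq '.[].Service | {ID, Address, Port}'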

State snapshot for posterity:

vagrant@debian11:~$ nomad version
Nomad v1.5.6-dev
BuildDate 2023-05-16T20:10:32Z
Revision 28fd73195ca43153cd93c2989b7d445325a8fa20+CHANGES
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
18236423  a335897f  group       0        run      running  1m13s ago  22s ago
987ee788  a335897f  group       0        stop     failed   3m33s ago  1m13s ago
b3ef521e  a335897f  group       0        stop     failed   5m37s ago  3m33s ago

%> nomad service info myhttp
Job ID  Address          Tags  Node ID   Alloc ID
http    127.0.0.1:23745  []    a335897f  18236423

Full agent log:
agent.log

@shoenig
Member

shoenig commented May 17, 2023

Thanks again for the testing @Artanicus! We're planning to cut a bugfix release on Monday for this and a few other issues.

@shoenig shoenig added this to the 1.5.x milestone May 17, 2023
@shoenig shoenig pinned this issue May 17, 2023
@shoenig
Member

shoenig commented May 17, 2023

With built-in Nomad services it's a bit more iffy since there seems to be no distinction between healthy and unhealthy services, but I believe that's a wholly separate bug/feature to debate and not in scope here. :-)

Indeed, Nomad's built-in services actually do contain health check information, but the results of those health checks are currently only stored on the Client that executed the check and are used for passing/failing deployments rather than for LB availability. You can view their status per-allocation with nomad alloc checks <allocID>. #15603 is tracking the desire to make them more useful like Consul checks, but we're trying to avoid having to store check state information on the Nomad servers, which is very taxing on the underlying Raft persistent store.
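
To see those client-side check results across a whole job, a loop like the following should work (a sketch, assuming NOMAD_ADDR is set, jq is installed, and "http" is the example job ID used in this thread):

# Print the check status for every allocation of the "http" job.
for alloc in $(curl -s "${NOMAD_ADDR}/v1/job/http/allocations" | jq -r '.[].ID'); do
  echo "== alloc ${alloc}"
  nomad alloc checks "${alloc}"
done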

@suikast42
Contributor Author

It seems that v1.5.6 solves that.

I am closing this issue.

@116davinder

For some reason, the issue still exists for me after upgrading to 1.6.1 on the servers and 1.5.6 on the clients :(
