restarting server and client results in failing Consul health checks #16453
Comments
Hi @suikast42! Here's the relevant bit from your logs:
The first thing I'd look at in this case is whether the health check should be passing at all. Just because the client's handle to the allocation has been restored doesn't mean the allocation is actually able to serve the health check. You should check:
It's also a little unclear whether you've restarted the client agent or the entire host the client is on. If you reboot the host, the allocation will not be restored when the Nomad client starts, and the client will have to get a rescheduled allocation from the server. That might explain why restarting the server caused the problem.
The server will wait until the client misses a heartbeat before marking the client as down.
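For reference, how long the servers wait is tunable. A minimal sketch of the relevant server-side agent config, assuming an otherwise-default setup (the value here is illustrative, not a recommendation):

server {
  enabled = true

  # Additional time the servers wait past a client's heartbeat deadline
  # before marking that client as down. The default is 10s.
  heartbeat_grace = "30s"
}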
I redefined the reschedule, restart, and update configs as follows and no longer have the described issue:

job "minio" {
  datacenters = ["nomadder1"]
  type        = "service"

  reschedule {
    delay          = "10s"
    delay_function = "constant"
    unlimited      = true
  }

  update {
    max_parallel = 1
    health_check = "checks"
    # The alloc is marked unhealthy if it isn't healthy within this time.
    healthy_deadline = "5m"
    auto_revert      = true
    # Mark the task as healthy after 10s of passing checks.
    min_healthy_time = "10s"
    # Fail the deployment if the alloc isn't healthy within 1h.
    progress_deadline = "1h"
  }

  group "minio" {
    count = 1

    volume "stack_core_minio_volume" {
      type      = "host"
      source    = "core_minio_volume"
      read_only = false
    }

    volume "ca_cert" {
      type      = "host"
      source    = "ca_cert"
      read_only = true
    }

    restart {
      attempts = 1
      interval = "1h"
      delay    = "5s"
      mode     = "fail"
    }

    network {
      mode = "bridge"
      port "http" {
        to = 9000
      }
      port "console" {
        to = 9001
      }
    }

    task "minio" {
      volume_mount {
        volume      = "stack_core_minio_volume"
        destination = "/data"
      }

      volume_mount {
        volume = "ca_cert"
        # The server looks for the CAs sub-folder under the specified certs
        # directory. Do not change the sub-folder name CAs.
        destination = "/certs/CAs"
      }

      driver = "docker"

      config {
        image   = "registry.cloud.private/minio/minio:RELEASE.2023-03-09T23-16-13Z"
        command = "server"
        args = [
          "/data",
          "--console-address",
          ":9001",
          "--certs-dir",
          "/certs"
        ]
        ports = ["http", "console"]
      }

      env {
        HOSTNAME         = "${NOMAD_ALLOC_NAME}"
        MINIO_SERVER_URL = "https://minio.cloud.private"
        #MINIO_IDENTITY_OPENID_CONFIG_URL="https://security.cloud.private/realms/nomadder/.well-known/openid-configuration"
        MINIO_PROMETHEUS_AUTH_TYPE = "public"
        MINIO_PROMETHEUS_URL       = "http://mimir.service.consul:9009/prometheus"
        MINIO_PROMETHEUS_JOB_ID    = "integrations/minio"
      }

      template {
        destination = "${NOMAD_SECRETS_DIR}/env.vars"
        env         = true
        change_mode = "restart"
        data        = <<EOF
{{- with nomadVar "nomad/jobs/minio" -}}
MINIO_ROOT_USER = {{.minio_root_user}}
MINIO_ROOT_PASSWORD = {{.minio_root_password}}
{{- end -}}
EOF
      }

      resources {
        cpu        = 500
        memory     = 512
        memory_max = 4096
      }

      service {
        port = "http"
        name = "minio"
        tags = [
          "prometheus_minio",
          "frontend",
          "minio",
          "prometheus:server=${NOMAD_ALLOC_NAME}",
          "prometheus:version=RELEASE.2023-03-09T23-16-13Z",
          "traefik.enable=true",
          "traefik.consulcatalog.connect=false",
          "traefik.http.routers.minio.tls=true",
          "traefik.http.routers.minio.rule=Host(`minio.cloud.private`)",
        ]

        check {
          name     = "minio-live"
          type     = "http"
          path     = "/minio/health/live"
          port     = "http"
          interval = "10s"
          timeout  = "2s"
        }

        check {
          name     = "minio-ready"
          type     = "http"
          path     = "/minio/health/ready"
          port     = "http"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            limit           = 3
            grace           = "60s"
            ignore_warnings = false
          }
        }
      }

      service {
        port = "console"
        name = "minio-console"
        tags = [
          "console",
          "minio",
          "traefik.enable=true",
          "traefik.consulcatalog.connect=false",
          "traefik.http.routers.minio-console.tls=true",
          "traefik.http.routers.minio-console.rule=Host(`minio.console.cloud.private`)",
        ]

        check {
          type     = "http"
          path     = "/"
          port     = "console"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            limit           = 3
            grace           = "60s"
            ignore_warnings = false
          }
        }
      }
    }
  }
}
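My reading of why this helps (my interpretation, not confirmed): restart with attempts = 1 and mode = "fail" hands a failing task over to the scheduler quickly, and the unlimited constant 10s reschedule then keeps placing replacement allocations until the checks pass. check_restart additionally restarts the task after 3 consecutive failed readiness checks once the 60s grace has elapsed.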
Nomad version
Nomad v1.5.1
BuildDate 2023-03-10T22:05:57Z
Revision 6c118dd
Consul version
Consul v1.15.1
Revision 7c04b6a0
Build Date 2023-03-07T20:35:33Z
Operating system and Environment details
Ubuntu 22.04 in VMware Workstation
Single-node master
Single-node worker
Issue
If I deploy the job below for the first time, everything works fine.
If I reboot only the worker machine, it works fine as well.
But if I (re)boot the master (server) machine, the Nomad UI shows a healthy deployment while Consul shows failing health checks.
A restart of the worker deploys the minio job as expected again.