
Services registered with Nomad Service Provider stayed forever after a major server outage #16762

Closed
aofei opened this issue Apr 3, 2023 · 9 comments

Comments

@aofei
Contributor

aofei commented Apr 3, 2023

Nomad version

Nomad v1.5.2
BuildDate 2023-03-21T22:54:38Z
Revision 9a2fdb5

Operating system and Environment details

$ hostnamectl
 Static hostname: dev-machine
       Icon name: computer-vm
         Chassis: vm
      Machine ID: ebe933b74fe25215fc0f3b0be782a241
         Boot ID: 1051fac2caf742699078f5e6c97161b9
  Virtualization: kvm
Operating System: Ubuntu 22.04.2 LTS
          Kernel: Linux 5.15.0-1030-gcp
    Architecture: x86-64
 Hardware Vendor: Google
  Hardware Model: Google Compute Engine

Issue

This bug was actually discovered together with #16760, but I believe they don't share the same root cause, so I opened this issue separately.

So the thing is, for some reason, all three of my Nomad servers went down. And I didn't get them back online in time because I was trying to figure out what was causing them to go down. I thought my running jobs wouldn't be affected during this time. But, unexpectedly, all my jobs involving the nomadVar or nomadService template functions also went down (other jobs kept running).

Anyway, after I brought the servers back online, my jobs that unexpectedly went down are running again. But another thing happened: the services registered by the jobs that went down unexpectedly before were not removed.

Reproduction steps

  1. Copy and run the following shell script:
#!/bin/sh

NOMAD_ISSUE_DIR=/tmp/nomad-issue-16762
mkdir $NOMAD_ISSUE_DIR

cat << EOF > $NOMAD_ISSUE_DIR/server.hcl
name = "server"
data_dir = "$NOMAD_ISSUE_DIR"
server {
	enabled = true
	bootstrap_expect = 1
}
EOF

cat << EOF > $NOMAD_ISSUE_DIR/client.hcl
name = "client"
data_dir = "$NOMAD_ISSUE_DIR"
ports {
	http = 14646
}
client {
	enabled = true
	server_join {
		retry_join = ["127.0.0.1"]
	}
}
EOF

cat << EOF > $NOMAD_ISSUE_DIR/foobar.nomad.hcl
job "foobar" {
	group "foobar" {
		network {
			port "foobar" {}
		}
		task "foobar" {
			driver = "docker"
			config {
				image = "alpine"
				entrypoint = ["tail"]
				args = ["-f", "/dev/null"]
			}
			service {
				name = "foobar"
				port = "foobar"
				provider = "nomad"
			}
			template {
				destination = "local/foobar"
				data = <<TEMPLATE_DATA_EOF
{{with nomadService "foobar"}}{{end}}
TEMPLATE_DATA_EOF
			}
		}
	}
}
EOF
  2. Open terminal tab 1 and execute:
$ nomad agent -config /tmp/nomad-issue-16762/server.hcl &> /tmp/nomad-issue-16762/server.log
  3. Open terminal tab 2 and execute:
$ nomad agent -config /tmp/nomad-issue-16762/client.hcl &> /tmp/nomad-issue-16762/client.log
  4. Open terminal tab 3 and execute:
$ tail -f /tmp/nomad-issue-16762/client.log | grep -e "agent: (view) nomad.var.block(" -e "agent: (view) nomad.service(" -e "(retry" -e "(exceeded"
  5. Open terminal tab 4 and execute:
$ nomad run /tmp/nomad-issue-16762/foobar.nomad.hcl
  6. If everything is fine, do docker container ls and you should get 1 running container: foobar-{allocId}. Do nomad service list and you should get a service named "foobar". Do nomad service info foobar and you should only get one entry.

  7. Now go to terminal tab 1 and stop the server (Ctrl+C).

  8. Now go to terminal tab 3 and wait a few minutes until the "exceeded maximum retries" log line appears.

  9. At this point, do docker container ls and you should not get any running containers.

  10. Now go to terminal tab 1 and start the server again:

$ nomad agent -config /tmp/nomad-issue-16762/server.hcl &> /tmp/nomad-issue-16762/server.log
  11. At this point, do docker container ls and you should get 1 running container: foobar-{allocId} (if not, wait a few seconds). Do nomad service list and you should get a service named "foobar". Do nomad service info foobar and you should get two entries; the one you saw in step 6 should still be there.
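
For reference, once the duplicate entry from step 11 has been identified, it can also be removed by hand. A minimal sketch (<stale-service-id> is a placeholder for the ID of the registration whose allocation no longer exists):

$ # print "<alloc ID> <service ID>" for every registration of "foobar"
$ nomad service info -t '{{ range . }}{{ printf "%s %s\n" .AllocID .ID }}{{ end }}' foobar
$ # delete the stale registration by service name and ID
$ nomad service delete foobar <stale-service-id>
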
@plasmastorm

I think this is the same as #15630. It was only a single-node test cluster in my case, but I could see how all server nodes being down would lead to the same situation.

@michael-strigo

michael-strigo commented May 8, 2023

I just had a very similar scenario. I brought down two of the 3 Nomad servers in our dev env, did a manual recovery via peers.json, and found my entire dev env down. I was also under the impression that a server outage shouldn't hurt clients, but it looks like it did. Furthermore, the recovery required us to restart the impacted jobs.

We also have the issue where zombie service allocations are still registered.

This all happened on 1.5.3.

@michael-strigo

michael-strigo commented May 8, 2023

@schmichael any chance you folks can clarify what the guarantees are around scenarios involving Nomad server downtime with regard to Nomad variables & templates?
As far as I can tell, there's no way to configure template{} for stale reads from Nomad.
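
One thing that may help in the meantime, offered as an assumption rather than a confirmed answer: the client agent's template block supports retry tuning (the consul_retry/vault_retry pattern), and recent Nomad versions appear to add an equivalent nomad_retry block for reads against the Nomad API, so the template runner keeps retrying through a server outage instead of hitting "exceeded maximum retries" and killing the task. Please verify this against the agent configuration docs for your version before relying on it:

client {
	template {
		# Assumed stanza (verify for your Nomad version): retry settings for
		# template reads against the Nomad API (nomadVar / nomadService).
		nomad_retry {
			attempts    = 0       # assumption: 0 means unlimited retries, as with the other retry blocks
			backoff     = "250ms"
			max_backoff = "1m"
		}
	}
}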

@michael-strigo

I made this small script to clean up. It might be useful for anyone who is stuck with a bunch of zombie services:

#!/bin/sh
set -e

# List the names of all registered Nomad services.
services=$(nomad service list -t '{{ range (index . 0).Services }}{{printf "%s\n" .ServiceName }}{{ end }}')

for svc in $services; do
	echo "checking $svc:"
	# Print one "<alloc ID>%<service ID>" pair per registration of this service.
	data=$(nomad service info -t '{{ range . }}{{ printf "%s" .AllocID }}%{{ printf "%s\n" .ID }}{{ end }}' "$svc" | uniq)

	for d in $data; do
		alloc=$(echo "$d" | cut -d'%' -f1)
		svc_id=$(echo "$d" | cut -d'%' -f2)
		echo "    checking $alloc ($svc_id)"
		# If the allocation no longer exists, the registration is a zombie: delete it.
		if ! nomad alloc status "$alloc" > /dev/null 2>&1; then
			echo "    !! removing $svc_id"
			nomad service delete "$svc" "$svc_id" > /dev/null 2>&1
		fi
	done
	echo
done

@tgross
Member

tgross commented May 15, 2023

Although this is with Nomad services and not Consul, based on internal discussions this is most likely a duplicate of #17079, which @shoenig is already working on.

@mr-karan
Contributor

@michael-strigo Thanks a lot for this. I made slight modifications to loop over all namespaces and also ignore namespaces where no service registrations are found.

Here's the modified version:

#!/bin/bash
set -e

namespaces=$(nomad namespace list -json | jq -r '.[].Name')

for ns in $namespaces; do
    echo "checking namespace $ns:"
    services=$(nomad service list -namespace="$ns" -t '{{ range (index . 0).Services }}{{printf "%s\n" .ServiceName }}{{ end }}')

    if [ "$services" = "No service registrations found" ]; then
        echo "no services found for namespace $ns"
        continue
    fi

    for svc in $services; do
        echo "  checking $svc:"
        data=$(nomad service info -namespace="$ns" -t '{{ range . }}{{ printf "%s" .AllocID }}%{{ printf "%s\n" .ID }}{{ end }}' "$svc" | uniq)

        for d in $data; do
            alloc=$(echo "$d" | cut -d'%' -f1)
            svc_id=$(echo "$d" | cut -d'%' -f2)
            # echo "      checking $alloc ($svc_id)"
            if ! nomad alloc status -namespace="$ns" "$alloc" > /dev/null 2>&1; then
                echo "      !! removing $svc_id"
                nomad service delete -namespace="$ns" "$svc" "$svc_id" > /dev/null 2>&1
            fi
        done
        echo
    done
done
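
A usage note for the script above: it assumes jq is installed and that the CLI can reach the cluster via the usual NOMAD_ADDR / NOMAD_TOKEN environment variables; the filename below is just an example.

$ chmod +x cleanup-zombie-services.sh
$ ./cleanup-zombie-services.sh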

@tgross
Member

tgross commented May 18, 2023

@shoenig I'm going to assign this to you just so we don't lose track of it after we close #17079, or if you want we can just close it as a duplicate.

@aroundthfur
Copy link

This issue seems to still be present in 1.6.1. Is there any information we can provide to help debug and solve the issue?

@tgross
Member

tgross commented May 14, 2024

I'm going to assign myself this issue, considering it a duplicate of #16616. Please see my comment #16616 (comment) if anyone has anything additional to share.

Closing as duplicate.

@tgross assigned tgross and unassigned shoenig on May 14, 2024
@tgross closed this as not planned (won't fix, can't repro, duplicate, stale) on May 14, 2024
Nomad - Community Issues Triage automation moved this from In Progress to Done on May 14, 2024