
Services registered with Nomad Service Provider stayed forever after a major server outage #16762

Closed
aofei opened this issue Apr 3, 2023 · 9 comments

Comments

@aofei
Contributor

aofei commented Apr 3, 2023

Nomad version

Nomad v1.5.2
BuildDate 2023-03-21T22:54:38Z
Revision 9a2fdb5

Operating system and Environment details

$ hostnamectl
 Static hostname: dev-machine
       Icon name: computer-vm
         Chassis: vm
      Machine ID: ebe933b74fe25215fc0f3b0be782a241
         Boot ID: 1051fac2caf742699078f5e6c97161b9
  Virtualization: kvm
Operating System: Ubuntu 22.04.2 LTS
          Kernel: Linux 5.15.0-1030-gcp
    Architecture: x86-64
 Hardware Vendor: Google
  Hardware Model: Google Compute Engine

Issue

This bug was actually discovered together with #16760, but I believe they don't share the same root cause, so I opened this issue separately.

So the thing is, for some reason, all three of my Nomad servers went down. And I didn't get them back online in time because I was trying to figure out what was causing them to go down. I thought my running jobs wouldn't be affected during this time. But, unexpectedly, all my jobs involving the nomadVar or nomadService template functions also went down (other jobs kept running).

Anyway, after I brought the servers back online, my jobs that unexpectedly went down are running again. But another thing happened: the services registered by the jobs that went down unexpectedly before were not removed.

Reproduction steps

  1. Copy and run the following shell script:
#!/bin/sh

NOMAD_ISSUE_DIR=/tmp/nomad-issue-16762
mkdir $NOMAD_ISSUE_DIR

cat << EOF > $NOMAD_ISSUE_DIR/server.hcl
name = "server"
data_dir = "$NOMAD_ISSUE_DIR"
server {
	enabled = true
	bootstrap_expect = 1
}
EOF

cat << EOF > $NOMAD_ISSUE_DIR/client.hcl
name = "client"
data_dir = "$NOMAD_ISSUE_DIR"
ports {
	http = 14646
}
client {
	enabled = true
	server_join {
		retry_join = ["127.0.0.1"]
	}
}
EOF

cat << EOF > $NOMAD_ISSUE_DIR/foobar.nomad.hcl
job "foobar" {
	group "foobar" {
		network {
			port "foobar" {}
		}
		task "foobar" {
			driver = "docker"
			config {
				image = "alpine"
				entrypoint = ["tail"]
				args = ["-f", "/dev/null"]
			}
			service {
				name = "foobar"
				port = "foobar"
				provider = "nomad"
			}
			template {
				destination = "local/foobar"
				data = <<TEMPLATE_DATA_EOF
{{with nomadService "foobar"}}{{end}}
TEMPLATE_DATA_EOF
			}
		}
	}
}
EOF
  2. Open terminal tab 1 and execute:
$ nomad agent -config /tmp/nomad-issue-16762/server.hcl &> /tmp/nomad-issue-16762/server.log
  3. Open terminal tab 2 and execute:
$ nomad agent -config /tmp/nomad-issue-16762/client.hcl &> /tmp/nomad-issue-16762/client.log
  4. Open terminal tab 3 and execute:
$ tail -f /tmp/nomad-issue-16762/client.log | grep -e "agent: (view) nomad.var.block(" -e "agent: (view) nomad.service(" -e "(retry" -e "(exceeded"
  5. Open terminal tab 4 and execute:
$ nomad run /tmp/nomad-issue-16762/foobar.nomad.hcl
  6. If everything is fine, do docker container ls and you should get 1 running container: foobar-{allocId}. Do nomad service list and you should get a service named "foobar". Do nomad service info foobar and you should only get one entry.

  7. Now go to terminal tab 1 and stop the server (Ctrl+C).

  8. Now go to terminal tab 3 and wait a few minutes until the "exceeded maximum retries" log line appears.

  9. At this point, do docker container ls and you should not get any running containers.

  10. Now go to terminal tab 1 and start the server again:

$ nomad agent -config /tmp/nomad-issue-16762/server.hcl &> /tmp/nomad-issue-16762/server.log
  11. At this point, do docker container ls and you should get 1 running container: foobar-{allocId} (if not, wait a few seconds). Do nomad service list and you should get a service named "foobar". Do nomad service info foobar and you should get two entries; the one you saw in step 6 should still be there.
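
For reference, once the duplicate entry from step 11 has been identified, it can also be removed by hand. A minimal sketch (<stale-service-id> is a placeholder for the ID of the registration whose allocation no longer exists):

$ # print "<alloc ID> <service ID>" for every registration of "foobar"
$ nomad service info -t '{{ range . }}{{ printf "%s %s\n" .AllocID .ID }}{{ end }}' foobar
$ # delete the stale registration by service name and ID
$ nomad service delete foobar <stale-service-id>
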
@plasmastorm

I think this is the same as #15630. It was only a single-node test cluster in my case, but I could see how all server nodes being down would lead to the same situation.

@michael-strigo

michael-strigo commented May 8, 2023

I just had a very similar scenario. I brought down two of the 3 Nomad servers in our dev env, did a manual recovery via peers.json, and found my entire dev env down. I was also under the impression that a server outage shouldn't hurt clients, but it looks like it did. Furthermore, the recovery required us to restart the impacted jobs.

We also have the issue where zombie service allocations are still registered.

This all happened on 1.5.3.

@michael-strigo

michael-strigo commented May 8, 2023

@schmichael any chance you folks can clarify what the guarantees are around scenarios involving Nomad server downtime with regard to Nomad variables & templates?
As far as I can tell, there's no way to configure template{} for stale reads from Nomad.
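
One thing that may help in the meantime, offered as an assumption rather than a confirmed answer: the client agent's template block supports retry tuning (the consul_retry/vault_retry pattern), and recent Nomad versions appear to add an equivalent nomad_retry block for reads against the Nomad API, so the template runner keeps retrying through a server outage instead of hitting "exceeded maximum retries" and killing the task. Please verify this against the agent configuration docs for your version before relying on it:

client {
	template {
		# Assumed stanza (verify for your Nomad version): retry settings for
		# template reads against the Nomad API (nomadVar / nomadService).
		nomad_retry {
			attempts    = 0       # assumption: 0 means unlimited retries, as with the other retry blocks
			backoff     = "250ms"
			max_backoff = "1m"
		}
	}
}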

@michael-strigo

I made this small script to clean up. It might be useful for anyone who is stuck with a bunch of zombie services:

#!/bin/sh
set -e

# List the names of all registered Nomad services.
services=$(nomad service list -t '{{ range (index . 0).Services }}{{printf "%s\n" .ServiceName }}{{ end }}')

for svc in $services; do
	echo "checking $svc:"
	# Print one "<alloc ID>%<service ID>" pair per registration of this service.
	data=$(nomad service info -t '{{ range . }}{{ printf "%s" .AllocID }}%{{ printf "%s\n" .ID }}{{ end }}' "$svc" | uniq)

	for d in $data; do
		alloc=$(echo "$d" | cut -d'%' -f1)
		svc_id=$(echo "$d" | cut -d'%' -f2)
		echo "    checking $alloc ($svc_id)"
		# If the allocation no longer exists, the registration is a zombie: delete it.
		if ! nomad alloc status "$alloc" > /dev/null 2>&1; then
			echo "    !! removing $svc_id"
			nomad service delete "$svc" "$svc_id" > /dev/null 2>&1
		fi
	done
	echo
done

@tgross
Member

tgross commented May 15, 2023

Although this is with Nomad services and not Consul, based on internal discussions this is most likely a duplicate of #17079, which @shoenig is already working on.

@mr-karan
Contributor

@michael-strigo Thanks a lot for this. I made slight modifications to loop over all namespaces and also ignore namespaces where no service registrations are found.

Here's the modified version:

#!/bin/bash
set -e

namespaces=$(nomad namespace list -json | jq -r '.[].Name')

for ns in $namespaces; do
    echo "checking namespace $ns:"
    services=$(nomad service list -namespace="$ns" -t '{{ range (index . 0).Services }}{{printf "%s\n" .ServiceName }}{{ end }}')

    if [ "$services" = "No service registrations found" ]; then
        echo "no services found for namespace $ns"
        continue
    fi

    for svc in $services; do
        echo "  checking $svc:"
        data=$(nomad service info -namespace="$ns" -t '{{ range . }}{{ printf "%s" .AllocID }}%{{ printf "%s\n" .ID }}{{ end }}' "$svc" | uniq)

        for d in $data; do
            alloc=$(echo "$d" | cut -d'%' -f1)
            svc_id=$(echo "$d" | cut -d'%' -f2)
            # echo "      checking $alloc ($svc_id)"
            if ! nomad alloc status -namespace="$ns" "$alloc" > /dev/null 2>&1; then
                echo "      !! removing $svc_id"
                nomad service delete -namespace="$ns" "$svc" "$svc_id" > /dev/null 2>&1
            fi
        done
        echo
    done
done
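
A usage note for the script above: it assumes jq is installed and that the CLI can reach the cluster via the usual NOMAD_ADDR / NOMAD_TOKEN environment variables; the filename below is just an example.

$ chmod +x cleanup-zombie-services.sh
$ ./cleanup-zombie-services.sh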

@tgross
Member

tgross commented May 18, 2023

@shoenig I'm going to assign this to you just so we don't lose track of it after we close #17079, or if you want we can just close it as a duplicate.

@aroundthfur
Copy link

This issue seems to still be present in 1.6.1. Is there any information we can provide to help debug and solve the issue?

@tgross
Member

tgross commented May 14, 2024

I'm going to assign myself this issue, considering it a duplicate of #16616. Please see my comment #16616 (comment) if anyone has anything additional to share.

Closing as duplicate.

@tgross assigned tgross and unassigned shoenig on May 14, 2024
@tgross closed this as not planned (won't fix, can't repro, duplicate, stale) on May 14, 2024
Nomad - Community Issues Triage automation moved this from In Progress to Done on May 14, 2024