
Nomad Service Discovery - Dead/old services still listed after dev reboot #15630

Closed
plasmastorm opened this issue Jan 1, 2023 · 3 comments

Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/duplicate, theme/service-discovery/nomad, type/bug

Comments

@plasmastorm

Nomad version

1.4.3

Operating system and Environment details

Debian Bullseye - single dev server running both server and client modes for testing

Issue

I'm using Nomad service discovery to connect an app container to database and redis services.

Services from older allocations appear to be presented as live after a series of server reboots. Note that only one of these services actually has an active allocation:

[screenshot attached in original issue]

No distinction is made in the service info output, so all of the entries are passed to the template:

$ nomad service info postgres
Job ID  Address           Tags  Node ID   Alloc ID
netbox  10.10.10.1:27536  []    0a3ab847  0902374c
netbox  10.10.10.1:25029  []    0a3ab847  5509e74e
netbox  10.10.10.1:29208  []    0a3ab847  600dc346
netbox  10.10.10.1:22442  []    0a3ab847  6bdbb8da
netbox  10.10.10.1:26311  []    0a3ab847  d388330d
netbox  10.10.10.1:27396  []    0a3ab847  f3260e33
netbox  10.10.10.1:29143  []    0a3ab847  f4b75da9

A template like this:

{{ range nomadService "postgres" -}}
DB_HOST={{ .Address }}
DB_PORT={{ .Port }}
{{ end -}}

produces the following output:
DB_HOST=10.10.10.1
DB_PORT=27536
DB_HOST=10.10.10.1
DB_PORT=25029
DB_HOST=10.10.10.1
DB_PORT=29208
DB_HOST=10.10.10.1
DB_PORT=22442
DB_HOST=10.10.10.1
DB_PORT=26311
DB_HOST=10.10.10.1
DB_PORT=27396
DB_HOST=10.10.10.1
DB_PORT=29143

Since these are rendered as environment variables, only one of them actually takes effect in the container.
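As a stopgap (this does not fix the stale registrations themselves), the template could render only the first entry rather than all of them. A minimal sketch using Go template's built-in index function, which should work with the slice that nomadService returns:

{{ with nomadService "postgres" -}}
{{ with index . 0 -}}
DB_HOST={{ .Address }}
DB_PORT={{ .Port }}
{{ end -}}
{{ end -}}

Note that with a stale registration list, index . 0 may still pick a dead entry; this only avoids the duplicate-variable problem.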

After running nomad job stop the live service is removed from the results but the rest remain:

$ nomad service info postgres
Job ID  Address           Tags  Node ID   Alloc ID
netbox  10.10.10.1:27536  []    0a3ab847  0902374c
netbox  10.10.10.1:25029  []    0a3ab847  5509e74e
netbox  10.10.10.1:29208  []    0a3ab847  600dc346
netbox  10.10.10.1:22442  []    0a3ab847  6bdbb8da
netbox  10.10.10.1:27396  []    0a3ab847  f3260e33
netbox  10.10.10.1:29143  []    0a3ab847  f4b75da9

I was unable to fully stop the job; I first needed to remove the stale registrations manually with nomad service delete ... before the job would report itself as stopped rather than pending.
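The manual cleanup described above could be partially scripted: compare the registrations reported by nomad service info -json against the allocations that are actually live, and delete the rest. A hedged sketch; the field names "ID" and "AllocID" are assumptions about the JSON output shape and should be verified against your Nomad version:

```python
import json


def stale_service_ids(info_json: str, live_alloc_ids: set) -> list:
    """Return registration IDs whose allocation is no longer live.

    info_json: output of `nomad service info -json <service>` (assumed
    to be a JSON array of registrations with "ID" and "AllocID" fields).
    live_alloc_ids: alloc IDs that are currently running, e.g. gathered
    from `nomad job allocs`.
    """
    entries = json.loads(info_json)
    return [e["ID"] for e in entries if e["AllocID"] not in live_alloc_ids]


# Hypothetical sample mimicking two registrations, only one live alloc.
sample = json.dumps([
    {"ID": "svc-0902374c", "AllocID": "0902374c"},
    {"ID": "svc-f4b75da9", "AllocID": "f4b75da9"},
])
print(stale_service_ids(sample, {"f4b75da9"}))  # → ['svc-0902374c']
```

Each returned ID could then be removed with nomad service delete postgres <id>, which is the manual step the job needed before it would stop cleanly.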

Reproduction steps

Set up a job utilising Nomad service discovery on a cluster with a single server, then reboot the machine

Expected Result

Only the currently live service is listed and used in the template

Actual Result

Older dead allocations are presented as live and used in the template

Job file (if appropriate)

Excerpt from task config:

...
        task "netbox-postgres" {
            driver = "docker"
            
            service {
                name     = "postgres"
                provider = "nomad"
                port     = "postgres"
                check {
                    type     = "tcp"
                    interval = "5s"
                    timeout  = "1s"
                }
            }
...
@tgross
Member

tgross commented Jan 3, 2023

Hi @plasmastorm! Can you clarify something here: you said this was a "dev server". But it wasn't running in -dev mode, right? Because that shouldn't have any persistent data between restarts, so that'd be a bigger problem if so.

But even assuming you're not running in -dev mode, this is a bug for sure and I'll mark this issue for roadmapping. Just a warning: running in "standalone" mode with a server + client on the same node is likely to uncover little problems like this around restarts. We've been discussing investing some time into making this "standalone" use case more solid to support.

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jan 3, 2023
@plasmastorm
Author

Hi @tgross, thanks for having a look at this. You're correct in your assumption that I'm not running with -dev; I guess I said "dev" because it's just me playing around with it and therefore not in production. It's set up as a systemd service, using both the server and client stanzas in the config file.

@tgross
Member

tgross commented May 14, 2024

I'm going to assign myself this but I believe this is a duplicate of #16616. If anyone has additional information, please first see my comment at #16616 (comment) and report there.

Closing as duplicate.

@tgross tgross closed this as not planned (duplicate) May 14, 2024