Service registry not being updated for alloc restarts #13802

Closed
mr-karan opened this issue Jul 18, 2022 · 3 comments
Labels: stage/accepted, theme/service-discovery/nomad, type/bug

@mr-karan
Contributor

Nomad version

Output from nomad version

Nomad v1.3.2 (bf602974112964e9691729f3f0716ff2bcdb3b44)

Operating system and Environment details

Linux pop-os 5.17.15-76051715-generic #202206141358~1655919116~22.04~1db9e34 SMP PREEMPT Wed Jun 22 19 x86_64 x86_64 x86_64 GNU/Linux

Issue

The Nomad service registry does not re-register a service after an "in-place" allocation restart.

Reproduction steps

nomad run redis.nomad

Verify that all allocations are running:

nomad job status redis
---
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
092b31c1  ae8b6482  cache       0        run      running  6s ago   5s ago
b3d63736  ae8b6482  cache       0        run      running  6s ago   5s ago
c799a433  ae8b6482  cache       0        run      running  6s ago   5s ago

View the status of services:

curl -s http://localhost:4646/v1/service/redis-cache | jq -r '.[].ID'

_nomad-task-092b31c1-28b3-238b-e6b3-103181bc01fe-group-cache-redis-cache-db
_nomad-task-b3d63736-97c4-06c2-ab48-6b26945410a4-group-cache-redis-cache-db
_nomad-task-c799a433-bba8-66f7-94c7-592deba9c056-group-cache-redis-cache-db

Restart an alloc:

nomad alloc restart 092b31c1

View the status of services again (only 2 are listed, instead of the expected 3):

curl -s http://localhost:4646/v1/service/redis-cache | jq -r '.[].ID'

_nomad-task-b3d63736-97c4-06c2-ab48-6b26945410a4-group-cache-redis-cache-db
_nomad-task-c799a433-bba8-66f7-94c7-592deba9c056-group-cache-redis-cache-db

Expected Result

All 3 service registrations should be present, even if the alloc does an "in-place" restart.

Actual Result

Only 2 service registrations are present; the service for the restarted alloc is missing.
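
The manual comparison in the reproduction steps can be scripted. The following is a minimal sketch, not part of the original report: it assumes service IDs always follow the `_nomad-task-<alloc ID>-group-...` shape seen in the output above (inferred from this issue, not from documented behavior), and the function names are hypothetical. It takes the short alloc IDs from `nomad job status` and the service IDs from `/v1/service/redis-cache`, and reports allocs with no registration.

```python
import re

# Pattern inferred from the service IDs shown in this issue; treat it as an
# assumption, not a documented format guarantee.
_SERVICE_ID_RE = re.compile(
    r"^_nomad-task-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-"
)


def alloc_id_from_service_id(service_id):
    """Return the full alloc ID embedded in a service ID, or None if absent."""
    match = _SERVICE_ID_RE.match(service_id)
    return match.group(1) if match else None


def allocs_missing_registration(running_alloc_prefixes, service_ids):
    """Return the running alloc ID prefixes that have no service registration."""
    registered = {alloc_id_from_service_id(s) for s in service_ids}
    registered.discard(None)
    return [
        prefix
        for prefix in running_alloc_prefixes
        if not any(alloc_id.startswith(prefix) for alloc_id in registered)
    ]
```

Fed the three running allocs and the two service IDs listed after the restart, this should return only `092b31c1`.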

Job file (if appropriate)

job "redis" {
  datacenters = ["dc1"]

  type = "service"

  group "cache" {
    count = 3

    network {
      port "db" {
        to = 6379
      }
    }

    service {
      provider = "nomad"
      name     = "redis-cache"
      tags = [
        "external-dns/hostname=redis.test.internal",
        "external-dns/ttl=30s",
      ]
      port = "db"

    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"

        ports = ["db"]
      }
    }
  }
}

NOTE: I was listening to Nomad Events when the alloc was restarted and found this:

{"Index":173,"Events":[{"Topic":"Service","Type":"ServiceDeregistration","Key":"_nomad-task-092b31c1-28b3-238b-e6b3-103181bc01fe-group-cache-redis-cache-db","Namespace":"default","FilterKeys":["redis","redis-cache"],"Index":173,"Payload":{"Service":{"ID":"_nomad-task-092b31c1-28b3-238b-e6b3-103181bc01fe-group-cache-redis-cache-db","ServiceName":"redis-cache","Namespace":"default","NodeID":"ae8b6482-badf-eb7c-f451-6777f31d700c","Datacenter":"dc1","JobID":"redis","AllocID":"092b31c1-28b3-238b-e6b3-103181bc01fe","Tags":["external-dns/hostname=redis.test.internal","external-dns/ttl=30s"],"Address":"127.0.0.1","Port":29986,"CreateIndex":160,"ModifyIndex":160}}}]}

So a ServiceDeregistration event is emitted, but no ServiceRegistration follows when the alloc restarts. This bug makes Nomad Services unreliable to depend on whenever an alloc restarts.
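
For anyone who wants to watch for this automatically, here is a rough sketch (my own illustration, not Nomad code) that scans event stream frames like the one above and reports services that were deregistered without a later registration. It assumes each frame is one JSON object per line with an `Events` list, as in the captured output, and that the registration event type is `ServiceRegistration`, as named above. The function name and the trimmed `SAMPLE_FRAME` are hypothetical.

```python
import json

# Trimmed copy of the frame captured in this issue (Payload omitted for brevity).
SAMPLE_FRAME = json.dumps({
    "Index": 173,
    "Events": [{
        "Topic": "Service",
        "Type": "ServiceDeregistration",
        "Key": "_nomad-task-092b31c1-28b3-238b-e6b3-103181bc01fe-group-cache-redis-cache-db",
        "Namespace": "default",
        "Index": 173,
    }],
})


def unmatched_deregistrations(frames):
    """Map each service ID to the index of a deregistration with no later registration."""
    pending = {}
    for line in frames:
        frame = json.loads(line)
        # Frames without an "Events" list are skipped.
        for event in frame.get("Events") or []:
            if event.get("Topic") != "Service":
                continue
            key = event.get("Key")
            if event.get("Type") == "ServiceDeregistration":
                pending[key] = event.get("Index")
            elif event.get("Type") == "ServiceRegistration":
                pending.pop(key, None)
    return pending


if __name__ == "__main__":
    for service_id, index in unmatched_deregistrations([SAMPLE_FRAME]).items():
        print(f"no re-registration seen for {service_id} (deregistered at index {index})")
```

An entry that stays in the result after the alloc is back in the running state would match the behavior reported here. A correct restart should produce a ServiceRegistration for the same key, which clears the entry.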

@shoenig shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 18, 2022
@mikenomitch mikenomitch added this to the 1.4.0 milestone Jul 18, 2022
@mikenomitch
Contributor

Hey @mr-karan, thanks for the report. This seems like something important to correct. We'll take a look soon on the engineering side and hopefully get the fix into a 1.3.X release if not 1.4.

@shoenig shoenig added theme/service-discovery/nomad stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jul 18, 2022
@tgross tgross removed this from Needs Triage in Nomad - Community Issues Triage Jul 25, 2022
@shoenig
Member

shoenig commented Sep 26, 2022

Pretty sure this was fixed between #14127 / #14009 which addressed some pretty old / bad logic around allocation restarts.

@shoenig shoenig closed this as completed Sep 26, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2023