Nomad services linger with invalid allocIDS #17182

Closed
SamMousa opened this issue May 15, 2023 · 4 comments

@SamMousa
Contributor

Nomad version

Output from nomad version

Nomad v1.5.5
BuildDate 2023-05-05T12:50:14Z
Revision 3d63bc62b35cbe3f79cdd245d50b61f130ee1a79

Operating system and Environment details

Running Ubuntu 22.04.2 LTS, 3 nodes.
This is not yet a full production cluster, mostly running support workloads where some downtime is acceptable.

Issue

We use Traefik and its Nomad service discovery for routing traffic. Sometimes we get a bad gateway for a service that, according to Nomad, is running just fine.
Digging into this, we tried purging the job from Nomad (intending to run it again once everything was cleaned up).
After purging the job we noticed the service in Traefik still persisted, so it was time to look a little deeper.

> nomad service list
Service Name              Tags
grafana                   [domains=grafana.xxx,traefik.enable=true,traefik.http.routers.grafana.rule=Host(`grafana.xxx`)]
grafana-unified-alerting  []

> nomad service info grafana
Job ID   Address             Tags                                                                                                               Node ID   Alloc ID
grafana  192.168.40.5:22881  [traefik.enable=true,traefik.http.routers.grafana.rule=Host(`grafana.xxx`),domains=grafana.xxx]  da3a4b3f  39c4a690

> nomad alloc status 39c4a690
No allocation(s) with prefix or id "39c4a690" found

So the situation, summarized as I understand it:

  • Nomad won't purge the grafana job. (Maybe because it thinks there are still allocations)
  • The service record points to an allocation that no longer exists
  • I've found no way to recover from this situation

Reproduction steps

Don't know

Nomad Server logs (if appropriate)

May 15 08:09:38 xxx nomad[1465]:     2023-05-15T08:09:38.128Z [ERROR] nomad.fsm: DeleteServiceRegistrationByID failed: error="service registration not found"


@SamMousa
Contributor Author

Possible duplicate of #16762, the script mentioned here: #16762 (comment) solved my issue.

Looking into how that works, I think a manual service delete via the CLI would have worked as well. It just isn't obvious that the service ID was not grafana. Would it not make sense to print the service ID as well when using nomad service list?

As my current issue is resolved, I propose that someone take a look at the first post and decide whether it has relevant information for fixing the underlying bug. If it doesn't, feel free to close this issue.
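
The manual cleanup mentioned here could be sketched roughly as follows. This is only an illustration and not the script from #16762: it assumes registrations shaped like the JSON returned by Nomad's /v1/service/<name> endpoint (with ID, ServiceName and AllocID fields) and a set of live allocation IDs collected separately. The helper names find_orphans and delete_commands are hypothetical, and the generated nomad service delete <name> <id> commands should be reviewed before running them.

```python
from typing import Iterable, List, Set


def find_orphans(registrations: Iterable[dict], live_alloc_ids: Set[str]) -> List[dict]:
    """Return registrations whose AllocID is not among the live allocations.

    A registration without an AllocID field is also treated as orphaned.
    """
    return [r for r in registrations if r.get("AllocID") not in live_alloc_ids]


def delete_commands(orphans: Iterable[dict]) -> List[str]:
    """Build one 'nomad service delete' command line per orphaned registration."""
    return [f"nomad service delete {r['ServiceName']} {r['ID']}" for r in orphans]


if __name__ == "__main__":
    # Made-up sample data for illustration only; real IDs are much longer.
    sample = [
        {"ID": "svc-id-1", "ServiceName": "grafana", "AllocID": "alloc-gone"},
        {"ID": "svc-id-2", "ServiceName": "grafana", "AllocID": "alloc-live"},
    ]
    for cmd in delete_commands(find_orphans(sample, {"alloc-live"})):
        print(cmd)
```

The sketch only prints commands rather than executing them, so nothing is deleted until an operator has checked the list.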

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 15, 2023
@tgross
Member

tgross commented May 15, 2023

Hi @SamMousa! I agree this most likely sounds like another case of #16762.

It just isn't obvious that the service ID was not grafana. Would it not make sense to print the service ID as well when using nomad service list?

The IDs for the services are very long (for example, one running on my machine right now is _nomad-task-d9f65cc9-c7cc-45b2-c0be-53d6fe84e62b-group-web-httpd-www-www), so that'd be challenging to present in the CLI UI in a way that's legible. But I suspect we could probably make nomad service delete smarter about being able to delete orphaned services without a specific ID.
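
As a small illustration of why the ID is so long: judging only from the single example above, the service ID appears to embed the full allocation UUID right after the _nomad-task- prefix. The sketch below extracts it with a regular expression. The pattern is an assumption inferred from that one example, not a documented format, and alloc_id_from_service_id is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed layout, inferred from the one example ID in this thread:
#   _nomad-task-<allocation UUID>-<remaining parts>
_ALLOC_IN_SERVICE_ID = re.compile(
    r"^_nomad-task-"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})-"
)


def alloc_id_from_service_id(service_id: str) -> Optional[str]:
    """Return the allocation UUID embedded in a service ID, or None if absent."""
    match = _ALLOC_IN_SERVICE_ID.match(service_id)
    return match.group(1) if match else None


if __name__ == "__main__":
    example = "_nomad-task-d9f65cc9-c7cc-45b2-c0be-53d6fe84e62b-group-web-httpd-www-www"
    print(alloc_id_from_service_id(example))
```

The first eight characters of the extracted UUID correspond to the short Alloc ID column that nomad service info prints.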

As far as this bug goes can you clarify this bit?:

Nomad won't purge the grafana job. (Maybe because it thinks there are still allocations)

Was the job still present (that is, visible via nomad job status grafana) or just the service (via nomad service list)?

@tgross tgross self-assigned this May 15, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage May 15, 2023
@SamMousa
Contributor Author

After purging the job, both the job and the service were still visible. The UI showed no allocations for the job, but the job's services page showed a service pointing at a nonexistent allocation.

@tgross
Member

tgross commented May 17, 2023

Ok, I'm going to close this as a duplicate of #17079 so that we can centralize our efforts around that. We've got a release coming out very soon with the patch.

@tgross tgross closed this as completed May 17, 2023
Nomad - Community Issues Triage automation moved this from Triaging to Done May 17, 2023

2 participants