
Nomad running more services than desired task count #3198

Closed · Fixed by #3206

ummecasino opened this issue Sep 12, 2017 · 5 comments
ummecasino commented Sep 12, 2017

Nomad version: 0.6.0

I'm not sure whether this is an actual issue or just a question that needs clarification: we have recurring problems where the number of actually deployed service instances differs from the desired count in the job description.

The following service is actually running with 2 instances while the desired count is 1. I'll try to give all the information I can gather; the job file is attached.
job.txt

nomad deployment status 234255c4
ID          = 234255c4
Job ID      = transform-rueckmeldung
Job Version = 2
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
default     1        3       0        0
nomad status transform-rueckmeldung
ID            = transform-rueckmeldung
Name          = transform-rueckmeldung
Submit Date   = 09/07/17 15:54:56 CEST
Type          = service
Priority      = 50
Datacenters   = integration
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
default     0       0         2        14      40        1

Latest Deployment
ID          = 234255c4
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
default     1        3       0        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
ee5f51f8  4232dd6b  default     2        run      running  09/08/17 11:30:27 CEST
16a4b0d7  4232dd6b  default     2        run      running  09/08/17 11:30:27 CEST
nomad alloc-status ee5f51f8
ID                  = ee5f51f8
Eval ID             = e954dd76
Name                = transform-rueckmeldung.default[0]
Node ID             = 4232dd6b
Job ID              = transform-rueckmeldung
Job Version         = 2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/08/17 11:30:27 CEST
Deployment ID       = 234255c4
Deployment Health   = unset

Task "transform-rueckmeldung" is "running"
Task Resources
CPU        Memory           Disk     IOPS  Addresses
2/300 MHz  5.2 MiB/256 MiB  300 MiB  0     https: 10.32.108.38:31958

Task Events:
Started At     = 09/08/17 09:30:33 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
09/08/17 11:30:33 CEST  Started     Task started by client
09/08/17 11:30:27 CEST  Task Setup  Building Task Directory
09/08/17 11:30:27 CEST  Received    Task received by client
nomad eval-status e954dd76
ID                 = e954dd76
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = node-update
Node ID            = 4232dd6b-87a3-f56a-ac00-dda7a514828e
Priority           = 50
Placement Failures = false
nomad alloc-status 16a4b0d7
ID                  = 16a4b0d7
Eval ID             = 334b0354
Name                = transform-rueckmeldung.default[0]
Node ID             = 4232dd6b
Job ID              = transform-rueckmeldung
Job Version         = 2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/08/17 11:30:27 CEST
Deployment ID       = 234255c4
Deployment Health   = unset

Task "transform-rueckmeldung" is "running"
Task Resources
CPU        Memory           Disk     IOPS  Addresses
2/300 MHz  5.2 MiB/256 MiB  300 MiB  0     https: 10.32.108.38:31139

Task Events:
Started At     = 09/08/17 09:30:37 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
09/08/17 11:30:37 CEST  Started     Task started by client
09/08/17 11:30:27 CEST  Task Setup  Building Task Directory
09/08/17 11:30:27 CEST  Received    Task received by client
nomad eval-status 334b0354
ID                 = 334b0354
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = node-update
Node ID            = <none>
Priority           = 50
Placement Failures = false
dadgar (Contributor) commented Sep 12, 2017

@ummecasino Can you grab the output of curl http://127.0.0.1:4646/v1/job/<job>/evaluations?pretty=true and /v1/job/<job>/allocations?pretty=true
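For reference, those two requests can be captured like this (a minimal sketch assuming the default agent address 127.0.0.1:4646 and the job ID from the output above; adjust as needed):

# Save the evaluation and allocation JSON for the job; quote the URLs so the
# shell does not interpret the '?' in the query string.
curl -s "http://127.0.0.1:4646/v1/job/transform-rueckmeldung/evaluations?pretty=true" > evaluations.json
curl -s "http://127.0.0.1:4646/v1/job/transform-rueckmeldung/allocations?pretty=true" > allocations.json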

shantanugadgil (Contributor) commented Sep 12, 2017

I haven't gone through the details of the reported issue, but sometimes I see two Docker containers running instead of one (for any of my services). This happens when Docker is upgraded as part of a full system upgrade (yum -y update).

When I reboot my compute machine, the number of containers is back to the expected count.

Shantanu

ummecasino (Author) commented
@dadgar Sorry, I had to redeploy the service because it's in our QA environment; the evaluations and allocations for the deployment in question have already been garbage collected. (By the way, is there something like a best practice for archiving allocations for later analysis?)

@shantanugadgil I don't think that's what caused the problem; we had no updates etc. in the meantime.
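One rough way to keep that data around before it is garbage collected is to periodically snapshot the job's allocation and evaluation endpoints to timestamped files. This is only an illustrative sketch; the output directory, address variable, and running it from cron are assumptions, not an official Nomad feature:

#!/bin/sh
# Hypothetical archival script, e.g. run from cron every few minutes,
# so the allocation/evaluation data survives Nomad's garbage collection.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"   # assumed default agent address
JOB="transform-rueckmeldung"
OUT_DIR="/var/tmp/nomad-archive"                     # hypothetical location
STAMP="$(date +%Y%m%dT%H%M%S)"

mkdir -p "$OUT_DIR"
curl -s "$NOMAD_ADDR/v1/job/$JOB/allocations?pretty=true" > "$OUT_DIR/$JOB-allocations-$STAMP.json"
curl -s "$NOMAD_ADDR/v1/job/$JOB/evaluations?pretty=true" > "$OUT_DIR/$JOB-evaluations-$STAMP.json"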

dadgar (Contributor) commented Sep 13, 2017

@ummecasino No worries, I am fairly confident that I have the fix for what you hit, based on a report from another user. What makes me confident is that the two allocs were created with identical timestamps and that they come from separate evaluations.

I would grab the relevant allocations and evals using the commands I showed. Exactly what is needed to debug differs per issue, but those plus server/client logs at debug level are the best bet.
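For the logs part, a minimal sketch of running an agent with debug-level logging (the config path here is an assumption; the same effect can be achieved with log_level = "DEBUG" in the agent configuration file):

# Start the agent with verbose logging so scheduler and client activity is captured.
nomad agent -config /etc/nomad.d -log-level=DEBUG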

dadgar added a commit that referenced this issue Sep 13, 2017
This PR fixes a scheduling race condition in which the plan results from
one invocation of the scheduler were not being considered by the next
since the Worker was not waiting for the correct index.

Fixes #3198
dadgar added three further commits that referenced this issue Sep 14, 2017, all carrying the same message as above.
github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022