
Nomad running more services than desired task count #3198

Closed · Fixed by #3206

ummecasino opened this issue Sep 12, 2017 · 5 comments
ummecasino commented Sep 12, 2017

Nomad version: 0.6.0

I'm not sure whether this is an actual issue or just a question that needs clarification: we have recurring problems where the number of actually deployed service instances differs from the desired count in the job description.

The following service is actually running with 2 instances while the desired count is 1. I'll try to give all the information I can gather; the job file is attached.
job.txt

nomad deployment status 234255c4
ID          = 234255c4
Job ID      = transform-rueckmeldung
Job Version = 2
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
default     1        3       0        0
nomad status transform-rueckmeldung
ID            = transform-rueckmeldung
Name          = transform-rueckmeldung
Submit Date   = 09/07/17 15:54:56 CEST
Type          = service
Priority      = 50
Datacenters   = integration
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
default     0       0         2        14      40        1

Latest Deployment
ID          = 234255c4
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
default     1        3       0        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
ee5f51f8  4232dd6b  default     2        run      running  09/08/17 11:30:27 CEST
16a4b0d7  4232dd6b  default     2        run      running  09/08/17 11:30:27 CEST
nomad alloc-status ee5f51f8
ID                  = ee5f51f8
Eval ID             = e954dd76
Name                = transform-rueckmeldung.default[0]
Node ID             = 4232dd6b
Job ID              = transform-rueckmeldung
Job Version         = 2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/08/17 11:30:27 CEST
Deployment ID       = 234255c4
Deployment Health   = unset

Task "transform-rueckmeldung" is "running"
Task Resources
CPU        Memory           Disk     IOPS  Addresses
2/300 MHz  5.2 MiB/256 MiB  300 MiB  0     https: 10.32.108.38:31958

Task Events:
Started At     = 09/08/17 09:30:33 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
09/08/17 11:30:33 CEST  Started     Task started by client
09/08/17 11:30:27 CEST  Task Setup  Building Task Directory
09/08/17 11:30:27 CEST  Received    Task received by client
nomad eval-status e954dd76
ID                 = e954dd76
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = node-update
Node ID            = 4232dd6b-87a3-f56a-ac00-dda7a514828e
Priority           = 50
Placement Failures = false
nomad alloc-status 16a4b0d7
ID                  = 16a4b0d7
Eval ID             = 334b0354
Name                = transform-rueckmeldung.default[0]
Node ID             = 4232dd6b
Job ID              = transform-rueckmeldung
Job Version         = 2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/08/17 11:30:27 CEST
Deployment ID       = 234255c4
Deployment Health   = unset

Task "transform-rueckmeldung" is "running"
Task Resources
CPU        Memory           Disk     IOPS  Addresses
2/300 MHz  5.2 MiB/256 MiB  300 MiB  0     https: 10.32.108.38:31139

Task Events:
Started At     = 09/08/17 09:30:37 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
09/08/17 11:30:37 CEST  Started     Task started by client
09/08/17 11:30:27 CEST  Task Setup  Building Task Directory
09/08/17 11:30:27 CEST  Received    Task received by client
nomad eval-status 334b0354
ID                 = 334b0354
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = node-update
Node ID            = <none>
Priority           = 50
Placement Failures = false
dadgar (Contributor) commented Sep 12, 2017

@ummecasino Can you grab the output of curl http://127.0.0.1:4646/v1/job/<job>/evaluations?pretty=true and /v1/job/<job>/allocations?pretty=true
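For reference, those two requests can be captured like this (a minimal sketch assuming the default agent address 127.0.0.1:4646 and the job ID from the output above; adjust as needed):

# Save the evaluation and allocation JSON for the job; quote the URLs so the
# shell does not interpret the '?' in the query string.
curl -s "http://127.0.0.1:4646/v1/job/transform-rueckmeldung/evaluations?pretty=true" > evaluations.json
curl -s "http://127.0.0.1:4646/v1/job/transform-rueckmeldung/allocations?pretty=true" > allocations.json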

shantanugadgil (Contributor) commented Sep 12, 2017

I haven't gone through the details of the reported issue, but sometimes I see two Docker containers running instead of one (for any of my services). This happens when Docker is upgraded as part of a full system upgrade (yum -y update).

When I reboot my compute machine, the number of containers is back to the expected count.

Shantanu

ummecasino (Author) commented
@dadgar Sorry, I had to redeploy the service because it's in our QA environment; the evaluations and allocations for the deployment in question have already been garbage collected. (By the way, is there something like a best practice for archiving allocations for later analysis?)

@shantanugadgil I don't think that's what caused the problem; we had no updates etc. in the meantime.
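One rough way to keep that data around before it is garbage collected is to periodically snapshot the job's allocation and evaluation endpoints to timestamped files. This is only an illustrative sketch; the output directory, address variable, and running it from cron are assumptions, not an official Nomad feature:

#!/bin/sh
# Hypothetical archival script, e.g. run from cron every few minutes,
# so the allocation/evaluation data survives Nomad's garbage collection.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"   # assumed default agent address
JOB="transform-rueckmeldung"
OUT_DIR="/var/tmp/nomad-archive"                     # hypothetical location
STAMP="$(date +%Y%m%dT%H%M%S)"

mkdir -p "$OUT_DIR"
curl -s "$NOMAD_ADDR/v1/job/$JOB/allocations?pretty=true" > "$OUT_DIR/$JOB-allocations-$STAMP.json"
curl -s "$NOMAD_ADDR/v1/job/$JOB/evaluations?pretty=true" > "$OUT_DIR/$JOB-evaluations-$STAMP.json"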

dadgar (Contributor) commented Sep 13, 2017

@ummecasino No worries, I am fairly confident that I have the fix for what you hit, based on a report from another user. What makes me confident is that the two allocs were created with identical timestamps and that they come from separate evaluations.

I would grab the relevant allocations and evals using the commands I showed. Exactly what is needed to debug differs per issue, but those plus server/client logs at debug level are the best bet.
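For the logs part, a minimal sketch of running an agent with debug-level logging (the config path here is an assumption; the same effect can be achieved with log_level = "DEBUG" in the agent configuration file):

# Start the agent with verbose logging so scheduler and client activity is captured.
nomad agent -config /etc/nomad.d -log-level=DEBUG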

dadgar added a commit that referenced this issue Sep 13, 2017
This PR fixes a scheduling race condition in which the plan results from
one invocation of the scheduler were not being considered by the next
since the Worker was not waiting for the correct index.

Fixes #3198
dadgar added three further commits that referenced this issue Sep 14, 2017, all carrying the same message as above.
github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022