
mismatch desired state of tasks #1414

Closed
camerondavison opened this issue Jul 12, 2016 · 4 comments

@camerondavison
Contributor

Nomad version

$ nomad version
Nomad v0.4.0

Operating system and Environment details

N/A, but running in Vagrant

Issue

Failed allocations with a desired state of run keep that desired state when the job is resubmitted.

Reproduction steps

Dockerfile

FROM alpine
RUN apk add --update bash && rm -rf /var/cache/apk/*
ADD https://gist.githubusercontent.com/a86c6f7964/045da29e2cc5a59949361aab051eb805/raw/4e7b8d762302d33fc305760ab49553998b762db7/echoer.bash /bin/echoer.bash
ENTRYPOINT ["bash","/bin/echoer.bash"]
docker build -t example:docker .

Job

job "echoer-docker" {
  datacenters = ["dc1"]
  group "group" {
    task "echoer" {
      driver = "docker"
      config {
        image = "example:docker"
      }
      resources {
        cpu = 20
        memory = 30
      }
    }
  }
}

Start the job. Then kill the container and remove its image:

docker rm -f $(docker ps -a -q) && docker rmi $(docker images -a -q)

This means that the task will transition into the failed state. While the job is in the failed state, nomad status echoer-docker shows the desired state as run and the status as failed, as expected.
Wait until the job is dead, then run the docker build again to get the image back.
After this, resubmit the job to Nomad.
If you run nomad status echoer-docker at this point, you see:

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
9c8c92d7  acdfe822  c926bedf  group       run      running
1ab36d87  5cd91b1b  c926bedf  group       run      failed

I would expect that at least one of these tasks should have the desired state of stop since the job definition states only 1 should be running at a time.

This is making it difficult for me to monitor Nomad for partially running jobs, where 1 of the tasks in a 2-task job is running but the other has been marked as failed and is no longer running. Currently I have been looking for tasks with a desired state of run that are not running.
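
As an illustration of that kind of check, here is a minimal Go sketch against the Nomad HTTP API (/v1/job/:job_id/allocations and its DesiredStatus/ClientStatus fields); the agent address and job ID are assumptions, and error handling is trimmed:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fields we care about from /v1/job/:job_id/allocations.
type allocStub struct {
	ID            string
	TaskGroup     string
	DesiredStatus string
	ClientStatus  string
}

func main() {
	// Assumes a local Nomad agent on the default address.
	resp, err := http.Get("http://127.0.0.1:4646/v1/job/echoer-docker/allocations")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var allocs []allocStub
	if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
		panic(err)
	}

	// Flag allocations that are still "desired = run" but not actually running.
	for _, a := range allocs {
		if a.DesiredStatus == "run" && a.ClientStatus != "running" {
			fmt.Printf("alert: alloc %s (%s) desired=run but status=%s\n",
				a.ID, a.TaskGroup, a.ClientStatus)
		}
	}
}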

@diptanu
Contributor

diptanu commented Jul 15, 2016

The scheduler internally computes whether an allocation is in terminal state or not by looking at both the desired state and client state. In the above case the allocation 1ab36d87 would be perceived by the scheduler as terminal since the client status is in failed.
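
As an illustration of the logic described above, a minimal sketch (not Nomad's actual source; the status strings are the documented desired and client status values):

package main

import "fmt"

// terminal reports whether an allocation is in a terminal state by combining
// the desired status and the client status, as described above. This is a
// sketch of the logic, not Nomad's implementation.
func terminal(desiredStatus, clientStatus string) bool {
	switch desiredStatus {
	case "stop", "evict":
		return true
	}
	switch clientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

func main() {
	// desired=run, client=failed: treated as terminal, which is why an
	// allocation like 1ab36d87 is not rescheduled when the job is resubmitted.
	fmt.Println(terminal("run", "failed")) // true
}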

This is the current state and we are discussing internally if we can get rid of one of the states but it's not super high priority.

Regarding the problem you are trying to solve: very soon you should be able to summarize the state of a job via a new API, and you won't have to look at all the allocations to determine how many are running. Please take a look at #1340; hoping that would work for you?

@camerondavison
Contributor Author

I am not sure that #1340 would really help me. I am basically looking for things that are supposed to have 2 instances running but instead only have 1. I would not want to alert on anything in a failed state, since that is somewhat expected. I really only want to alert on things that are supposed to be running but are in (as you said) a "terminal" state.

The scheduler internally computes whether an allocation is in terminal state or not by looking at both the desired state and client state. In the above case the allocation 1ab36d87 would be perceived by the scheduler as terminal since the client status is in failed.

All this ticket is about is transitioning the task to some other state when the job is re-submitted, instead of waiting for it to be GC'd. If I run force/gc on Nomad, this allocation does get removed because of the "terminal" state it is in.

Ah, which got me thinking: what I am doing is actually not safe at all. If I change the count to 2, remove the image with docker rmi -f example:docker (by name instead of image ID), and then run /v1/system/gc in the middle of the above example, I end up with:

$ nomad status echoer
ID          = echoer-docker
Name        = echoer-docker
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
ed3652cb  9b9f2719  688fdcf4  group       run      running

Without prior knowledge that the group is actually supposed to have count 2, this looks like everything is happy.
Clearly it is not, though:

$ nomad plan echoer-docker.nomad
Job: "echoer-docker"
Task Group: "group" (1 create, 1 in-place update)
  Task: "echoer"

Maybe my original thought about the Desired state needing to be changed was wrong, but I feel like it would be nice if some state somewhere were changed to signal that this job is not running fully as it should be.
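
As an illustration of the kind of signal being asked for, here is a minimal Go sketch that compares each task group's declared count against its running allocations, using the /v1/job/:job_id and /v1/job/:job_id/allocations endpoints (agent address, job ID, and field names are assumptions based on the Nomad HTTP API; error handling trimmed):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type taskGroup struct {
	Name  string
	Count int
}

type jobSpec struct {
	TaskGroups []taskGroup
}

type allocStub struct {
	TaskGroup    string
	ClientStatus string
}

// getJSON fetches a URL and decodes the JSON response into out.
func getJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://127.0.0.1:4646" // assumes a local Nomad agent

	var job jobSpec
	if err := getJSON(base+"/v1/job/echoer-docker", &job); err != nil {
		panic(err)
	}

	var allocs []allocStub
	if err := getJSON(base+"/v1/job/echoer-docker/allocations", &allocs); err != nil {
		panic(err)
	}

	// Count running allocations per task group.
	running := map[string]int{}
	for _, a := range allocs {
		if a.ClientStatus == "running" {
			running[a.TaskGroup]++
		}
	}

	// Alert when a group has fewer running allocations than its declared count.
	for _, tg := range job.TaskGroups {
		if running[tg.Name] < tg.Count {
			fmt.Printf("alert: group %q has %d/%d running\n",
				tg.Name, running[tg.Name], tg.Count)
		}
	}
}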

@mikenomitch
Contributor

I don't think we'll change the meaning of desired, but I do think the root ask of this Issue is good.

I'm going to close this in favor of #13053 since I think the new Issue has a bit less noise.

mikenomitch closed this as not planned on May 18, 2022
@github-actions

github-actions bot commented Oct 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this as resolved and limited conversation to collaborators on Oct 7, 2022