
mismatch desired state of tasks #1414

Closed
camerondavison opened this issue Jul 12, 2016 · 4 comments

@camerondavison
Contributor

Nomad version

$ nomad version
Nomad v0.4.0

Operating system and Environment details

N/A, but running in Vagrant

Issue

Failed allocations with a desired state of run keep that desired state when the job is resubmitted.

Reproduction steps

Dockerfile

FROM alpine
RUN apk add --update bash && rm -rf /var/cache/apk/*
ADD https://gist.githubusercontent.com/a86c6f7964/045da29e2cc5a59949361aab051eb805/raw/4e7b8d762302d33fc305760ab49553998b762db7/echoer.bash /bin/echoer.bash
ENTRYPOINT ["bash","/bin/echoer.bash"]
docker build -t example:docker .

Job

job "echoer-docker" {
  datacenters = ["dc1"]
  group "group" {
    task "echoer" {
      driver = "docker"
      config {
        image = "example:docker"
      }
      resources {
        cpu = 20
        memory = 30
      }
    }
  }
}

Start the job. Then kill the container and remove its image:

docker rm -f $(docker ps -a -q) && docker rmi $(docker images -a -q)

This means that the task will transition into the failed state. While the job is in the failed state, nomad status echoer-docker shows the desired state as run and the status as failed, as expected.
Wait until the job is dead, then run the docker build again to get the image back.
After this, resubmit the job to Nomad.
If you run nomad status echoer-docker at this point, you see:

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
9c8c92d7  acdfe822  c926bedf  group       run      running
1ab36d87  5cd91b1b  c926bedf  group       run      failed

I would expect that at least one of these tasks should have the desired state of stop since the job definition states only 1 should be running at a time.

This is making it difficult for me to monitor Nomad for partially running jobs, where 1 of the tasks in a 2-task job is running but the other has been marked as failed and is no longer running. Currently I have been looking for tasks with a desired state of run that are not running.
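
As an illustration of that kind of check, here is a minimal Go sketch against the Nomad HTTP API (/v1/job/:job_id/allocations and its DesiredStatus/ClientStatus fields); the agent address and job ID are assumptions, and error handling is trimmed:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fields we care about from /v1/job/:job_id/allocations.
type allocStub struct {
	ID            string
	TaskGroup     string
	DesiredStatus string
	ClientStatus  string
}

func main() {
	// Assumes a local Nomad agent on the default address.
	resp, err := http.Get("http://127.0.0.1:4646/v1/job/echoer-docker/allocations")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var allocs []allocStub
	if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
		panic(err)
	}

	// Flag allocations that are still "desired = run" but not actually running.
	for _, a := range allocs {
		if a.DesiredStatus == "run" && a.ClientStatus != "running" {
			fmt.Printf("alert: alloc %s (%s) desired=run but status=%s\n",
				a.ID, a.TaskGroup, a.ClientStatus)
		}
	}
}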

@diptanu
Contributor

diptanu commented Jul 15, 2016

The scheduler internally computes whether an allocation is in terminal state or not by looking at both the desired state and client state. In the above case the allocation 1ab36d87 would be perceived by the scheduler as terminal since the client status is in failed.
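
As an illustration of the logic described above, a minimal sketch (not Nomad's actual source; the status strings are the documented desired and client status values):

package main

import "fmt"

// terminal reports whether an allocation is in a terminal state by combining
// the desired status and the client status, as described above. This is a
// sketch of the logic, not Nomad's implementation.
func terminal(desiredStatus, clientStatus string) bool {
	switch desiredStatus {
	case "stop", "evict":
		return true
	}
	switch clientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

func main() {
	// desired=run, client=failed: treated as terminal, which is why an
	// allocation like 1ab36d87 is not rescheduled when the job is resubmitted.
	fmt.Println(terminal("run", "failed")) // true
}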

This is the current state and we are discussing internally if we can get rid of one of the states but it's not super high priority.

Regarding the problem you are trying to solve: very soon you should be able to summarize the state of a job via a new API, and you won't have to look at all the allocations to determine how many are running. Please take a look at #1340; hoping that would work for you?

@camerondavison
Contributor Author

I am not sure that #1340 would really help me. I am basically looking for things that are supposed to have 2 instances running but instead only have 1. I would not want to alert on anything in a failed state, since that is somewhat expected. I really only want to alert on things that are supposed to be running but are in (as you said) a "terminal" state.

The scheduler internally computes whether an allocation is in terminal state or not by looking at both the desired state and client state. In the above case the allocation 1ab36d87 would be perceived by the scheduler as terminal since the client status is in failed.

All this ticket is about is transitioning the task to some other state when the job is re-submitted, instead of waiting for it to be GC'd. If I run force/gc on Nomad, this allocation does get removed because of the "terminal" state it is in.

Ah, which got me thinking: what I am doing is actually not safe at all. If I change the count to 2, remove the image with docker rmi -f example:docker (by name instead of image ID), and then run /v1/system/gc in the middle of the above example, I end up with:

$ nomad status echoer
ID          = echoer-docker
Name        = echoer-docker
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
ed3652cb  9b9f2719  688fdcf4  group       run      running

Without prior knowledge that the group is actually supposed to have count 2, this looks like everything is happy.
Clearly it is not, though:

$ nomad plan echoer-docker.nomad
Job: "echoer-docker"
Task Group: "group" (1 create, 1 in-place update)
  Task: "echoer"

Maybe my original thought about the Desired state needing to be changed was wrong, but I feel like it would be nice if some state somewhere were changed to signal that this job is not running fully as it should be.
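
As an illustration of the kind of signal being asked for, here is a minimal Go sketch that compares each task group's declared count against its running allocations, using the /v1/job/:job_id and /v1/job/:job_id/allocations endpoints (agent address, job ID, and field names are assumptions based on the Nomad HTTP API; error handling trimmed):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type taskGroup struct {
	Name  string
	Count int
}

type jobSpec struct {
	TaskGroups []taskGroup
}

type allocStub struct {
	TaskGroup    string
	ClientStatus string
}

// getJSON fetches a URL and decodes the JSON response into out.
func getJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://127.0.0.1:4646" // assumes a local Nomad agent

	var job jobSpec
	if err := getJSON(base+"/v1/job/echoer-docker", &job); err != nil {
		panic(err)
	}

	var allocs []allocStub
	if err := getJSON(base+"/v1/job/echoer-docker/allocations", &allocs); err != nil {
		panic(err)
	}

	// Count running allocations per task group.
	running := map[string]int{}
	for _, a := range allocs {
		if a.ClientStatus == "running" {
			running[a.TaskGroup]++
		}
	}

	// Alert when a group has fewer running allocations than its declared count.
	for _, tg := range job.TaskGroups {
		if running[tg.Name] < tg.Count {
			fmt.Printf("alert: group %q has %d/%d running\n",
				tg.Name, running[tg.Name], tg.Count)
		}
	}
}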

@mikenomitch
Contributor

I don't think we'll change the meaning of desired, but I do think the root ask of this Issue is good.

I'm going to close this in favor of #13053 since I think the new Issue has a bit less noise.

mikenomitch closed this as not planned on May 18, 2022
@github-actions

github-actions bot commented Oct 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this as resolved and limited conversation to collaborators on Oct 7, 2022