Stuck allocation on dead job #806

Closed · far-blue opened this issue Feb 16, 2016 · 7 comments

@far-blue

I'm new to all this, so maybe I've just missed something, but I appear to have an orphan allocation from a dead job that failed to start completely.

Context: Running v0.3.0-rc1 in the dev environment created by the included Vagrantfile, with the agent in -dev mode (combined server/client mode).

I started from the example.nomad file created with nomad init, modified the existing task to run a mysql container, and added a second task to run an apache container. I started the job with nomad run, but it failed to complete because I'd typo'd the apache container image name.
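
A minimal sketch of that kind of job file (not my exact file; the resource values are placeholders and the port settings are omitted):

job "example" {
  datacenters = ["dc1"]
  type = "service"

  group "cache" {
    count = 1

    task "mysql" {
      driver = "docker"
      config {
        image = "mysql:latest"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }

    task "apache" {
      driver = "docker"
      config {
        # The typo'd image name: there is no official "apache" image on
        # Docker Hub, which is what causes the pull failure shown below.
        image = "apache:latest"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}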

At this point I had a mysql container running but no apache container. So I edited the job to correct my typo and called nomad run again. My understanding was that it would evaluate the difference and just start the apache container (because the mysql container was already running).

However, it actually re-evaluated the entire job and started both the apache container and a second mysql container, while leaving the original container running. Note that I had not changed the name of the job or the task group (I left them as example and cache, as per the original job config).

So I called nomad stop, thinking it would clean everything up, but it only stopped the new containers, leaving the original mysql container. I thought maybe nomad had 'forgotten' about it, so I killed it with Docker directly - but nomad put it back.

So now I have a mysql container that nomad is keeping alive but no job to control it with.

> nomad status example
No job(s) with prefix or id "example" found
> docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
266885ac1ee4        mysql:latest        "/entrypoint.sh mysql"   33 minutes ago      Up 33 minutes       127.0.0.1:23968->3306/tcp, 127.0.0.1:23968->3306/udp   mysql-bc506dfd-6351-ab4e-ad23-95c3fd971baa
> nomad alloc-status bc506dfd
ID              = bc506dfd
Eval ID         = 3226e9b9
Name            = example.cache[0]
Node ID         = f8e6eacc
Job ID          = example
Client Status   = failed
Evaluated Nodes = 1
Filtered Nodes  = 0
Exhausted Nodes = 0
Allocation Time = 2.21072ms
Failures        = 0

==> Task "apache" is "dead"
Recent Events:
Time                   Type            Description
16/02/16 21:33:52 UTC  Driver Failure  failed to start: Failed to pull `apache:latest`: Error: image library/apache not found

==> Task "mysql" is "running"
Recent Events:
Time                   Type        Description
16/02/16 21:49:12 UTC  Started     <none>
16/02/16 21:48:39 UTC  Terminated  Exit Code: 0
16/02/16 21:34:23 UTC  Started     <none>

==> Status
Allocation "bc506dfd" status "failed" (0/1 nodes filtered)
  * Score "f8e6eacc-46f7-18b0-df52-350346732e60.binpack" = 7.683003

So I'm not quite sure what to do next, and I'm pretty certain this is not expected behaviour.

Any thoughts, anyone?

@diptanu
Contributor

diptanu commented Feb 16, 2016

@far-blue I could reproduce this, and thanks for reporting.

@dgshep

dgshep commented Feb 25, 2016

I ran into this exact issue when trying out 0.3.0-rc2. As far as I can tell, the only way to clear out the orphaned allocation is to clobber the nomad servers and remove all existing state :/
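
Roughly what that cleanup looks like, assuming a systemd-managed agent and a data_dir of /opt/nomad/data (adjust both to your setup):

sudo systemctl stop nomad             # or kill the -dev agent process
sudo rm -rf /opt/nomad/data           # wipe server and client state (whatever data_dir points to)
docker rm -f <orphaned-container-id>  # remove the leftover container by hand
sudo systemctl start nomad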

@dgshep

dgshep commented Mar 7, 2016

@diptanu, is there someone actively working on this? If not, I would be willing to take a crack at it.

@diptanu
Contributor

diptanu commented Mar 17, 2016

@dgshep Yes! We might be able to tackle this in the next release.

@dgshep

dgshep commented Mar 17, 2016

Very cool. BTW Congrats on the C1M project! Stellar stuff...

@rickardrosen

I am seeing this on Nomad v0.5.4.

I had a job that no longer exists, with an allocation stuck on a node trying to pull a container image that no longer exists and receiving a 400 from the registry.
It's been doing this for a couple of weeks without getting cleaned up, so tonight I decided to restart the nomad agent, which allowed the task to be killed.
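
For reference, the restart was nothing fancier than bouncing the agent on the affected client node (assuming a systemd-managed install; adjust for however your agent is supervised):

sudo systemctl restart nomad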

Is it a regression, or have I triggered something completely new for some reason?

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 15, 2022