
CSI volume keeps references to failed allocations #8145

Closed
mkrueger-sabio opened this issue Jun 10, 2020 · 10 comments

@mkrueger-sabio

Nomad version

0.11.3

Issue

  1. I built my own CSI plugin which mounts volumes from Gluster
  2. I ran the CSI plugin and registered a new volume
  3. I ran a job which uses the volume

I don't know exactly what I did afterwards. Probably I removed the volume while the job was still pending. The problem now is that I cannot remove the volume because it is still pending, and it stays that way.

[Screenshot: volume stuck with a pending allocation]

I have stopped the plugin and the job, but the allocation is still there. It is only visible in the UI; when I query the allocation with the Nomad CLI, I cannot find it.

I tried running garbage collection and restarting the client and server, but nothing changed.

How can I remove the allocation?

@mkrueger-sabio
Author

The problem seems to be that allocations from failed jobs are not removed from the volume. I could reproduce the problem with these steps:

  1. Register a volume
  2. Run a job which uses the volume
  3. The job fails for some reason and is in state pending or failed
  4. Stop the job with nomad stop -purge

=> The volume still holds the allocation for the job.

In my case I had registered a volume that could not be mounted, because no volume exists for the given external ID.
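The broken state described above starts with a volume registration like the following. This is an illustrative sketch only; the IDs, name, and plugin are assumptions, not taken from the issue. The key detail is external_id pointing at a backing volume that does not actually exist:

```hcl
# volume.hcl -- hypothetical CSI volume registration (Nomad 0.11 syntax).
# All identifiers here are placeholders.
type      = "csi"
id        = "gluster-vol0"
name      = "gluster-vol0"
plugin_id = "glusterfs"

# If no Gluster volume exists with this external ID, every mount attempt
# fails -- yet Nomad still recorded claims against the registered volume.
external_id = "vol0"

access_mode     = "single-node-writer"
attachment_mode = "file-system"
```

Registering with `nomad volume register volume.hcl`, running a job that claims it, and then stopping the job with `nomad stop -purge` reproduces the stuck claim.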

@mkrueger-sabio mkrueger-sabio changed the title allocation got stuck in state pending CSI volume keeps references to failed allocations Jun 12, 2020
@tgross tgross self-assigned this Jun 22, 2020
@tgross
Member

tgross commented Jun 22, 2020

Hi @mkrueger-sabio! Thanks for opening this issue. This is definitely unexpected behavior and I'll be digging into this.

It is only visible in the UI. When I query the allocation with the nomad client I cannot find it.

This is an interesting detail. Can the volume be claimed by a new allocation at this point, or does the Nomad server still think it has a claim? Never mind, I see the answer below:

In my case I registered a volume which could not be mounted because a volume for the external id does not exist.

So what we end up with is a Nomad-registered volume that has no physical counterpart, but because of that it can't clean up the allocs that claimed it? It shouldn't be possible to write the claim in that case, but that may be where the bug is.

@anthonymq

I encountered much the same problem, but with "running" allocations that I stopped.
#8285

@RickyGrassmuck
Contributor

Running into the same issue. Volumes reference a non-existent allocation and cannot be removed. I'm not sure of any way to manually force these volumes out of existence (the deregister force option unfortunately doesn't work), so I'm assuming they will be stuck there until a fix is released.

@tgross
Member

tgross commented Jul 24, 2020

Hey folks, just FYI: we shipped nomad volume deregister -force in 0.12.0, which might help you out here. In the meantime, we're getting ramped up to wrap up these remaining CSI issues over the next couple of weeks, so hopefully we should have some progress for you soon.
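The new command can be sketched as below. This requires a live Nomad 0.12.0+ cluster, so it is not runnable standalone, and the volume ID is a placeholder:

```shell
# Force-deregister a volume whose claims can no longer be released
# (command shipped in Nomad 0.12.0).
# "gluster-vol0" is a placeholder volume ID -- substitute your own,
# e.g. from the output of `nomad volume status`.
nomad volume deregister -force gluster-vol0
```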

@mkrueger-sabio
Author

Thanks, this helped remove a lot of volumes.

I still have the problem that I cannot remove a volume that holds a pending allocation, even though the allocation no longer exists.

@tgross
Member

tgross commented Jul 27, 2020

Understood. I'm pretty sure I know what's going on there and I'm working on a fix for this set of problems.

@tgross
Member

tgross commented Aug 7, 2020

Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release.

I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via nomad volume detach once that's merged.
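The escape hatch mentioned above could look like the following sketch once #8584 is merged. Both arguments are placeholders, and the command needs a live cluster, so this is illustrative rather than runnable:

```shell
# Manually release a stuck claim by detaching the volume from a node.
# "gluster-vol0" and "5a9e2f34" are placeholder volume and node IDs;
# the node ID can be found via `nomad node status`.
nomad volume detach gluster-vol0 5a9e2f34
```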

@tgross
Member

tgross commented Aug 10, 2020

For sake of our planning, I'm going to close this issue. We'll continue to track progress of this set of problems in #8100.

@github-actions

github-actions bot commented Nov 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022