CSI: Volume GC Evaluation Fails on Deregistered Volumes #8100

Closed
kainoaseto opened this issue Jun 2, 2020 · 8 comments

@kainoaseto

Nomad version

Nomad servers and clients both running this version
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64

1 or 3 Nomad servers (have tested with both sizes of clusters)

Issue

After deregistering a volume, the CSIVolumeGC evaluation continues to run against that volume and fails with "volume not found". This happens consistently on the cluster I've been using for CSI testing. It seems to be persisted somewhere in the Raft state: I've tried restarting the cluster, resizing the cluster, and even modifying the evaluation code to always pass on these volume failures, but after restarting with the 0.11.2 code these old volumes still fail to be GC'd.

We noticed this when it took down our development servers: we had deregistered quite a few volumes, and on server startup the leader tries to process all of them and runs out of CPU.

Reproduction steps

  1. Follow the guide here

  2. Run nomad volume deregister mysql

  3. The Nomad server logs will periodically have the errors below with seemingly no way to stop them.

Nomad Server logs (if appropriate)

nomad.fsm: CSIVolumeClaim failed: error="volume not found: mysql"
worker: error invoking scheduler: error="failed to process evaluation: volume not found: mysql"
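
To get a rough idea of how many deregistered volumes are still being retried, the "volume not found" errors above can be tallied per volume ID. This is only a minimal sketch: the function name `count_missing_volumes` and the regular expression are illustrative assumptions based on the two log lines quoted here, not part of Nomad or its tooling.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the log lines quoted in this issue, e.g.
#   error="volume not found: mysql"
VOLUME_NOT_FOUND = re.compile(r'volume not found: ([^"\s]+)')


def count_missing_volumes(lines):
    """Return a Counter mapping volume ID -> number of "volume not found" lines."""
    counts = Counter()
    for line in lines:
        match = VOLUME_NOT_FOUND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two server log lines from this report, used as sample input.
sample = [
    'nomad.fsm: CSIVolumeClaim failed: error="volume not found: mysql"',
    'worker: error invoking scheduler: error="failed to process evaluation: '
    'volume not found: mysql"',
]

for volume_id, hits in count_missing_volumes(sample).items():
    print(volume_id, hits)
```

The same function accepts an open file object, so it could be pointed at a saved copy of the server log to list every volume ID the GC evaluation keeps failing on.
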
@tgross
Member

tgross commented Jun 8, 2020

Hey @kainoaseto, we just released 0.11.3, which has some improvements to the GC loop. Can you give that a try and see whether it cleans these up?

@tsarna

tsarna commented Jul 18, 2020

I'm running a mix of 0.11.3 and 0.12.0 currently. The leader is 0.12.0 at the moment, and I see in its logs:

2020-07-18T20:26:52.021Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval "799729e4-90b9-9646-4fe2-6b2a39e26508" JobID: "csi-volume-claim-gc:data-test" Namespace: "default">"
2020-07-18T20:26:52.380Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume not found: data-test"

@tgross
Member

tgross commented Aug 7, 2020

Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:

I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via nomad volume detach once that's merged.

@kainoaseto
Author

Thank you @tgross! This is a really exciting development, and we are looking forward to testing out CSI again when 0.12.2 drops. We really appreciate the follow-up on these issues and all the work you have all done to stabilize CSI; this is what keeps us coming back to Nomad time and again.

@tsarna

tsarna commented Aug 7, 2020

Thank you!

Just to note: the volume I see in the log message above doesn't appear in nomad volume status, so I don't know whether nomad volume detach would help me, but hopefully one of the other changes will fix it (#8605 perhaps?).

@tgross
Member

tgross commented Aug 10, 2020

I've closed #8285, #8145, and #8057 as duplicates of this issue; I'll continue to collect status updates here as we wrap up testing for 0.12.2.

@tgross
Member

tgross commented Aug 11, 2020

Testing for 0.12.2 looks good. Going to close this issue out, and 0.12.2 will be shipped shortly.

@github-actions

github-actions bot commented Nov 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022