CSI plugin expected controllers exceeds actual controller count #12771

Closed
iSchluff opened this issue Apr 25, 2022 · 7 comments · Fixed by #12774
Comments

@iSchluff

Nomad version

Nomad v1.3.0-beta.1 (2eba643)

Operating system and Environment details

Ubuntu 20.04.4 on x64

Issue

Due to operator error I registered CSI nodes as controllers, so the expected controller count on the plugin is now too high.

#  nomad plugin status
Container Storage Interface
ID        Provider             Controllers Healthy/Expected  Nodes Healthy/Expected
ceph-csi  cephfs.csi.ceph.com  1/4                           3/3

Nomad still shows the plugin as healthy and the volumes as schedulable; however, waiting for the plugin to become healthy via Terraform fails:

data "nomad_plugin" "ceph" {
  plugin_id        = "ceph-csi"
  wait_for_healthy = true
}
2022-04-25T12:25:32.941+0200 [TRACE] provider.terraform-provider-nomad_v1.4.16_x4: plugin received interrupt signal, ignoring: count=2 timestamp=2022-04-25T12:25:32.941+0200
2022-04-25T12:25:33.009+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [DEBUG] Getting plugin "ceph-csi"...
2022-04-25T12:25:33.065+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [DEBUG] plugin ceph-csi not yet healthy: 1/4 controllers healthy  3/3 nodes healthy
2022-04-25T12:25:33.065+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [TRACE] Waiting 10s before next try
^C2022-04-25T12:25:34.397+0200 [TRACE] provider.terraform-provider-nomad_v1.4.16_x4: plugin received interrupt signal, ignoring: count=3 timestamp=2022-04-25T12:25:34.397+0200
2022-04-25T12:25:35.967+0200 [TRACE] dag/walk: vertex "module.nomad-registry.nomad_volume.registry (expand)" is waiting for "module.nomad-registry.data.nomad_plugin.ceph (expand)"
2022-04-25T12:25:36.275+0200 [TRACE] dag/walk: vertex "module.nomad-registry (close)" is waiting for "module.nomad-registry.nomad_volume.registry (expand)"
2022-04-25T12:25:36.648+0200 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/nomad\"] (close)" is waiting for "module.nomad-registry.nomad_volume.registry (expand)"
2022-04-25T12:25:37.349+0200 [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/nomad\"] (close)"

Is it possible to reduce the expected controller count without recreating the cluster?

Reproduction steps

Start more csi controllers than you need, stop some of them.
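The misconfiguration can be reproduced with a jobspec along these lines (a hypothetical sketch; the job name, image, and most attributes are illustrative, and a real ceph-csi deployment needs more configuration):

```hcl
# Hypothetical repro sketch: the per-node plugin tasks mistakenly declare
# type = "controller", inflating the plugin's expected controller count.
job "ceph-csi-node" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/cephcsi/cephcsi:v3.6.1" # illustrative image/tag
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "controller" # should have been "node"
        mount_dir = "/csi"
      }
    }
  }
}
```

Running this, then flipping `type` back to `"node"` and re-running, leaves the plugin's expected controller count stuck above the actual count.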

Expected Result

The expected controller count should probably never exceed the actual count; arguably it should never exceed 1 here?

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@tgross tgross self-assigned this Apr 25, 2022
@tgross
Member

tgross commented Apr 25, 2022

Hi @iSchluff! I want to make sure I can reproduce this scenario accurately:

Due to operator error I registered csi nodes as controllers, therefore my expected controller count on the plugin is now too high.
...
Start more csi controllers than you need, stop some of them.

So you had a job that was registered as a controller, and then changed it to node and re-ran? Did the old controller allocations get replaced entirely by node allocations or did they get in-place updates?

@iSchluff
Author

Yes, exactly. I accidentally ran node tasks as controller tasks. The old allocations got replaced by an in-place job update.
Just for completeness: I was already running the tasks as nodes before, replaced the CSI configuration, and then went back.
So basically an in-place node > controller > node change.

@tgross
Member

tgross commented Apr 25, 2022

Ok, thank you. So there are likely two bugs here: the counts aren't being reset properly, but also we should have replaced the tasks entirely in that case rather than doing an in-place update. I'll see if I can put together a quick repro and report back.

@tgross
Member

tgross commented Apr 25, 2022

I was able to reproduce the non-destructive update pretty easily just by switching the csi_plugin.type. #12774 should fix this and prevent the invariants the counts expect from being violated.
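Conceptually, the fix forces a change to the `csi_plugin` block to be a destructive update. A minimal sketch of that invariant in Go (simplified stand-in types, not the actual Nomad scheduler code):

```go
package main

import "fmt"

// TaskCSIPlugin is a simplified stand-in for Nomad's csi_plugin block
// (illustrative types only, not the real scheduler structs).
type TaskCSIPlugin struct {
	ID   string
	Type string // "node", "controller", or "monolith"
}

// requiresDestructiveUpdate sketches the invariant: if the csi_plugin
// block is added, removed, or its ID/Type changes, the allocation must
// be replaced rather than updated in place, so stale controllers are
// deregistered and the expected counts stay consistent.
func requiresDestructiveUpdate(prev, next *TaskCSIPlugin) bool {
	if prev == nil || next == nil {
		return prev != next // plugin block added or removed entirely
	}
	return prev.ID != next.ID || prev.Type != next.Type
}

func main() {
	before := &TaskCSIPlugin{ID: "ceph-csi", Type: "controller"}
	after := &TaskCSIPlugin{ID: "ceph-csi", Type: "node"}
	fmt.Println(requiresDestructiveUpdate(before, after)) // prints: true
}
```

With this check in place, the node > controller > node flip described above would replace the allocations each time instead of updating them in place.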

@tgross
Member

tgross commented Apr 25, 2022

I've merged that fix and it'll ship in the GA release of Nomad 1.3.0. Thanks for opening this issue @iSchluff!

@iSchluff
Author

thanks for the quick response

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022